fsync method checking

From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Solaris Performance (Again)
Date: 2003-12-10 05:56:38
Message-ID: 3FD6B596.8090803@paradise.net.nz
Lists: pgsql-hackers pgsql-performance

This is a well-worn thread title - apologies, but these results seemed
interesting, and hopefully useful in the quest to get better performance
on Solaris:

I was curious to see if the rather uninspiring pgbench performance
obtained from a Sun 280R (see General: ATA Disks and RAID controllers
for database servers) could be improved if more time was spent
tuning.

With the help of a fellow workmate who is a bit of a Solaris guy, we
decided to have a go.

The major performance killer appeared to be mounting the filesystem with
the logging option. The next most significant seemed to be the choice of
sync_method for Pg - the default (open_datasync), which we initially
thought should be the best - appears noticeably slower than fdatasync.

We also tried changing some of the tuneable filesystem options using
tunefs - without any measurable effect.

Are Pg/Solaris folks out there running with logging on and the default
sync_method, or have most of you been through this already?

Pgbench results (number of clients and transactions/s):

Setup 1: filesystem mounted with logging

Clients  tps
------------
1        17
2        17
4        22
8        22
16       28
32       32
64       37

Setup 2: filesystem mounted without logging

Clients  tps
------------
1        48
2        55
4        57
8        62
16       65
32       82
64       95

Setup 3: filesystem mounted without logging, Pg sync_method = fdatasync

Clients  tps
------------
1        89
2        94
4        95
8        93
16       99
32       115
64       122

Note: the pgbench runs used -s 10 and -t 1000, with -c from 1 to 64;
2 to 3 runs of each setup were performed (averaged figures shown).

Mark


From: Jeff <threshar(at)torgo(dot)978(dot)org>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Solaris Performance (Again)
Date: 2003-12-10 13:53:23
Message-ID: 20031210085323.749a549f.threshar@torgo.978.org
Lists: pgsql-hackers pgsql-performance

On Wed, 10 Dec 2003 18:56:38 +1300
Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> wrote:

> The major performance killer appeared to be mounting the filesystem
> with the logging option. The next most significant seemed to be the
> choice of sync_method for Pg - the default (open_datasync), which we
> initially thought should be the best - appears noticeably slower than
> fdatasync.
>

Some interesting stuff, I'll have to play with it. Currently I'm pleased
with my Solaris performance.

What version of PG?

If it is before 7.4, PG compiles with _NO_ optimization by default, which
was a huge part of the slowness of PG on Solaris.

--
Jeff Trout <jeff(at)jefftrout(dot)com>
http://www.jefftrout.com/
http://www.stuarthamm.net/


From: Neil Conway <neilc(at)samurai(dot)com>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Solaris Performance (Again)
Date: 2003-12-10 19:15:35
Message-ID: 87smjs4h7s.fsf@mailbox.samurai.com
Lists: pgsql-hackers pgsql-performance

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Note : The Pgbench runs were conducted using -s 10 and -t 1000 -c
> 1->64, 2 - 3 runs of each setup were performed (averaged figures
> shown).

FYI, the pgbench docs state:

NOTE: scaling factor should be at least as large as the largest
number of clients you intend to test; else you'll mostly be
measuring update contention.

-Neil


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Jeff <threshar(at)torgo(dot)978(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Solaris Performance (Again)
Date: 2003-12-11 06:04:15
Message-ID: 3FD808DF.9080504@paradise.net.nz
Lists: pgsql-hackers pgsql-performance

Good point -

It is Pg 7.4beta1, compiled with

CFLAGS += -O2 -funroll-loops -fexpensive-optimizations

Jeff wrote:

>
>What version of PG?
>
>If it is before 7.4 PG compiles with _NO_ optimization by default and
>was a huge part of the slowness of PG on solaris.
>
>
>
>


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Solaris Performance (Again)
Date: 2003-12-11 06:09:47
Message-ID: 3FD80A2B.2010601@paradise.net.nz
Lists: pgsql-hackers pgsql-performance

Yes - originally I was going to stop at 8 clients, but once the bit was
between the teeth... If I get another box to myself I will try -s 50 or
100 and see what that shows up.

cheers

Mark

Neil Conway wrote:

> FYI, the pgbench docs state:
>
> NOTE: scaling factor should be at least as large as the largest
> number of clients you intend to test; else you'll mostly be
> measuring update contention.
>
>-Neil
>
>
>


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: fsync method checking
Date: 2003-12-12 06:49:26
Message-ID: 200312120649.hBC6nQR15608@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Mark Kirkwood wrote:
> This is a well-worn thread title - apologies, but these results seemed
> interesting, and hopefully useful in the quest to get better performance
> on Solaris:
>
> I was curious to see if the rather uninspiring pgbench performance
> obtained from a Sun 280R (see General: ATA Disks and RAID controllers
> for database servers) could be improved if more time was spent
> tuning.
>
> With the help of a fellow workmate who is a bit of a Solaris guy, we
> decided to have a go.
>
> The major performance killer appeared to be mounting the filesystem with
> the logging option. The next most significant seemed to be the choice of
> sync_method for Pg - the default (open_datasync), which we initially
> thought should be the best - appears noticeably slower than fdatasync.

I thought the default was fdatasync, but looking at the code it seems
the default is open_datasync if O_DSYNC is available.

I assume the logic is that we usually do only one write() before
fsync(), so open_datasync should be faster. Why do we not use O_FSYNC
over fsync()?

Looking at the code:

#if defined(O_SYNC)
#define OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
#define OPEN_SYNC_FLAG O_FSYNC
#endif
#endif

#if defined(OPEN_SYNC_FLAG)
#if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
#define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#endif

#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
#define DEFAULT_SYNC_FLAGBIT OPEN_DATASYNC_FLAG
#else
#if defined(HAVE_FDATASYNC)
#define DEFAULT_SYNC_METHOD_STR "fdatasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
#define DEFAULT_SYNC_FLAGBIT 0
#else
#define DEFAULT_SYNC_METHOD_STR "fsync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
#define DEFAULT_SYNC_FLAGBIT 0
#endif
#endif

I think the problem is that we prefer O_DSYNC over fdatasync, but do not
prefer O_FSYNC over fsync.

Running the attached test program shows on BSD/OS 4.3:

write 0.000360
write & fsync 0.001391
write, close & fsync 0.001308
open o_fsync, write 0.000924

showing O_FSYNC faster than fsync().

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Attachment Content-Type Size
unknown_filename text/plain 2.6 KB

From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2003-12-12 20:54:34
Message-ID: 3FDA2B0A.1060709@colorfullife.com
Lists: pgsql-hackers pgsql-performance

Bruce Momjian wrote:

> write 0.000360
> write & fsync 0.001391
> write, close & fsync 0.001308
> open o_fsync, write 0.000924
>
>
That's 1 millisecond vs. 1.3 milliseconds. Neither value is realistic -
I guess the hardware write cache is on and the OS doesn't issue cache
flush commands. Realistic values are probably 5 ms vs. 5.3 ms - 6%, not
30%. How large is the syscall latency with BSD/OS 4.3?

One advantage of a separate write and fsync call is better performance
for the writes that are triggered within AdvanceXLInsertBuffer: I'm not
sure how often that's necessary, but it's a write while holding both the
WALWriteLock and WALInsertLock. If every write contains an implicit
sync, that call would be much more expensive than necessary.

--
Manfred


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Manfred Spraul <manfred(at)colorfullife(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2003-12-12 21:28:47
Message-ID: 22121.1071264527@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Manfred Spraul <manfred(at)colorfullife(dot)com> writes:
> One advantage of a separate write and fsync call is better performance
> for the writes that are triggered within AdvanceXLInsertBuffer: I'm not
> sure how often that's necessary, but it's a write while holding both the
> WALWriteLock and WALInsertLock. If every write contains an implicit
> sync, that call would be much more expensive than necessary.

Ideally that path isn't taken very often. But I'm currently having a
discussion off-list with a CMU student who seems to be seeing a case
where it happens a lot. (She reports that both WALWriteLock and
WALInsertLock are causes of a lot of process blockages, which seems to
mean that a lot of the WAL I/O is being done with both held, which would
have to mean that AdvanceXLInsertBuffer is doing the I/O. More when we
figure out what's going on exactly...)

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 17:46:13
Message-ID: 200403181746.i2IHkDA00975@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance


I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

What I did was to beef up my test program and get it into CVS for folks
to run. What I found was that different operating systems have
different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was
better, but on Linux, O_DSYNC/O_SYNC was faster.

BSD/OS 4.3:
Simple write timing:
write 0.000055

Compare fsync before and after write's close:
write, fsync, close 0.000707
write, close, fsync 0.000808

Compare one o_sync write to two:
one 16k o_sync write 0.009762
two 8k o_sync writes 0.008799

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000658
(fdatasync unavailable)
write, fsync, 0.000702

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 0.010402
(fdatasync unavailable)
write, fsync, 0.001025

This shows terrible O_SYNC performance for two 8k writes, yet O_SYNC is
faster for a single 8k write. Strange.

FreeBSD 4.9:
Simple write timing:
write 0.000083

Compare fsync before and after write's close:
write, fsync, close 0.000412
write, close, fsync 0.000453

Compare one o_sync write to two:
one 16k o_sync write 0.000409
two 8k o_sync writes 0.000993

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000683
(fdatasync unavailable)
write, fsync, 0.000405

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000789
(fdatasync unavailable)
write, fsync, 0.000414

This shows fsync to be fastest in both cases.

Linux 2.4.9:
Simple write timing:
write 0.000061

Compare fsync before and after write's close:
write, fsync, close 0.000398
write, close, fsync 0.000407

Compare one o_sync write to two:
one 16k o_sync write 0.000570
two 8k o_sync writes 0.000340

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000166
write, fdatasync 0.000462
write, fsync, 0.000447

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000334
write, fdatasync 0.000445
write, fsync, 0.000447

This shows O_SYNC to be fastest, even for 2 8k writes.

This unapplied patch:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/fsync

adds DEFAULT_OPEN_SYNC to the bsdi/freebsd/linux template files, which
controls the default for those platforms. Platforms with no template
default to fdatasync/fsync.

Would other users run src/tools/fsync and report their findings so I can
update the template files for their OS's? This is a process similar to
our thread testing.

Thanks.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> Mark Kirkwood wrote:
> > This is a well-worn thread title - apologies, but these results seemed
> > interesting, and hopefully useful in the quest to get better performance
> > on Solaris:
> >
> > I was curious to see if the rather uninspiring pgbench performance
> > obtained from a Sun 280R (see General: ATA Disks and RAID controllers
> > for database servers) could be improved if more time was spent
> > tuning.
> >
> > With the help of a fellow workmate who is a bit of a Solaris guy, we
> > decided to have a go.
> >
> > The major performance killer appeared to be mounting the filesystem with
> > the logging option. The next most significant seemed to be the choice of
> > sync_method for Pg - the default (open_datasync), which we initially
> > thought should be the best - appears noticeably slower than fdatasync.
>
> I thought the default was fdatasync, but looking at the code it seems
> the default is open_datasync if O_DSYNC is available.
>
> I assume the logic is that we usually do only one write() before
> fsync(), so open_datasync should be faster. Why do we not use O_FSYNC
> over fsync().
>
> Looking at the code:
>
> #if defined(O_SYNC)
> #define OPEN_SYNC_FLAG O_SYNC
> #else
> #if defined(O_FSYNC)
> #define OPEN_SYNC_FLAG O_FSYNC
> #endif
> #endif
>
> #if defined(OPEN_SYNC_FLAG)
> #if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
> #define OPEN_DATASYNC_FLAG O_DSYNC
> #endif
> #endif
>
> #if defined(OPEN_DATASYNC_FLAG)
> #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> #define DEFAULT_SYNC_FLAGBIT OPEN_DATASYNC_FLAG
> #else
> #if defined(HAVE_FDATASYNC)
> #define DEFAULT_SYNC_METHOD_STR "fdatasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
> #define DEFAULT_SYNC_FLAGBIT 0
> #else
> #define DEFAULT_SYNC_METHOD_STR "fsync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
> #define DEFAULT_SYNC_FLAGBIT 0
> #endif
> #endif
>
> I think the problem is that we prefer O_DSYNC over fdatasync, but do not
> prefer O_FSYNC over fsync.
>
> Running the attached test program shows on BSD/OS 4.3:
>
> write 0.000360
> write & fsync 0.001391
> write, close & fsync 0.001308
> open o_fsync, write 0.000924
>
> showing O_FSYNC faster than fsync().
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania 19073

> /*
> * test_fsync.c
> * tests if fsync can be done from another process than the original write
> */
>
> #include <sys/types.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/time.h> /* gettimeofday(), struct timeval */
> #include <time.h>
> #include <unistd.h>
>
> void die(char *str);
> void print_elapse(struct timeval start_t, struct timeval elapse_t);
>
> int main(int argc, char *argv[])
> {
> struct timeval start_t;
> struct timeval elapse_t;
> int tmpfile;
> char *strout = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
>
> /* write only */
> gettimeofday(&start_t, NULL);
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> write(tmpfile, strout, 200);
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("write ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> /* write & fsync */
> gettimeofday(&start_t, NULL);
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> write(tmpfile, strout, 200);
> fsync(tmpfile);
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("write & fsync ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> /* write, close & fsync */
> gettimeofday(&start_t, NULL);
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> write(tmpfile, strout, 200);
> close(tmpfile);
> /* reopen file */
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> fsync(tmpfile);
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("write, close & fsync ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> /* open_fsync, write */
> gettimeofday(&start_t, NULL);
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT | O_FSYNC, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> write(tmpfile, strout, 200);
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("open o_fsync, write ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> return 0;
> }
>
> void print_elapse(struct timeval start_t, struct timeval elapse_t)
> {
> if (elapse_t.tv_usec < start_t.tv_usec)
> {
> elapse_t.tv_sec--;
> elapse_t.tv_usec += 1000000;
> }
>
> printf("%ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
> (long) (elapse_t.tv_usec - start_t.tv_usec));
> }
>
> void die(char *str)
> {
> fprintf(stderr, "%s", str);
> exit(1);
> }


--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fsync method checking
Date: 2004-03-18 18:23:14
Message-ID: 4059E912.2050509@dunslane.net
Lists: pgsql-hackers pgsql-performance

Bruce Momjian wrote:

>I have been poking around with our fsync default options to see if I can
>improve them. One issue is that we never default to O_SYNC, but default
>to O_DSYNC if it exists, which seems strange.
>
>What I did was to beef up my test program and get it into CVS for folks
>to run. What I found was that different operating systems have
>different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was
>better, but on Linux, O_DSYNC/O_SYNC was faster.
>
>[snip]
>
>Linux 2.4.9:
>
>

This is a pretty old kernel (I am writing from a machine running 2.4.22).

Before we do this for Linux, testing on a more modern kernel might be
wise.

cheers

andrew


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fsync method checking
Date: 2004-03-18 18:40:43
Message-ID: 200403181840.i2IIeh309334@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Andrew Dunstan wrote:
>
>
> Bruce Momjian wrote:
>
> >I have been poking around with our fsync default options to see if I can
> >improve them. One issue is that we never default to O_SYNC, but default
> >to O_DSYNC if it exists, which seems strange.
> >
> >What I did was to beef up my test program and get it into CVS for folks
> >to run. What I found was that different operating systems have
> >different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was
> >better, but on Linux, O_DSYNC/O_SYNC was faster.
> >
> >[snip]
> >
> >Linux 2.4.9:
> >
> >
>
> This is a pretty old kernel (I am writing from a machine running 2.4.22)
>
> Maybe before we do this for Linux testing on a more modern kernel might
> be wise.

Sure, I am sure someone will post results.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 18:44:02
Message-ID: 11043.1079635442@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> I have been poking around with our fsync default options to see if I can
> improve them. One issue is that we never default to O_SYNC, but default
> to O_DSYNC if it exists, which seems strange.

As I recall, that was based on testing on some different platforms.
It's not particularly "strange": O_SYNC implies writing at least two
places on the disk (file and inode). O_DSYNC or fdatasync should
theoretically be the fastest alternatives, O_SYNC and fsync the worst.

> Compare fsync before and after write's close:
> write, fsync, close 0.000707
> write, close, fsync 0.000808

What does that mean? You can't fsync a closed file.

> This shows terrible O_SYNC performance for 2 8k writes, but is faster
> for a single 8k write. Strange.

I'm not sure I believe these numbers at all... my experience is that
getting trustworthy disk I/O numbers is *not* easy.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 18:50:32
Message-ID: 200403181850.i2IIoWo11126@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > I have been poking around with our fsync default options to see if I can
> > improve them. One issue is that we never default to O_SYNC, but default
> > to O_DSYNC if it exists, which seems strange.
>
> As I recall, that was based on testing on some different platforms.
> It's not particularly "strange": O_SYNC implies writing at least two
> places on the disk (file and inode). O_DSYNC or fdatasync should
> theoretically be the fastest alternatives, O_SYNC and fsync the worst.

But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
fsync?

>
> > Compare fsync before and after write's close:
> > write, fsync, close 0.000707
> > write, close, fsync 0.000808
>
> What does that mean? You can't fsync a closed file.

You reopen and fsync.

> > This shows terrible O_SYNC performance for 2 8k writes, but is faster
> > for a single 8k write. Strange.
>
> I'm not sure I believe these numbers at all... my experience is that
> getting trustworthy disk I/O numbers is *not* easy.

These numbers were reproducible on all the platforms I tested.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Kurt Roeckx <Q(at)ping(dot)be>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 19:18:40
Message-ID: 20040318191840.GA7088@ping.be
Lists: pgsql-hackers pgsql-performance

On Thu, Mar 18, 2004 at 01:50:32PM -0500, Bruce Momjian wrote:
> > I'm not sure I believe these numbers at all... my experience is that
> > getting trustworthy disk I/O numbers is *not* easy.
>
> These numbers were reproducible on all the platforms I tested.

Just because they are reproducible does not mean they mean anything
in the real world.

Kurt


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kurt Roeckx <Q(at)ping(dot)be>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 19:22:10
Message-ID: 200403181922.i2IJMAd16258@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Kurt Roeckx wrote:
> On Thu, Mar 18, 2004 at 01:50:32PM -0500, Bruce Momjian wrote:
> > > I'm not sure I believe these numbers at all... my experience is that
> > > getting trustworthy disk I/O numbers is *not* easy.
> >
> > These numbers were reproducible on all the platforms I tested.
>
> Just because they are reproducible does not mean they mean anything
> in the real world.

OK, what better test do you suggest? Right now, there has been no
testing of these.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 19:28:07
Message-ID: 11459.1079638087@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> As I recall, that was based on testing on some different platforms.

> But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
> fsync?

It's what tested out as the best bet. I think we were using pgbench
as the test platform, which as you know I have doubts about, but at
least it is testing one actual write/sync pattern Postgres can generate.
The choice between the open flags and fdatasync/fsync depends a whole
lot on your writing patterns (how much data you tend to write between
fsync points), so I don't have a lot of faith in randomly-chosen test
programs as a guide to what to use for Postgres.

>> What does that mean? You can't fsync a closed file.

> You reopen and fsync.

Um. I just looked at that test program, and I think it needs a whole
lot of work yet.

* Some of the test cases count open()/close() overhead, some don't.
This is bad, especially on platforms like Solaris where open() is
notoriously expensive.

* You really cannot put any faith in measuring a single write,
especially on a machine that's not *completely* idle otherwise.
I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
measured the time for that. (And you have to think about how far
apart the fsyncs are in that sequence; you probably want to repeat the
measurement with several different fsync spacings.) It would also be
a good idea to compare writing 1000 successive blocks with rewriting
the same block 1000 times --- if the latter does not happen roughly
at the disk RPM rate, then we know the drive is lying and all the
numbers should be discarded as meaningless.

* The program is claimed to test whether you can write from one process
and fsync from another, but it does no such thing AFAICS.

BTW, rather than hard-wiring the test file name, why don't you let it be
specified on the command line? That would make it lots easier for
people to compare the performance of several disk drives, if they have
'em.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 19:55:29
Message-ID: 200403181955.i2IJtT721737@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> As I recall, that was based on testing on some different platforms.
>
> > But why prefer O_DSYNC over fdatasync if you don't prefer O_SYNC over
> > fsync?
>
> It's what tested out as the best bet. I think we were using pgbench
> as the test platform, which as you know I have doubts about, but at
> least it is testing one actual write/sync pattern Postgres can generate.
> The choice between the open flags and fdatasync/fsync depends a whole
> lot on your writing patterns (how much data you tend to write between
> fsync points), so I don't have a lot of faith in randomly-chosen test
> programs as a guide to what to use for Postgres.

I assume pgbench has so much variance that trying to see fsync changes
in there would be hopeless.

> >> What does that mean? You can't fsync a closed file.
>
> > You reopen and fsync.
>
> Um. I just looked at that test program, and I think it needs a whole
> lot of work yet.
>
> * Some of the test cases count open()/close() overhead, some don't.
> This is bad, especially on platforms like Solaris where open() is
> notoriously expensive.

The only one I saw that had an extra open() was the fsync after close
test. I added a do-nothing open/close to the previous test so they are
the same.

> * You really cannot put any faith in measuring a single write,
> especially on a machine that's not *completely* idle otherwise.
> I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
> measured the time for that. (And you have to think about how far

OK, it now measures a loop of 1000.

> apart the fsyncs are in that sequence; you probably want to repeat the
> measurement with several different fsync spacings.) It would also be
> a good idea to compare writing 1000 successive blocks with rewriting
> the same block 1000 times --- if the latter does not happen roughly
> at the disk RPM rate, then we know the drive is lying and all the
> numbers should be discarded as meaningless.

>
> * The program is claimed to test whether you can write from one process
> and fsync from another, but it does no such thing AFAICS.

It really just shows whether the fsync after the close has similar
timing to the one before the close. That was the best way I could think
to test it.

> BTW, rather than hard-wiring the test file name, why don't you let it be
> specified on the command line? That would make it lots easier for
> people to compare the performance of several disk drives, if they have
> 'em.

I have updated the test program in CVS.

New BSD/OS results:

Simple write timing:
write 0.034801

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 0.868831
write, close, fsync 0.717281

Compare one o_sync write to two:
one 16k o_sync write 10.121422
two 8k o_sync writes 4.405151

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 1.542213
(fdatasync unavailable)
write, fsync, 1.703689

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 4.498607
(fdatasync unavailable)
write, fsync, 2.473842

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Kurt Roeckx <Q(at)ping(dot)be>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:03:59
Message-ID: 20040318200359.GA8330@ping.be
Lists: pgsql-hackers pgsql-performance

On Thu, Mar 18, 2004 at 02:22:10PM -0500, Bruce Momjian wrote:
>
> OK, what better test do you suggest? Right now, there has been no
> testing of these.

I suggest you start by at least preallocating a 16 MB file
and doing the tests on that, to be at least somewhat similar to what
WAL does.

I have no idea what the access pattern is for normal WAL
operations or how many times it gets synched. Does it only do
f(data)sync() at commit time, or for every block it writes?

I think if you write more data you'll see more differences
between O_(D)SYNC and f(data)sync().

I guess it can depend on if you have lots of small transactions,
or more big ones.

At least try to make something that covers different access
patterns.
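
A minimal sketch of the kind of comparison suggested here, in Python rather than the C of the real test program. It preallocates a file, then times sequential 8 kB writes with O_SYNC against write-then-fsync and write-then-fdatasync. The function names, the 100-write default, and the use of Python are assumptions for illustration; this is not the src/tools/fsync program, and it assumes a Unix-like platform where os.O_SYNC exists.

```python
import os
import tempfile
import time

BLOCK = 8192                 # 8 kB, the block size used in the thread's tests
SEGMENT = 16 * 1024 * 1024   # 16 MB, the preallocation size suggested above


def preallocate(path, size=SEGMENT):
    """Fill the file with zeros so that later writes do not extend it."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())


def time_method(path, method, writes=100):
    """Time `writes` sequential 8 kB writes using one sync method."""
    flags = os.O_WRONLY
    if method == "open_sync":
        flags |= os.O_SYNC   # each write() returns only once it has been synced
    fd = os.open(path, flags)
    buf = b"x" * BLOCK
    start = time.perf_counter()
    try:
        for _ in range(writes):
            os.write(fd, buf)
            if method == "fsync":
                os.fsync(fd)
            elif method == "fdatasync":
                os.fdatasync(fd)
    finally:
        os.close(fd)
    return time.perf_counter() - start


def compare(path, writes=100):
    """Return {method name: elapsed seconds} for the methods available here."""
    methods = ["open_sync", "fsync"]
    if hasattr(os, "fdatasync"):   # not every platform provides fdatasync
        methods.append("fdatasync")
    return {m: time_method(path, m, writes) for m in methods}


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        preallocate(path)
        for name, secs in compare(path).items():
            print(f"{name:12s} {secs:.6f}")
    finally:
        os.unlink(path)
```

Raising `writes`, or syncing less often than once per write, would cover the "lots of small transactions versus a few big ones" cases; the numbers are only meaningful on the filesystem that would actually hold the WAL.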

Kurt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:08:48
Message-ID: 11718.1079640528@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> It's what tested out as the best bet. I think we were using pgbench
>> as the test platform, which as you know I have doubts about, but at
>> least it is testing one actual write/sync pattern Postgres can generate.

> I assume pgbench has so much variance that trying to see fsync changes
> in there would be hopeless.

The results were fairly reproducible, as I recall; else we'd have looked
for another test method. You may want to go back and consult the
pghackers archives.

>> * Some of the test cases count open()/close() overhead, some don't.

> The only one I saw that had an extra open() was the fsync after close
> test. I added a do-nothing open/close to the previous test so they are
> the same.

Why is it sensible to include open/close overhead in the "simple write"
case and not in the "o_sync write" cases, for instance? Doesn't seem
like a fair comparison to me. Adding the open overhead to all cases
might make it "fair", but it would also make it not what we want to
measure.

>> * The program is claimed to test whether you can write from one process
>> and fsync from another, but it does no such thing AFAICS.

> It really just shows whether the fsync after the close has similar
> timing to the one before the close. That was the best way I could think
> to test it.

Sure, but where's the "separate process" part? What this seems to test
is whether a single process can sync its own writes through a different
file descriptor; which is interesting but by no means the only thing we
need to be sure of if we want to make the bgwriter handle syncing.
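
A rough sketch of what a "separate process" variant could look like, written in Python for brevity. A child process writes a block and exits; the parent then opens its own descriptor and calls fsync. The helper name and the use of subprocess are assumptions for illustration. Note the limits: a successful fsync() return only shows that the call is accepted, not that the child's data actually reached the platter, which would need something like a power-cut test to verify.

```python
import os
import subprocess
import sys
import tempfile

# Program run in the child: overwrite the first 8 kB of the file, then exit
# without ever calling fsync.
WRITER = (
    "import sys\n"
    "with open(sys.argv[1], 'r+b') as f:\n"
    "    f.write(b'x' * 8192)\n"
)


def write_in_child_then_fsync(path):
    """Child process writes a block; this process fsyncs it afterwards."""
    with open(path, "wb") as f:          # create the file with one zero block
        f.write(b"\0" * 8192)
    subprocess.run([sys.executable, "-c", WRITER, path], check=True)
    fd = os.open(path, os.O_RDWR)        # a descriptor the writer never used
    try:
        os.fsync(fd)                     # sync on behalf of the other process
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read(8192)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        data = write_in_child_then_fsync(path)
        print("child data visible after parent fsync:", data == b"x" * 8192)
    finally:
        os.unlink(path)
```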

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kurt Roeckx <Q(at)ping(dot)be>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:09:25
Message-ID: 200403182009.i2IK9PJ24080@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Kurt Roeckx wrote:
> On Thu, Mar 18, 2004 at 02:22:10PM -0500, Bruce Momjian wrote:
> >
> > OK, what better test do you suggest? Right now, there has been no
> > testing of these.
>
> I suggest you start by at least preallocating a 16 MB file
> and doing the tests on that, to be at least somewhat similar to what
> WAL does.
>
> I have no idea what the access pattern is for normal WAL
> operations or how many times it gets synched. Does it only do
> f(data)sync() at commit time, or for every block it writes?
>
> I think if you write more data you'll see more differences
> between O_(D)SYNC and f(data)sync().
>
> I guess it can depend on if you have lots of small transactions,
> or more big ones.
>
> At least try to make something that covers different access
> patterns.

OK, I preallocated 16 MB. New results:

Simple write timing:
write 0.037900

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 0.692942
write, close, fsync 0.762524

Compare one o_sync write to two:
one 16k o_sync write 8.494621
two 8k o_sync writes 4.177680

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 1.836835
(fdatasync unavailable)
write, fsync, 1.780872

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 4.255614
(fdatasync unavailable)
write, fsync, 2.120843

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kurt Roeckx <Q(at)ping(dot)be>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:20:23
Message-ID: 11816.1079641223@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Kurt Roeckx <Q(at)ping(dot)be> writes:
> I have no idea what the access pattern is for normal WAL
> operations or how many times it gets synched. Does it only do
> f(data)sync() at commit time, or for every block it writes?

If we are using fsync/fdatasync, we issue those at commit time or when
completing a WAL segment. If we are using the open flags, then of
course there's no separate sync call.

My previous point about checking different fsync spacings corresponds to
different assumptions about average transaction size. I think a useful
tool for determining wal_sync_method has got to be able to reflect that
range of possibilities.
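
One way to picture "different fsync spacings" is the sketch below, in Python for brevity: write a fixed number of 8 kB blocks and call fsync after every N of them, where a small N stands in for small transactions and a large N for big ones. The function name and parameters are assumptions for illustration, not part of any existing tool.

```python
import os
import tempfile
import time

BLOCK = 8192  # 8 kB blocks, as in the thread's tests


def time_spacing(path, total_blocks=1000, blocks_per_sync=1):
    """Write total_blocks blocks, calling fsync every blocks_per_sync blocks.

    Returns (elapsed_seconds, number_of_fsync_calls).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    buf = b"x" * BLOCK
    syncs = 0
    start = time.perf_counter()
    try:
        for i in range(1, total_blocks + 1):
            os.write(fd, buf)
            if i % blocks_per_sync == 0:
                os.fsync(fd)
                syncs += 1
        if total_blocks % blocks_per_sync != 0:
            os.fsync(fd)      # sync the trailing partial group as well
            syncs += 1
    finally:
        os.close(fd)
    return time.perf_counter() - start, syncs


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        for spacing in (1, 2, 8, 64):
            secs, syncs = time_spacing(path, 1000, spacing)
            print(f"fsync every {spacing:3d} blocks: {secs:.3f}s, {syncs} syncs")
    finally:
        os.unlink(path)
```

Running the same loop with the file opened O_SYNC (and no fsync calls) would give the open-flag side of the comparison for each assumed transaction size.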

regards, tom lane


From: Kurt Roeckx <Q(at)ping(dot)be>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:26:21
Message-ID: 20040318202621.GB8330@ping.be
Lists: pgsql-hackers pgsql-performance

Here are my results on Linux 2.6.1 using cvs version 1.7.

Those times with > 20 seconds, you really hear the disk go crazy.

And I have the feeling something must be wrong. Those results
are reproducible.

Kurt

Simple write timing:
write 0.139558

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 8.249364
write, close, fsync 8.356813

Compare one o_sync write to two:
one 16k o_sync write 28.487650
two 8k o_sync writes 2.310304

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 1.010688
write, fdatasync 25.109604
write, fsync, 26.051218

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 2.212223
write, fdatasync 27.439907
write, fsync, 27.772294


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kurt Roeckx <Q(at)ping(dot)be>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:34:21
Message-ID: 200403182034.i2IKYLp27844@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Kurt Roeckx wrote:
> Here are my results on Linux 2.6.1 using cvs version 1.7.
>
> Those times with > 20 seconds, you really hear the disk go crazy.
>
> And I have the feeling something must be wrong. Those results
> are reproducible.
>

Wow, your O_SYNC times are great. Where can I buy some? :-)

Anyway, we do need to find a way to test this because obviously there is
huge platform variability.

---------------------------------------------------------------------------

>
> Kurt
>
>
> Simple write timing:
> write 0.139558
>
> Compare fsync times on write() and non-write() descriptor:
> (If the times are similar, fsync() can sync data written
> on a different descriptor.)
> write, fsync, close 8.249364
> write, close, fsync 8.356813
>
> Compare one o_sync write to two:
> one 16k o_sync write 28.487650
> two 8k o_sync writes 2.310304
>
> Compare file sync methods with one 8k write:
> (o_dsync unavailable)
> open o_sync, write 1.010688
> write, fdatasync 25.109604
> write, fsync, 26.051218
>
> Compare file sync methods with 2 8k writes:
> (The fastest should be used for wal_sync_method)
> (o_dsync unavailable)
> open o_sync, write 2.212223
> write, fdatasync 27.439907
> write, fsync, 27.772294
>
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kurt Roeckx <Q(at)ping(dot)be>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:39:58
Message-ID: 200403181239.58226.josh@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Tom, Bruce,

> My previous point about checking different fsync spacings corresponds to
> different assumptions about average transaction size. I think a useful
> tool for determining wal_sync_method has got to be able to reflect that
> range of possibilities.

Questions:
1) This is an OSS project. Why not just recruit a bunch of people on
PERFORMANCE and GENERAL to test the 4 different synch methods using real
databases? No test like reality, I say ....

2) Won't Jan's work on 7.5 memory and I/O management mean that we have to
re-evaluate synching anyway?

--
-Josh Berkus
Aglio Database Solutions
San Francisco


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: josh(at)agliodbs(dot)com
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kurt Roeckx <Q(at)ping(dot)be>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 20:49:20
Message-ID: 200403182049.i2IKnKK00246@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Josh Berkus wrote:
> Tom, Bruce,
>
> > My previous point about checking different fsync spacings corresponds to
> > different assumptions about average transaction size. I think a useful
> > tool for determining wal_sync_method has got to be able to reflect that
> > range of possibilities.
>
> Questions:
> 1) This is an OSS project. Why not just recruit a bunch of people on
> PERFORMANCE and GENERAL to test the 4 different synch methods using real
> databases? No test like reality, I say ....

Well, I wrote the program to allow testing. I don't see a complex test
as being that much better than a simple one. We don't need accurate
numbers. We just need to know if fsync or O_SYNC is faster.

>
> 2) Won't Jan's work on 7.5 memory and I/O management mean that we have to
> re-evaluate synching anyway?

No, it should not change sync issues.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: josh(at)agliodbs(dot)com
Cc: Kurt Roeckx <Q(at)ping(dot)be>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 21:00:54
Message-ID: 12261.1079643654@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> 1) This is an OSS project. Why not just recruit a bunch of people on
> PERFORMANCE and GENERAL to test the 4 different synch methods using real
> databases? No test like reality, I say ....

I agree --- that is likely to yield *far* more useful results than
any standalone test program, for the purpose of finding out what
wal_sync_method to use in real databases. However, there's a second
issue here: we would like to move sync/checkpoint responsibility into
the bgwriter, and that requires knowing whether it's valid to let one
process fsync on behalf of writes that were done by other processes.
That's got nothing to do with WAL sync performance. I think that it
would be sensible to make a test program that focuses on this one
specific question. (There has been some handwaving to the effect that
everybody knows this is safe on Unixen, but I question whether the
handwavers have seen the internals of HPUX or AIX for instance; and
besides we need to worry about Windows now.)

A third reason for having a simple test program is to confirm whether
your drives are syncing at all (cf. hdparm discussion).
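
The "are the drives syncing at all" check can be reduced to arithmetic. A truly synced rewrite of the same block has to wait for the platter to come around again, so a single spindle cannot do much more than one such rewrite per revolution. A small sketch in Python; the function names and the 1.5x slack factor are assumptions for illustration, and the reasoning applies to a single rotating disk without a battery-backed write cache.

```python
def max_rewrites_per_second(rpm):
    """Upper bound on synced rewrites of one block: one per revolution."""
    return rpm / 60.0


def drive_probably_caching(rewrites, elapsed_seconds, rpm, slack=1.5):
    """True if the measured rewrite rate is implausibly high for this RPM.

    `slack` is an arbitrary allowance for measurement noise (an assumption).
    """
    rate = rewrites / elapsed_seconds
    return rate > slack * max_rewrites_per_second(rpm)


if __name__ == "__main__":
    # A 7200 RPM disk turns 120 times per second.
    print(max_rewrites_per_second(7200))
    # 1000 synced rewrites in 1 second is far above 120/s: suspicious.
    print(drive_probably_caching(1000, 1.0, 7200))
    # 1000 synced rewrites in 9 seconds is about 111/s: plausible.
    print(drive_probably_caching(1000, 9.0, 7200))
```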

> 2) Won't Jan's work on 7.5 memory and I/O management mean that we have to
> re-evaluate synching anyway?

So far nothing's been done that touches WAL writing. However, I am
thinking about making the bgwriter process take some of the load of
writing WAL buffers (right now it only writes data-file buffers).
And you're right, after that happens we will need to re-measure.
The open flags will probably become considerably more attractive than
they are now, if the bgwriter handles most non-commit writes of WAL.
(We might also think of letting the bgwriter use a different sync method
than the backends do.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, Kurt Roeckx <Q(at)ping(dot)be>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 21:04:45
Message-ID: 12315.1079643885@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Well, I wrote the program to allow testing. I don't see a complex test
> as being that much better than a simple one. We don't need accurate
> numbers. We just need to know if fsync or O_SYNC is faster.

Faster than what? The thing everyone is trying to point out here is
that it depends on context, and we have little faith that this test
program creates a context similar to a live Postgres database.

regards, tom lane


From: Kurt Roeckx <Q(at)ping(dot)be>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 21:09:51
Message-ID: 20040318210951.GA8784@ping.be
Lists: pgsql-hackers pgsql-performance

On Thu, Mar 18, 2004 at 03:34:21PM -0500, Bruce Momjian wrote:
> Kurt Roeckx wrote:
> > Here are my results on Linux 2.6.1 using cvs version 1.7.
> >
> > Those times with > 20 seconds, you really hear the disk go crazy.
> >
> > And I have the feeling something must be wrong. Those results
> > are reproducible.
> >
>
> Wow, your O_SYNC times are great. Where can I buy some? :-)
>
> Anyway, we do need to find a way to test this because obviously there is
> huge platform variability.

New results with version 1.8:

Simple write timing:
write 0.150613

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 9.170472
write, close, fsync 8.851715

Compare one o_sync write to two:
one 16k o_sync write 2.617860
two 8k o_sync writes 2.563437

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 1.031721
write, fdatasync 25.599010
write, fsync, 26.192824

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 2.268718
write, fdatasync 27.029396
write, fsync, 27.399243


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-19 00:41:12
Message-ID: 20040319004112.GA19547@filer
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Well, I wrote the program to allow testing. I don't see a complex test
> > as being that much better than a simple one. We don't need accurate
> > numbers. We just need to know if fsync or O_SYNC is faster.
>
> Faster than what? The thing everyone is trying to point out here is
> that it depends on context, and we have little faith that this test
> program creates a context similar to a live Postgres database.

Note, too, that the preferred method isn't likely to depend just on the
operating system, it's likely to depend also on the filesystem type
being used.

Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs,
and that's just off the top of my head. I imagine the performance of
the various syncing methods will vary significantly between them.

It seems reasonable to me that decisions such as which sync method to
use should initially be made at installation time: have the test program
run on the target filesystem as part of the installation process, and
build the initial postgresql.conf based on the results. You might even
be able to do some additional testing such as measuring the difference
between random block access and sequential access, and again feed the
results into the postgresql.conf file. This is no substitute for
experience with the platform, but I expect it's likely to get you closer
to something optimal than doing nothing. The only question, of course,
is whether or not it's worth going to the effort when it may or may not
gain you a whole lot. Answering that is going to require some
experimentation with such an automatic configuration system.
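
The install-time idea can be sketched as a two-step helper: take timings measured on the target filesystem, pick the fastest method, and emit the matching postgresql.conf line. This is a hypothetical illustration in Python; the helper names are made up, no such installer step exists in the thread, and the sample numbers are the Linux 2.6.1 two-8k-write timings posted earlier with version 1.8 of the test program.

```python
def pick_sync_method(timings):
    """Return the method name with the smallest elapsed time."""
    if not timings:
        raise ValueError("no timings supplied")
    return min(timings, key=timings.get)


def conf_line(method):
    """Format the setting as it would appear in postgresql.conf."""
    return f"wal_sync_method = {method}\n"


if __name__ == "__main__":
    # Two-8k-write timings reported earlier in the thread (seconds).
    measured = {
        "open_sync": 2.268718,
        "fdatasync": 27.029396,
        "fsync": 27.399243,
    }
    print(conf_line(pick_sync_method(measured)), end="")
```

As the replies point out, a micro-benchmark winner is only a starting value; it would still need checking against a real workload on the same filesystem.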

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-19 04:08:31
Message-ID: 200403190408.i2J48VO12436@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> > It really just shows whether the fsync after the close has similar
> > timing to the one before the close. That was the best way I could think
> > to test it.
>
> Sure, but where's the "separate process" part? What this seems to test
> is whether a single process can sync its own writes through a different
> file descriptor; which is interesting but by no means the only thing we
> need to be sure of if we want to make the bgwriter handle syncing.

I am not sure how to easily test if a separate process can do the same.
I am sure it can be done, but for me it was enough to see that it works
in a single process. Unix isn't very process-centered for I/O, so I
don't think it would make much of a difference. Now, Win32, that might
be an issue.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-20 03:48:17
Message-ID: 20040320034817.GA9566@filer
Lists: pgsql-hackers pgsql-performance

I wrote:
> Note, too, that the preferred method isn't likely to depend just on the
> operating system, it's likely to depend also on the filesystem type
> being used.
>
> Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs,
> and that's just off the top of my head. I imagine the performance of
> the various syncing methods will vary significantly between them.

For what it's worth, my database throughput for transactions involving
a lot of inserts, updates, and deletes is about 12% faster using
fdatasync() than O_SYNC under Linux using JFS.

I'll run the test program and report my results with it as well, so
we'll be able to see if there's any consistency between it and the live
database.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: markw(at)osdl(dot)org
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: josh(at)agliodbs(dot)com, Q(at)ping(dot)be, pgman(at)candle(dot)pha(dot)pa(dot)us, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-22 17:33:59
Message-ID: 200403221734.i2MHYGE01546@mail.osdl.org
Lists: pgsql-hackers pgsql-performance

On 18 Mar, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> 1) This is an OSS project. Why not just recruit a bunch of people on
>> PERFORMANCE and GENERAL to test the 4 different synch methods using real
>> databases? No test like reality, I say ....
>
> I agree --- that is likely to yield *far* more useful results than
> any standalone test program, for the purpose of finding out what
> wal_sync_method to use in real databases. However, there's a second
> issue here: we would like to move sync/checkpoint responsibility into
> the bgwriter, and that requires knowing whether it's valid to let one
> process fsync on behalf of writes that were done by other processes.
> That's got nothing to do with WAL sync performance. I think that it
> would be sensible to make a test program that focuses on this one
> specific question. (There has been some handwaving to the effect that
> everybody knows this is safe on Unixen, but I question whether the
> handwavers have seen the internals of HPUX or AIX for instance; and
> besides we need to worry about Windows now.)

I could certainly do some testing if you want to see how DBT-2 does.
Just tell me what to do. ;)

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: markw(at)osdl(dot)org
Cc: josh(at)agliodbs(dot)com, Q(at)ping(dot)be, pgman(at)candle(dot)pha(dot)pa(dot)us, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-22 17:41:14
Message-ID: 6391.1079977274@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

markw(at)osdl(dot)org writes:
> I could certainly do some testing if you want to see how DBT-2 does.
> Just tell me what to do. ;)

Just do some runs that are identical except for the wal_sync_method
setting. Note that this should not have any impact on SELECT
performance, only insert/update/delete performance.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: markw(at)osdl(dot)org
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-22 17:42:43
Message-ID: 200403221742.i2MHghv15851@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

markw(at)osdl(dot)org wrote:
> On 18 Mar, Tom Lane wrote:
> > Josh Berkus <josh(at)agliodbs(dot)com> writes:
> >> 1) This is an OSS project. Why not just recruit a bunch of people on
> >> PERFORMANCE and GENERAL to test the 4 different synch methods using real
> >> databases? No test like reality, I say ....
> >
> > I agree --- that is likely to yield *far* more useful results than
> > any standalone test program, for the purpose of finding out what
> > wal_sync_method to use in real databases. However, there's a second
> > issue here: we would like to move sync/checkpoint responsibility into
> > the bgwriter, and that requires knowing whether it's valid to let one
> > process fsync on behalf of writes that were done by other processes.
> > That's got nothing to do with WAL sync performance. I think that it
> > would be sensible to make a test program that focuses on this one
> > specific question. (There has been some handwaving to the effect that
> > everybody knows this is safe on Unixen, but I question whether the
> > handwavers have seen the internals of HPUX or AIX for instance; and
> > besides we need to worry about Windows now.)
>
> I could certainly do some testing if you want to see how DBT-2 does.
> Just tell me what to do. ;)

To test, you would run src/tools/fsync from the CVS version, find the
fastest fsync method in the last group of outputs, then try each
wal_sync_method setting to see whether the one that tools/fsync says is
fastest actually is fastest. However, it might be better to run your
tests first, get some indication of how frequently writes and fsyncs
go to WAL, and modify tools/fsync to match what your DBT-2 test does.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: markw(at)osdl(dot)org, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, pgman(at)candle(dot)pha(dot)pa(dot)us, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-25 06:21:35
Message-ID: 40627A6F.9010202@colorfullife.com
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:

>markw(at)osdl(dot)org writes:
>
>
>>I could certainly do some testing if you want to see how DBT-2 does.
>>Just tell me what to do. ;)
>>
>>
>
>Just do some runs that are identical except for the wal_sync_method
>setting. Note that this should not have any impact on SELECT
>performance, only insert/update/delete performance.
>
>
I've made a test run that compares fsync and fdatasync: The performance
was identical:
- with fdatasync:

http://khack.osdl.org/stp/290607/

- with fsync:
http://khack.osdl.org/stp/290483/

I don't understand why. Mark - is there a battery backed write cache in
the raid controller, or something similar that might skew the results?
The test generates quite a lot of wal traffic - around 1.5 MB/sec.
Perhaps the writes are so large that the added overhead of syncing the
inode is not noticeable?
Is the pg_xlog directory on a separate drive?

Btw, it's possible to request such tests through the web-interface, see
http://www.osdl.org/lab_activities/kernel_testing/stp/script_param.html

--
Manfred


From: markw(at)osdl(dot)org
To: manfred(at)colorfullife(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, pgman(at)candle(dot)pha(dot)pa(dot)us, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-25 17:16:40
Message-ID: 200403251716.i2PHGh222327@mail.osdl.org
Lists: pgsql-hackers pgsql-performance

On 25 Mar, Manfred Spraul wrote:
> Tom Lane wrote:
>
>>markw(at)osdl(dot)org writes:
>>
>>
>>>I could certainly do some testing if you want to see how DBT-2 does.
>>>Just tell me what to do. ;)
>>>
>>>
>>
>>Just do some runs that are identical except for the wal_sync_method
>>setting. Note that this should not have any impact on SELECT
>>performance, only insert/update/delete performance.
>>
>>
> I've made a test run that compares fsync and fdatasync: The performance
> was identical:
> - with fdatasync:
>
> http://khack.osdl.org/stp/290607/
>
> - with fsync:
> http://khack.osdl.org/stp/290483/
>
> I don't understand why. Mark - is there a battery backed write cache in
> the raid controller, or something similar that might skew the results?
> The test generates quite a lot of wal traffic - around 1.5 MB/sec.
> Perhaps the writes are so large that the added overhead of syncing the
> inode is not noticeable?
> Is the pg_xlog directory on a separate drive?
>
> Btw, it's possible to request such tests through the web-interface, see
> http://www.osdl.org/lab_activities/kernel_testing/stp/script_param.html

We have 2 Adaptec 2200s controllers, without the battery backed add-on,
connected to four 10-disk arrays in those systems. I can't think of
anything off hand that would skew the results.

The pg_xlog directory is not on a separate drive. I haven't found the
best way to lay out the drives on those systems yet, so I just have
everything on a 28 drive lvm2 volume.

Mark


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: markw(at)osdl(dot)org
Cc: manfred(at)colorfullife(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-25 18:52:56
Message-ID: 200403251852.i2PIquv29582@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

markw(at)osdl(dot)org wrote:
> > I've made a test run that compares fsync and fdatasync: The performance
> > was identical:
> > - with fdatasync:
> >
> > http://khack.osdl.org/stp/290607/
> >
> > - with fsync:
> > http://khack.osdl.org/stp/290483/
> >
> > I don't understand why. Mark - is there a battery backed write cache in
> > the raid controller, or something similar that might skew the results?
> > The test generates quite a lot of wal traffic - around 1.5 MB/sec.
> > Perhaps the writes are so large that the added overhead of syncing the
> > inode is not noticeable?
> > Is the pg_xlog directory on a separate drive?
> >
> > Btw, it's possible to request such tests through the web-interface, see
> > http://www.osdl.org/lab_activities/kernel_testing/stp/script_param.html
>
> We have 2 Adaptec 2200s controllers, without the battery backed add-on,
> connected to four 10-disk arrays in those systems. I can't think of
> anything off hand that would skew the results.
>
> The pg_xlog directory is not on a separate drive. I haven't found the
> best way to lay out the drives on those systems yet, so I just have
> everything on a 28 drive lvm2 volume.

We don't actually extend the WAL file during writes (it is preallocated),
and the access/modification timestamp is only in seconds, so I wonder if
the OS only updates the inode once a second. What else would change in the
inode more frequently than once a second?
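
A rough way to try this question out on a given system is to overwrite a
preallocated file in place and time fsync() against fdatasync(). The sketch
below is only an illustration in Python, not PostgreSQL's WAL code or
test_fsync. The helper names, the block count and the temporary path are
made up for the example, and os.fdatasync is assumed to exist on the
platform (the code falls back to os.fsync where it does not).

```python
import os
import tempfile
import time

BLOCK = 8192  # 8k blocks, the write size used in the test_fsync runs

# Assumption: fall back to fsync on platforms without fdatasync.
fdatasync = getattr(os, "fdatasync", os.fsync)


def preallocate(path, nblocks):
    """Create a file of nblocks zero-filled blocks and flush it to disk."""
    with open(path, "wb") as f:
        f.write(b"\0" * (BLOCK * nblocks))
        f.flush()
        os.fsync(f.fileno())


def time_overwrites(path, sync_fn, nblocks):
    """Overwrite each block in place, calling sync_fn(fd) after every write.

    The file is never extended, so its size stays constant throughout.
    Returns the elapsed wall-clock time in seconds.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        start = time.perf_counter()
        for i in range(nblocks):
            os.pwrite(fd, b"x" * BLOCK, i * BLOCK)
            sync_fn(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "test.out")  # hypothetical scratch file
        preallocate(path, 64)
        print("write, fsync     %f" % time_overwrites(path, os.fsync, 64))
        print("write, fdatasync %f" % time_overwrites(path, fdatasync, 64))
```

To say anything about a real WAL volume, the scratch file would have to live
on the same filesystem as pg_xlog; a temporary directory may sit on a
different device or on tmpfs.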

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, markw(at)osdl(dot)org
Cc: manfred(at)colorfullife(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, Q(at)ping(dot)be, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-25 19:10:55
Message-ID: 200403251110.55889.josh@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Bruce,

> We don't actually extend the WAL file during writes (preallocated), and
> the access/modification timestamp is only in seconds, so I wonder if the
> OS only updates the inode once a second. What else would change in the
> inode more frequently than once a second?

What about really big writes, when WAL files are getting added/recycled?

--
-Josh Berkus
Aglio Database Solutions
San Francisco


From: markw(at)osdl(dot)org
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: josh(at)agliodbs(dot)com, Q(at)ping(dot)be, pgman(at)candle(dot)pha(dot)pa(dot)us, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-25 21:46:56
Message-ID: 200403252147.i2PLkx209812@mail.osdl.org
Lists: pgsql-hackers pgsql-performance

On 22 Mar, Tom Lane wrote:
> markw(at)osdl(dot)org writes:
>> I could certainly do some testing if you want to see how DBT-2 does.
>> Just tell me what to do. ;)
>
> Just do some runs that are identical except for the wal_sync_method
> setting. Note that this should not have any impact on SELECT
> performance, only insert/update/delete performance.

Ok, here are the results I have from my 4-way Xeon system, with a 14-disk
volume for the log and a 52-disk volume for everything else:
http://developer.osdl.org/markw/pgsql/wal_sync_method.html

7.5devel-200403222

wal_sync_method        metric
default (fdatasync)    1935.28
fsync                  1613.92

# ./test_fsync -f /opt/pgdb/dbt2/pg_xlog/test.out
Simple write timing:
        write                    0.018787

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
        write, fsync, close     13.057781
        write, close, fsync     13.311313

Compare one o_sync write to two:
        one 16k o_sync write     6.515122
        two 8k o_sync writes    12.455124

Compare file sync methods with one 8k write:
        (o_dsync unavailable)
        open o_sync, write       6.270724
        write, fdatasync        13.275225
        write, fsync,           13.359847

Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write      12.479563
        write, fdatasync        13.651709
        write, fsync,           14.000240
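
For reading these figures side by side, the relative differences can be
worked out with a few lines of Python. This is just arithmetic on the
numbers reported in this message; the variable and function names are
invented for the example and nothing here is measured anew.

```python
# Figures copied from the results in this message:
# the DBT-2 metric per wal_sync_method, and test_fsync timings in seconds.
dbt2_metric = {"fdatasync": 1935.28, "fsync": 1613.92}
one_8k_write = {"open_sync": 6.270724, "fdatasync": 13.275225, "fsync": 13.359847}
o_sync_16k = {"one 16k write": 6.515122, "two 8k writes": 12.455124}


def ratio(a, b):
    """Return a / b, i.e. how many times larger a is than b."""
    return a / b


# DBT-2: metric with fdatasync relative to fsync.
dbt2_gain = ratio(dbt2_metric["fdatasync"], dbt2_metric["fsync"])

# test_fsync: time for fsync relative to fdatasync, single 8k write.
fsync_vs_fdatasync = ratio(one_8k_write["fsync"], one_8k_write["fdatasync"])

# test_fsync: two 8k o_sync writes relative to one 16k o_sync write.
two_vs_one = ratio(o_sync_16k["two 8k writes"], o_sync_16k["one 16k write"])

if __name__ == "__main__":
    print("DBT-2 fdatasync / fsync:        %.3f" % dbt2_gain)
    print("8k write fsync / fdatasync:     %.3f" % fsync_vs_fdatasync)
    print("two 8k / one 16k o_sync writes: %.3f" % two_vs_one)
```

On these figures the DBT-2 metric is roughly 20% higher with fdatasync,
while the single-write test_fsync timings for fsync and fdatasync differ
by under 1%.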


From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: markw(at)osdl(dot)org
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, pgman(at)candle(dot)pha(dot)pa(dot)us, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-26 06:25:53
Message-ID: 4063CCF1.2030307@colorfullife.com
Lists: pgsql-hackers pgsql-performance

markw(at)osdl(dot)org wrote:

>Compare file sync methods with one 8k write:
> (o_dsync unavailable)
> open o_sync, write 6.270724
> write, fdatasync 13.275225
> write, fsync, 13.359847
>
>
Odd. Which filesystem, which kernel? It seems fdatasync is broken and
syncs the inode, too.

--
Manfred


From: markw(at)osdl(dot)org
To: manfred(at)colorfullife(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, pgman(at)candle(dot)pha(dot)pa(dot)us, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-26 16:09:43
Message-ID: 200403261609.i2QG9l220991@mail.osdl.org
Lists: pgsql-hackers pgsql-performance

On 26 Mar, Manfred Spraul wrote:
> markw(at)osdl(dot)org wrote:
>
>>Compare file sync methods with one 8k write:
>> (o_dsync unavailable)
>> open o_sync, write 6.270724
>> write, fdatasync 13.275225
>> write, fsync, 13.359847
>>
>>
> Odd. Which filesystem, which kernel? It seems fdatasync is broken and
> syncs the inode, too.

It's linux-2.6.5-rc1 with ext2 filesystems.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: markw(at)osdl(dot)org
Cc: manfred(at)colorfullife(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-26 16:54:59
Message-ID: 200403261654.i2QGsxt04146@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance

markw(at)osdl(dot)org wrote:
> On 26 Mar, Manfred Spraul wrote:
> > markw(at)osdl(dot)org wrote:
> >
> >>Compare file sync methods with one 8k write:
> >> (o_dsync unavailable)
> >> open o_sync, write 6.270724
> >> write, fdatasync 13.275225
> >> write, fsync, 13.359847
> >>
> >>
> > Odd. Which filesystem, which kernel? It seems fdatasync is broken and
> > syncs the inode, too.
>
> It's linux-2.6.5-rc1 with ext2 filesystems.

Would you benchmark open_sync for wal_sync_method too?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: markw(at)osdl(dot)org
To: pgman(at)candle(dot)pha(dot)pa(dot)us
Cc: manfred(at)colorfullife(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, josh(at)agliodbs(dot)com, Q(at)ping(dot)be, markir(at)paradise(dot)net(dot)nz, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-26 17:00:56
Message-ID: 200403261701.i2QH0x203524@mail.osdl.org
Lists: pgsql-hackers pgsql-performance

On 26 Mar, Bruce Momjian wrote:
> markw(at)osdl(dot)org wrote:
>> On 26 Mar, Manfred Spraul wrote:
>> > markw(at)osdl(dot)org wrote:
>> >
>> >>Compare file sync methods with one 8k write:
>> >> (o_dsync unavailable)
>> >> open o_sync, write 6.270724
>> >> write, fdatasync 13.275225
>> >> write, fsync, 13.359847
>> >>
>> >>
>> > Odd. Which filesystem, which kernel? It seems fdatasync is broken and
>> > syncs the inode, too.
>>
>> It's linux-2.6.5-rc1 with ext2 filesystems.
>
> Would you benchmark open_sync for wal_sync_method too?

Oh yeah. Will try to get results later today.

Mark


From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-26 23:14:59
Message-ID: 20040326231459.GA23615@gp.word-to-the-wise.com
Lists: pgsql-hackers pgsql-performance

On Fri, Mar 26, 2004 at 07:25:53AM +0100, Manfred Spraul wrote:

> >Compare file sync methods with one 8k write:
> > (o_dsync unavailable)
> > open o_sync, write 6.270724
> > write, fdatasync 13.275225
> > write, fsync, 13.359847
> >
> >
> Odd. Which filesystem, which kernel? It seems fdatasync is broken and
> syncs the inode, too.

This may be relevant.

From the man page for fdatasync on a moderately recent RedHat installation:

BUGS
Currently (Linux 2.2) fdatasync is equivalent to fsync.

Cheers,
Steve