Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zeugswetter Andreas SB <ZeugswetterA(at)wien(dot)spardat(dot)at>
Cc: "Mikheev, Vadim" <vmikheev(at)SECTORBASE(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: AW: AW: AW: WAL does not recover gracefully from out-of-disk-space
Date: 2001-03-09 18:42:24
Message-ID: 7642.984163344@sss.pgh.pa.us
Lists: pgsql-hackers

Zeugswetter Andreas SB <ZeugswetterA(at)wien(dot)spardat(dot)at> writes:
>> This seems odd. As near as I can tell, O_SYNC is simply a command to do
>> fsync implicitly during each write call. It cannot save any I/O unless
>> I'm missing something significant. Where is the performance difference
>> coming from?

> Yes, odd, but sure very reproducible here.

I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC
(defined to do the equivalent of fdatasync()), and got truly fascinating
results. Apparently, on this platform these flags change the kernel's
buffering behavior! Observe:

$ gcc -Wall -O tfsync.c
$ time a.out

real 1m0.32s
user 0m0.02s
sys 0m16.16s
$ gcc -Wall -O -DINIT_WRITE tfsync.c
$ time a.out

real 1m15.11s
user 0m0.04s
sys 0m32.76s

Note the large amount of system time here, and the fact that the extra
time in INIT_WRITE is all system time. I have previously observed that
fsync() on HPUX 10.20 appears to iterate through every kernel disk
buffer belonging to the file, presumably checking their dirtybits one by
one. The INIT_WRITE form loses because each fsync in the second loop
has to iterate through a full 16MB worth of buffers, whereas without
INIT_WRITE there will only be as many buffers as the amount of file
we've filled so far. (On this platform, it'd probably be a win to use
log segments smaller than 16MB...) It's interesting that there's no
visible I/O cost here for the extra write pass --- the extra I/O must be
completely overlapped with the extra system time.
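
For readers without the attachment, here is a minimal sketch of the kind of loop being timed: sequential block writes with an fsync() after each one, optionally preceded by a zero-fill pass over the whole file. The function name, the 8K block size and the overall structure are assumptions for illustration, not the contents of tfsync.c.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192     /* assumed block size, not taken from tfsync.c */

/*
 * Write nblocks blocks sequentially to path, calling fsync() after each.
 * If prefill is nonzero, the file is first written out with zeros and
 * synced once (the role INIT_WRITE is assumed to play in the timings).
 * Returns 0 on success, -1 on any error.
 */
int write_with_fsync(const char *path, int nblocks, int prefill)
{
    char block[BLOCK_SIZE];
    int fd, i;

    memset(block, 0, sizeof(block));
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (prefill)
    {
        /* First pass: allocate every block, then rewind. */
        for (i = 0; i < nblocks; i++)
            if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
                goto fail;
        if (fsync(fd) != 0 || lseek(fd, 0, SEEK_SET) != 0)
            goto fail;
    }

    /* Timed pass: one write and one fsync per block. */
    memset(block, 'x', sizeof(block));
    for (i = 0; i < nblocks; i++)
    {
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
            goto fail;
        if (fsync(fd) != 0)
            goto fail;
    }
    return close(fd);

fail:
    close(fd);
    return -1;
}
```

Calling write_with_fsync("testfile", 2048, 1) would cover a 16MB file under these assumptions; with prefill set to 0 the file grows as the loop proceeds, so every fsync also has to push out the changed file length.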

$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC tfsync.c
$ time a.out

real 0m45.04s
user 0m0.02s
sys 0m0.83s

We just bought back almost all the system time. The only possible
explanation is that this way either doesn't keep the buffers from prior
blocks, or does not scan them for dirtybits. I note that the open(2)
man page is phrased so that O_SYNC is actually defined not to fsync the
whole file, but only the part you just wrote --- I wonder if it's
actually implemented that way?
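
As a point of comparison, a minimal sketch of the O_SYNC form: the flag is passed to open(), and each write() is then expected to return only after the data has reached disk, so no explicit fsync() appears in the loop. The function name and block size are again assumptions for illustration, not taken from tfsync.c.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192     /* assumed block size */

/*
 * Write nblocks blocks sequentially to a file opened with O_SYNC.
 * Each write() is synchronous, so there is no fsync() call here.
 * Returns 0 on success, -1 on any error.
 */
int write_with_osync(const char *path, int nblocks)
{
    char block[BLOCK_SIZE];
    int fd, i;

    memset(block, 'x', sizeof(block));
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    for (i = 0; i < nblocks; i++)
    {
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
        {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}
```

Whether the kernel syncs only the range just written or the whole file is exactly the implementation question raised above; the code looks the same either way.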

$ gcc -Wall -O -DINIT_WRITE -DUSE_SPARSE tfsync.c
$ time a.out

real 1m2.96s
user 0m0.02s
sys 0m27.11s
$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DUSE_SPARSE tfsync.c
$ time a.out

real 1m1.34s
user 0m0.01s
sys 0m0.59s

Sparse initialization wins a little in the non-O_SYNC case, but loses
when O_SYNC is on. Not sure why; perhaps it alters the
amount of I/O that has to be done for indirect blocks?
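
A minimal sketch of sparse initialization, assuming USE_SPARSE means seeking to the last byte of the file and writing only that byte, so the file gets its full length without every data block being written. The function name is hypothetical, and the attached program may do this differently.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create path with length size by writing a single zero byte at offset
 * size - 1. Blocks before that offset are left unallocated on
 * filesystems that support holes.
 * Returns the resulting file length, or -1 on error.
 */
off_t init_sparse(const char *path, off_t size)
{
    int fd;
    off_t end;

    if (size <= 0)
        return -1;
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* "" is a one-byte array holding '\0', so this writes one zero byte. */
    if (lseek(fd, size - 1, SEEK_SET) != size - 1 ||
        write(fd, "", 1) != 1 ||
        fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }

    end = lseek(fd, 0, SEEK_END);
    if (close(fd) != 0)
        return -1;
    return end;
}
```

With a file set up this way the length is already final, but later writes still have to allocate data blocks (and possibly indirect blocks), which is one place the extra I/O could come from.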

$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.out

real 0m21.40s
user 0m0.02s
sys 0m0.60s

And the pièce de résistance: O_DSYNC *actually works* here, even though
I previously found that the fdatasync() call is stubbed to fsync() in
libc! This old HP box is built like a tank and has a similar lack of
attention to noise level ;-) so I can very easily tell by ear that I am
not getting back-and-forth seeks in this last case, even if the timing
didn't prove it to be true.
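
A minimal sketch of the O_DSYNC case, assuming the file has already been filled to its full length so that overwriting a block changes no metadata that has to be flushed. The function name, the block size and the fallback to O_SYNC where O_DSYNC is not defined are assumptions for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192     /* assumed block size */

/* Fall back to O_SYNC on platforms whose headers lack O_DSYNC. */
#ifdef O_DSYNC
#define SYNC_FLAG O_DSYNC
#else
#define SYNC_FLAG O_SYNC
#endif

/*
 * Overwrite the first nblocks blocks of an existing, pre-filled file in
 * place. Because the length does not change, a data-only sync has no
 * inode update to write between blocks.
 * Returns 0 on success, -1 on any error (including a missing file).
 */
int overwrite_with_odsync(const char *path, int nblocks)
{
    char block[BLOCK_SIZE];
    int fd, i;

    memset(block, 'y', sizeof(block));
    fd = open(path, O_WRONLY | SYNC_FLAG);
    if (fd < 0)
        return -1;

    for (i = 0; i < nblocks; i++)
    {
        if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
        {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}
```

The function deliberately omits O_CREAT: it is only meaningful on a file that an earlier pass has already written out in full.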

$ gcc -Wall -O -DUSE_ODSYNC tfsync.c
$ time a.out

real 1m1.56s
user 0m0.02s
sys 0m0.67s

Without INIT_WRITE, we are back to essentially the performance of fsync
even though we use O_DSYNC. This is expected since the inode must be
written to change the EOF value. Interestingly, the system time is
small, whereas in my first example it was large; but the elapsed time
is the same. Evidently the system time is nearly all overlapped with
I/O in the first example.

At least on this platform, it would definitely be worthwhile to use
O_DSYNC even if that meant fsync per write rather than per transaction.
Can anyone else reproduce these results?

I attach my modified version of Andreas' program. Note I do not believe
his assertion that close() implies fsync() --- on the machines I've
used, it demonstrably does not sync. You'll also note that I made the
lseek optional in the second loop. This appears to make no real
difference, so I didn't include timings with the lseek enabled.
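
Since close() cannot be relied on to sync, a writer that needs durability has to call fsync() itself before closing. A minimal sketch, with a hypothetical helper name:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to path and force them to disk before
 * closing. The fsync() is explicit because close() alone makes no
 * durability promise.
 * Returns 0 on success, -1 on any error.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int ok;

    if (fd < 0)
        return -1;

    ok = write(fd, buf, len) == (ssize_t) len;
    ok = ok && fsync(fd) == 0;      /* skipped if the write failed */
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```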

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 1.2 KB
