Re: WAL and commit_delay

Lists:	pgsql-hackers

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	WAL and commit_delay
Date:	2001-02-17 18:05:53
Message-ID:	200102171805.NAA24180@candle.pha.pa.us

I want to give some background on commit_delay, its initial purpose, and
possible options.

First, looking at the process that happens during a commit:

write() - copy WAL dirty page to kernel disk buffer
fsync() - force WAL kernel disk buffer to disk platter

fsync() takes much longer than write().
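
A minimal sketch of that two-step sequence, assuming the WAL file is an ordinary file descriptor. The function name commit_wal_page is hypothetical and is not taken from the PostgreSQL sources:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: copy one WAL page into the
 * kernel's buffer cache with write(), then force it to disk with fsync().
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
int commit_wal_page(int fd, const char *page, size_t len)
{
    size_t done = 0;

    assert(page != NULL);

    /* Step 1: write() can be short or interrupted, so loop until done. */
    while (done < len) {
        ssize_t n = write(fd, page + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        done += (size_t) n;
    }

    /* Step 2: the expensive part, waiting for the disk. */
    return fsync(fd);
}
```

To try it, open a scratch file with open(path, O_WRONLY | O_CREAT, 0600) and pass the descriptor in. Timing the two steps separately is one way to see how much of the commit cost is the fsync().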

What Vadim doesn't want is:

time    backend 1       backend 2
----    ---------       ---------
0       write()
1       fsync()         write()
2                       fsync()

This would be better as:

time    backend 1       backend 2
----    ---------       ---------
0       write()
1                       write()
2       fsync()         fsync()

This was the purpose of the commit_delay. Having two fsync()'s is not a
problem because only one will see there are dirty buffers. The other
will probably either return right away, or wait for the other's fsync()
to complete.

With the delay, it looks like:

time    backend 1       backend 2
----    ---------       ---------
0       write()
1       sleep()         write()
2       fsync()         sleep()
3                       fsync()

Which shows the second fsync() doing nothing, which is good, because
there are no dirty buffers at that time. However, a very possible
circumstance is:

time    backend 1       backend 2       backend 3
----    ---------       ---------       ---------
0       write()
1       sleep()         write()
2       fsync()         sleep()         write()
3                       fsync()         sleep()
4                                       fsync()

In this case, the fsync() by backend 2 does indeed do some work because
it also flushes backend 3's write(). Frankly, I don't see how the sleep does
much except delay things because it doesn't have any smarts about when
the delay is useful, and when it is useless. Without that feedback, I
recommend removing the entire setting. For single backends, the sleep
is clearly a loser.

Another situation it cannot deal with is:

time    backend 1       backend 2
----    ---------       ---------
0       write()
1       sleep()
2       fsync()         write()
3                       sleep()
4                       fsync()

My solution can't deal with this either.

---------------------------------------------------------------------------

The quick fix is to remove the commit_delay code. A more elaborate
performance boost would be to have each backend get feedback from
other backends, so they can block and wait for other about-to-fsync
backends before fsync(). This allows the write()s to bunch up before
the fsync().

Here is the single backend case, which experiences no delays:

time    backend 1       backend 2
----    ---------       ---------
0       get_shlock()
1       write()
2       rel_shlock()
3       get_exlock()
4       rel_exlock()
5       fsync()

Here is the two-backend case, which shows both write()'s completing
before the fsync()'s:

time    backend 1       backend 2
----    ---------       ---------
0       get_shlock()
1       write()
2       rel_shlock()    get_shlock()
3       get_exlock()    write()
4                       rel_shlock()
5       rel_exlock()
6       fsync()         get_exlock()
7                       rel_exlock()
8                       fsync()

Contrast that with the first two-backend case presented above:

time    backend 1       backend 2
----    ---------       ---------
0       write()
1       fsync()         write()
2                       fsync()

Now, it is my understanding that instead of just shared locking around
the write()'s, we could block the entire commit code, so the backend can
signal to other about-to-fsync backends to wait.

I believe our existing lock code can be used for the locking/unlocking.
We can just lock a random, unused table like pg_log or something.
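
The shared/exclusive sequence from the two-backend table can be sketched as follows. This is only an illustration under loud assumptions: flock() advisory locks on a separate lock file stand in for the backend lock manager, each backend is assumed to open the lock file itself, and commit_with_barrier is a hypothetical name:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch of the proposal: hold a shared lock while writing,
 * then take and drop an exclusive lock as a barrier before fsync().
 * flock() is a stand-in for PostgreSQL's own lock code.
 * Returns 0 on success, -1 on any failure.
 */
int commit_with_barrier(int lock_fd, int wal_fd, const char *rec, size_t len)
{
    assert(rec != NULL);

    /* get_shlock(): announce "I am in the middle of a WAL write()". */
    if (flock(lock_fd, LOCK_SH) != 0)
        return -1;
    ssize_t n = write(wal_fd, rec, len);
    /* rel_shlock(): released even if the write() failed. */
    if (flock(lock_fd, LOCK_UN) != 0 || n != (ssize_t) len)
        return -1;

    /* get_exlock(): blocks until every other holder of the shared lock
     * has finished its write() and released it. */
    if (flock(lock_fd, LOCK_EX) != 0)
        return -1;
    /* rel_exlock(): the lock is only a barrier, so drop it right away. */
    if (flock(lock_fd, LOCK_UN) != 0)
        return -1;

    /* By now the writes that were in flight are in the kernel buffers,
     * so one fsync() can cover all of them. */
    return fsync(wal_fd);
}
```

With a single backend the exclusive lock is granted immediately, which matches the no-delay single-backend table.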

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 18:46:22
Message-ID:	4356.982435582@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> With the delay, it looks like:

> time    backend 1       backend 2
> ----    ---------       ---------
> 0       write()
> 1       sleep()         write()
> 2       fsync()         sleep()
> 3                       fsync()

Actually ... take a close look at the code. The delay is done in
xact.c between XLogInsert(commitrecord) and XLogFlush(). As near
as I can tell, both the write() and the fsync() will happen in
XLogFlush(). This means the delay is just plain broken: placed
there, it cannot do anything except waste time.

Another thing I am wondering about is why we're not using fdatasync(),
where available, instead of fsync(). The whole point of preallocating
the WAL files is to make fdatasync safe, no?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 18:55:28
Message-ID:	4392.982436128@sss.pgh.pa.us

I wrote:
> Actually ... take a close look at the code. The delay is done in
> xact.c between XLogInsert(commitrecord) and XLogFlush(). As near
> as I can tell, both the write() and the fsync() will happen in
> XLogFlush(). This means the delay is just plain broken: placed
> there, it cannot do anything except waste time.

Uh ... scratch that ... never mind. The point is that we've inserted
our commit record into the WAL output buffer. Now we are sleeping
in the hope that some other backend will do both the write and the
fsync for us, and that when we eventually call XLogFlush() it will find
nothing to do. So the delay is not in the wrong place.
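
A toy model of why the flush can find nothing to do. The struct and function names are hypothetical, positions are plain byte counts, and no real I/O or concurrency is involved; it only shows the bookkeeping:

```c
#include <assert.h>

/* Toy model, not PostgreSQL's data structures. */
struct wal_state {
    long inserted;     /* end of the records copied into the WAL buffer */
    long flushed;      /* end of the records known to be on disk */
    int  flush_calls;  /* how many real write()+fsync() rounds happened */
};

/* Rough analogue of XLogInsert(): append a record, return its end position. */
long wal_insert(struct wal_state *w, long record_len)
{
    assert(record_len > 0);
    w->inserted += record_len;
    return w->inserted;
}

/* Rough analogue of XLogFlush(): make everything up to 'upto' durable. */
void wal_flush(struct wal_state *w, long upto)
{
    if (upto <= w->flushed)
        return;  /* another backend already flushed past our record */

    /* A real flush would write() and fsync() here. It covers every record
     * inserted so far, not only the caller's. */
    w->flushed = w->inserted;
    w->flush_calls++;
}
```

If backend 1 inserts and sleeps, backend 2 inserts and flushes, and backend 1 then wakes up and calls wal_flush() for its own record, the second call returns without doing any work.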

> Another thing I am wondering about is why we're not using fdatasync(),
> where available, instead of fsync(). The whole point of preallocating
> the WAL files is to make fdatasync safe, no?

This still looks like it'd be a win, by reducing the number of seeks
needed to complete a WAL logfile flush. Right now, each XLogFlush
requires writing both the file's data area and its inode.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 19:05:17
Message-ID:	200102171905.OAA28285@candle.pha.pa.us

> Actually ... take a close look at the code. The delay is done in
> xact.c between XLogInsert(commitrecord) and XLogFlush(). As near
> as I can tell, both the write() and the fsync() will happen in
> XLogFlush(). This means the delay is just plain broken: placed
> there, it cannot do anything except waste time.

I see. :-(

> Another thing I am wondering about is why we're not using fdatasync(),
> where available, instead of fsync(). The whole point of preallocating
> the WAL files is to make fdatasync safe, no?

I don't have fdatasync() here. How does it compare to fsync()?

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 19:07:11
Message-ID:	200102171907.OAA28383@candle.pha.pa.us

> > Another thing I am wondering about is why we're not using fdatasync(),
> > where available, instead of fsync(). The whole point of preallocating
> > the WAL files is to make fdatasync safe, no?
>
> This still looks like it'd be a win, by reducing the number of seeks
> needed to complete a WAL logfile flush. Right now, each XLogFlush
> requires writing both the file's data area and its inode.

Don't we have to fsync the inode too? Actually, I was hoping sequential
fsync's could sit on the WAL disk track, but I can imagine it has to
seek around to hit both areas.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 19:44:55
Message-ID:	4540.982439095@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Another thing I am wondering about is why we're not using fdatasync(),
> where available, instead of fsync(). The whole point of preallocating
> the WAL files is to make fdatasync safe, no?

> Don't we have to fsync the inode too? Actually, I was hoping sequential
> fsync's could sit on the WAL disk track, but I can imagine it has to
> seek around to hit both areas.

That's the point: we're trying to get things set up so that successive
writes/fsyncs in the WAL file do the minimum amount of seeking. The WAL
code tries to preallocate the whole log file (incorrectly, but that's
easily fixed, see below) so that we should not need to update the file
metadata when we write into the file.

> I don't have fdatasync() here. How does it compare to fsync().

HPUX's man page says

: fdatasync() causes all modified data and file attributes of fildes
: required to retrieve the data to be written to disk.

: fsync() causes all modified data and all file attributes of fildes
: (including access time, modification time and status change time) to
: be written to disk.

The implication is that the only thing you can lose after fdatasync is
the highly-inessential file mod time. However, I have been told that
on some implementations, fdatasync only flushes data blocks, and never
writes the inode or indirect blocks. That would mean that if you had
allocated new disk space to the file, fdatasync would not guarantee
that that allocation was reflected on disk. This is the reason for
preallocating the WAL log file (and doing a full fsync *at that time*).
Then you know the inode block pointers and indirect blocks are down
on disk, and so fdatasync is sufficient even if you have the cheesy
version of fdatasync.

Right now the WAL preallocation code (XLogFileInit) is not good enough
because it does lseek to the 16MB position and then writes 1 byte there.
On an implementation that supports holes in files (which is most Unixen)
that doesn't cause physical allocation of the intervening space. We'd
have to actually write zeroes into all 16MB to ensure the space is
allocated ... but that's just a couple more lines of code.
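
A sketch of the two preallocation strategies, for comparison. Both function names are hypothetical, the size is a parameter rather than the 16MB mentioned above, and the sparse version seeks to size - 1 so that both variants end up with the same file length:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Seek near the end and write one byte. On filesystems that support
 * holes, the intervening space is not physically allocated. */
int prealloc_sparse(int fd, off_t size)
{
    assert(size > 0);
    if (lseek(fd, size - 1, SEEK_SET) < 0)
        return -1;
    return write(fd, "", 1) == 1 ? 0 : -1;
}

/* Write zeroes over the whole file, then fsync() once so that the block
 * pointers and indirect blocks reach disk. Error handling is minimal:
 * an interrupted write() is simply reported as a failure. */
int prealloc_zero_fill(int fd, off_t size)
{
    char zeros[8192];
    off_t done = 0;

    assert(size > 0);
    memset(zeros, 0, sizeof zeros);
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    while (done < size) {
        size_t chunk = sizeof zeros;
        if ((off_t) chunk > size - done)
            chunk = (size_t) (size - done);
        ssize_t n = write(fd, zeros, chunk);
        if (n <= 0)
            return -1;
        done += n;
    }
    return fsync(fd);
}
```

Comparing st_blocks from fstat() (or the output of du) after each call is one way to check whether a given filesystem really allocated the space.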

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 20:45:30
Message-ID:	200102172045.PAA02841@candle.pha.pa.us

> Right now the WAL preallocation code (XLogFileInit) is not good enough
> because it does lseek to the 16MB position and then writes 1 byte there.
> On an implementation that supports holes in files (which is most Unixen)
> that doesn't cause physical allocation of the intervening space. We'd
> have to actually write zeroes into all 16MB to ensure the space is
> allocated ... but that's just a couple more lines of code.

Are OS's smart enough to not allocate zero-written blocks? Do we need
to write non-zeros?

From:	Larry Rosenman <ler(at)lerctr(dot)org>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 20:48:13
Message-ID:	20010217144813.B2135@lerami.lerctr.org

* Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> [010217 14:46]:
> > Right now the WAL preallocation code (XLogFileInit) is not good enough
> > because it does lseek to the 16MB position and then writes 1 byte there.
> > On an implementation that supports holes in files (which is most Unixen)
> > that doesn't cause physical allocation of the intervening space. We'd
> > have to actually write zeroes into all 16MB to ensure the space is
> > allocated ... but that's just a couple more lines of code.
>
> Are OS's smart enough to not allocate zero-written blocks? Do we need
> to write non-zeros?
I don't believe so. Writing zeros is valid.
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
> + If your life is a hard drive, | 830 Blythe Avenue
> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler(at)lerctr(dot)org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Larry Rosenman <ler(at)lerctr(dot)org>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 20:50:49
Message-ID:	200102172050.PAA03213@candle.pha.pa.us

> * Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> [010217 14:46]:
> > > Right now the WAL preallocation code (XLogFileInit) is not good enough
> > > because it does lseek to the 16MB position and then writes 1 byte there.
> > > On an implementation that supports holes in files (which is most Unixen)
> > > that doesn't cause physical allocation of the intervening space. We'd
> > > have to actually write zeroes into all 16MB to ensure the space is
> > > allocated ... but that's just a couple more lines of code.
> >
> > Are OS's smart enough to not allocate zero-written blocks? Do we need
> > to write non-zeros?
> I don't believe so. writing Zeros is valid.

The reason I ask is because I know you get zeros when trying to read
data from a file with holes, so it seems some OS could actually drop
those blocks from storage.

From:	Larry Rosenman <ler(at)lerctr(dot)org>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 20:52:20
Message-ID:	20010217145220.A2549@lerami.lerctr.org

* Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> [010217 14:50]:
> > * Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> [010217 14:46]:
> > > > Right now the WAL preallocation code (XLogFileInit) is not good enough
> > > > because it does lseek to the 16MB position and then writes 1 byte there.
> > > > On an implementation that supports holes in files (which is most Unixen)
> > > > that doesn't cause physical allocation of the intervening space. We'd
> > > > have to actually write zeroes into all 16MB to ensure the space is
> > > > allocated ... but that's just a couple more lines of code.
> > >
> > > Are OS's smart enough to not allocate zero-written blocks? Do we need
> > > to write non-zeros?
> > I don't believe so. writing Zeros is valid.
>
> The reason I ask is because I know you get zeros when trying to read
> data from a file with holes, so it seems some OS could actually drop
> those blocks from storage.
I've written swap files and such with:

dd if=/dev/zero of=SWAPFILE bs=512 count=204800

and all the blocks are allocated.

LER

>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
> + If your life is a hard drive, | 830 Blythe Avenue
> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler(at)lerctr(dot)org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Larry Rosenman <ler(at)lerctr(dot)org>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 22:56:19
Message-ID:	6402.982450579@sss.pgh.pa.us

Larry Rosenman <ler(at)lerctr(dot)org> writes:
> I've written swap files and such with:
> dd if=/dev/zero of=SWAPFILE bs=512 count=204800
> and all the blocks are allocated.

I've also confirmed that writing zeroes is sufficient on HPUX (du
shows that the correct amount of space is allocated, unlike the
current seek-to-the-end method).

Some poking around the net shows that pre-2.4 Linux kernels implement
fdatasync() as fsync(), and we already knew that BSD hasn't got it
at all. So distinguishing fdatasync from fsync won't be helpful for
very many people yet --- but I still think we should do it. I'm
playing with a test setup in which I just changed pg_fsync to call
fdatasync instead of fsync, and on HPUX I'm seeing pgbench tps values
around 17, as opposed to 13 yesterday. (The HPUX man page warns that
these calls are inefficient for large files, and I wouldn't be surprised
if a lot of the run time is now being spent in the kernel scanning
through all the buffers that belong to the logfile. 2.4 Linux is
apparently reasonably smart about this case, and only looks at the
actually dirty buffers.)

Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
with current sources, commit_delay=0, -B at least 1024, no -F, and see
how the results change when pg_fsync is made to call fdatasync instead
of fsync? (It's in src/backend/storage/file/fd.c)

regards, tom lane

From:	ncm(at)zembu(dot)com (Nathan Myers)
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 23:04:13
Message-ID:	20010217150413.A16600@store.zembu.com

On Sat, Feb 17, 2001 at 03:45:30PM -0500, Bruce Momjian wrote:
> > Right now the WAL preallocation code (XLogFileInit) is not good enough
> > because it does lseek to the 16MB position and then writes 1 byte there.
> > On an implementation that supports holes in files (which is most Unixen)
> > that doesn't cause physical allocation of the intervening space. We'd
> > have to actually write zeroes into all 16MB to ensure the space is
> > allocated ... but that's just a couple more lines of code.
>
> Are OS's smart enough to not allocate zero-written blocks?

No, but some disks are. Writing zeroes is a bit faster on smart disks.
This has no real implications for PG, but it is one of the reasons that
writing zeroes doesn't really wipe a disk, for forensic purposes.

Nathan Myers
ncm(at)zembu(dot)com

From:	"Dominic J(dot) Eidson" <sauron(at)the-infinite(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, vadim4o(at)email(dot)com
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 23:05:31
Message-ID:	Pine.LNX.4.21.0102171703510.19320-100000@morannon.the-infinite.org

On Sat, 17 Feb 2001, Tom Lane wrote:

> Another thing I am wondering about is why we're not using fdatasync(),
> where available, instead of fsync(). The whole point of preallocating
> the WAL files is to make fdatasync safe, no?

Linux/x86 fdatasync(2) manpage:

BUGS
Currently (Linux 2.0.23) fdatasync is equivalent to fsync.

--
Dominic J. Eidson
"Baruk Khazad! Khazad ai-menu!" - Gimli
-------------------------------------------------------------------------------
http://www.the-infinite.org/ http://www.the-infinite.org/~dominic/

From:	Brent Verner <brent(at)rcfile(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL and commit_delay
Date:	2001-02-17 23:30:12
Message-ID:	20010217183012.A24141@rcfile.org

On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:

[snipped]

| Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
| with current sources, commit_delay=0, -B at least 1024, no -F, and see
| how the results change when pg_fsync is made to call fdatasync instead
| of fsync? (It's in src/backend/storage/file/fd.c)

I've not run this requested test, but glibc-2.2 provides this bit
of code for fdatasync, so it /appears/ to me that kernel version
will not affect the test case.

[glibc-2.2/sysdeps/generic/fdatasync.c]

int
fdatasync (int fildes)
{
  return fsync (fildes);
}

hth.
brent

--
"We want to help, but we wouldn't want to deprive you of a valuable
learning experience."
http://openbsd.org/mail.html

From:	ncm(at)zembu(dot)com (Nathan Myers)
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-17 23:53:14
Message-ID:	20010217155314.C16600@store.zembu.com

On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
> On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
>
> [snipped]
>
> | Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
> | with current sources, commit_delay=0, -B at least 1024, no -F, and see
> | how the results change when pg_fsync is made to call fdatasync instead
> | of fsync? (It's in src/backend/storage/file/fd.c)
>
> I've not run this requested test, but glibc-2.2 provides this bit
> of code for fdatasync, so it /appears/ to me that kernel version
> will not affect the test case.
>
> [glibc-2.2/sysdeps/generic/fdatasync.c]
>
> int
> fdatasync (int fildes)
> {
> return fsync (fildes);
> }

In the 2.4 kernel it says (fs/buffer.c)

/* this needs further work, at the moment it is identical to fsync() */
down(&inode->i_sem);
err = file->f_op->fsync(file, dentry);
up(&inode->i_sem);

We can probably expect this to be fixed in an upcoming 2.4.x, i.e.
well before 2.6.

This is moot, though, if you're writing to a raw volume, which
you will be if you are really serious. Then, fsync really is
equivalent to fdatasync.

Nathan Myers
ncm(at)zembu(dot)com

From:	Brent Verner <brent(at)rcfile(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 00:10:09
Message-ID:	20010217191009.A900@rcfile.org

On 17 Feb 2001 at 15:53 (-0800), Nathan Myers wrote:
| On Sat, Feb 17, 2001 at 06:30:12PM -0500, Brent Verner wrote:
| > On 17 Feb 2001 at 17:56 (-0500), Tom Lane wrote:
| >
| > [snipped]
| >
| > | Is anyone out there running a 2.4 Linux kernel? Would you try pgbench
| > | with current sources, commit_delay=0, -B at least 1024, no -F, and see
| > | how the results change when pg_fsync is made to call fdatasync instead
| > | of fsync? (It's in src/backend/storage/file/fd.c)
| >
| > I've not run this requested test, but glibc-2.2 provides this bit
| > of code for fdatasync, so it /appears/ to me that kernel version
| > will not affect the test case.
| >
| > [glibc-2.2/sysdeps/generic/fdatasync.c]
| >
| > int
| > fdatasync (int fildes)
| > {
| > return fsync (fildes);
| > }
|
| In the 2.4 kernel it says (fs/buffer.c)
|
| /* this needs further work, at the moment it is identical to fsync() */
| down(&inode->i_sem);
| err = file->f_op->fsync(file, dentry);
| up(&inode->i_sem);
|
| We can probably expect this to be fixed in an upcoming 2.4.x, i.e.
| well before 2.6.

2.4.0-ac11 already has provisions for fdatasync

[fs/buffer.c]

asmlinkage long sys_fsync(unsigned int fd)
{
...
    down(&inode->i_sem);
    filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 0);
    filemap_fdatawait(inode->i_mapping);
    up(&inode->i_sem);

asmlinkage long sys_fdatasync(unsigned int fd)
{
...
    down(&inode->i_sem);
    filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 1);
    filemap_fdatawait(inode->i_mapping);
    up(&inode->i_sem);

ext2 does use this third param of its fsync() operation to (potentially)
bypass a call to ext2_sync_inode(inode)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 00:34:22
Message-ID:	6897.982456462@sss.pgh.pa.us

ncm(at)zembu(dot)com (Nathan Myers) writes:
> In the 2.4 kernel it says (fs/buffer.c)

> /* this needs further work, at the moment it is identical to fsync() */
> down(&inode->i_sem);
> err = file->f_op->fsync(file, dentry);
> up(&inode->i_sem);

Hmm, that's the same code that's been there since 2.0 or before.
I had trawled the Linux kernel mail lists and found patch submissions
from several different people to make fdatasync really work, and what
I thought was an indication that at least one had been applied.
Evidently not. Oh well...

regards, tom lane

From:	ncm(at)zembu(dot)com (Nathan Myers)
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 02:13:19
Message-ID:	20010217181319.A16736@store.zembu.com

On Sat, Feb 17, 2001 at 07:34:22PM -0500, Tom Lane wrote:
> ncm(at)zembu(dot)com (Nathan Myers) writes:
> > In the 2.4 kernel it says (fs/buffer.c)
>
> > /* this needs further work, at the moment it is identical to fsync() */
> > down(&inode->i_sem);
> > err = file->f_op->fsync(file, dentry);
> > up(&inode->i_sem);
>
> Hmm, that's the same code that's been there since 2.0 or before.

Indeed. All xterms look alike, and I used one connected to the wrong box.
Here's what's in 2.4.0:

For fsync:

filemap_fdatasync(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 0);
filemap_fdatawait(inode->i_mapping);

and for fdatasync:

filemap_fdatasync(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 1);
filemap_fdatawait(inode->i_mapping);

(Notice the "1" vs. "0" difference?) So the actual file system
(ext2fs, reiserfs, etc.) has the option of equating the two, or not.
In fs/ext2/fsync.c, we have

int ext2_fsync_inode(struct inode *inode, int datasync)
{
    int err;
    err = fsync_inode_buffers(inode);
    if (!(inode->i_state & I_DIRTY))
        return err;
    if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
        return err;
    err |= ext2_sync_inode(inode);
    return err ? -EIO : 0;
}

I.e. yes, Linux 2.4.0 and ext2 do implement the distinction.
Sorry for the misinformation.
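
The two early returns in that function boil down to a small decision table. A standalone sketch, assuming simplified, illustrative flag values (the SK_ names are made up here and the real kernel constants differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; the real ones live in the kernel headers. */
enum {
    SK_DIRTY_SYNC     = 1,  /* only timestamps and similar fields changed */
    SK_DIRTY_DATASYNC = 2,  /* metadata needed to find the data changed */
    SK_DIRTY          = SK_DIRTY_SYNC | SK_DIRTY_DATASYNC
};

/* Mirrors the two early returns in the quoted function: true means the
 * inode itself would be written out, false means it would be skipped. */
bool inode_write_needed(unsigned state, bool datasync)
{
    assert((state & ~(unsigned) SK_DIRTY) == 0);

    if (!(state & SK_DIRTY))
        return false;   /* inode is clean, nothing to write */
    if (datasync && !(state & SK_DIRTY_DATASYNC))
        return false;   /* fdatasync may skip a timestamp-only change */
    return true;
}
```

In this model the fdatasync path skips the inode write only when nothing but timestamp-style fields are dirty, which is the situation a preallocated WAL file is meant to stay in.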

Nathan Myers
ncm(at)zembu(dot)com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 03:45:18
Message-ID:	8222.982467918@sss.pgh.pa.us

ncm(at)zembu(dot)com (Nathan Myers) writes:
> I.e. yes, Linux 2.4.0 and ext2 do implement the distinction.
> Sorry for the misinformation.

Okay ... meanwhile I've got to report the reverse: I've just confirmed
that on HPUX 10.20, there is *not* a distinction between fsync and
fdatasync. I was misled by what was apparently an outlier result on my
first try with fdatasync plugged in ... but when I couldn't reproduce
that, some digging led to the fact that the fsync and fdatasync symbols
in libc are at the same place :-(.

Still, using fdatasync for the WAL file seems like a forward-looking
thing to do, and it'll just take another couple of lines of configure
code, so I'll go ahead and plug it in.
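
A sketch of what such a wrapper could look like. This is an assumption, not the actual patch: the name wal_sync is hypothetical, and the standard _POSIX_SYNCHRONIZED_IO feature macro stands in for whatever symbol a configure test would define:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: use fdatasync() where the platform advertises it,
 * otherwise fall back to fsync(). A real configure test would probe for
 * the function directly instead of relying on this macro.
 */
int wal_sync(int fd)
{
    assert(fd >= 0);
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    return fdatasync(fd);
#else
    return fsync(fd);
#endif
}
```

Whether the fdatasync() branch is actually cheaper depends on the kernel and filesystem, as the HPUX and pre-2.4 Linux results in this thread show.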

regards, tom lane

From:	Adriaan Joubert <a(dot)joubert(at)albourne(dot)com>
To:
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WAL and commit_delay
Date:	2001-02-18 09:50:14
Message-ID:	3A8F9AD6.5700C2ED@albourne.com

fdatasync() is available on Tru64 and according to the man-page behaves
as Tom expects. So it should be a win for us. What do other commercial
unixes say?

Adriaan

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Adriaan Joubert <a(dot)joubert(at)albourne(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 16:51:50
Message-ID:	12206.982515110@sss.pgh.pa.us

Adriaan Joubert <a(dot)joubert(at)albourne(dot)com> writes:
> fdatasync() is available on Tru64 and according to the man-page behaves
> as Tom expects. So it should be a win for us.

Careful ... HPUX's man page also claims that fdatasync does something
useful, but it doesn't. I'd recommend an experiment. Does today's
snapshot run any faster for you (without -F) than before?

regards, tom lane

From:	Larry Rosenman <ler(at)lerctr(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Adriaan Joubert <a(dot)joubert(at)albourne(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 16:56:10
Message-ID:	20010218105610.B23980@lerami.lerctr.org

* Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> [010218 10:53]:
> Adriaan Joubert <a(dot)joubert(at)albourne(dot)com> writes:
> > fdatasync() is available on Tru64 and according to the man-page behaves
> > as Tom expects. So it should be a win for us.
>
> Careful ... HPUX's man page also claims that fdatasync does something
> useful, but it doesn't. I'd recommend an experiment. Does today's
> snapshot run any faster for you (without -F) than before?
BTW, UnixWare 7.1.1 does *NOT* have fdatasync. What standard created
this one?

>
> regards, tom lane
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler(at)lerctr(dot)org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

From:	Jerome Vouillon <vouillon(at)saul(dot)cis(dot)upenn(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL and commit_delay
Date:	2001-02-18 16:59:24
Message-ID:	d3z1ysw0vwz.fsf@saul.cis.upenn.edu

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The implication is that the only thing you can lose after fdatasync is
> the highly-inessential file mod time. However, I have been told that
> on some implementations, fdatasync only flushes data blocks, and never
> writes the inode or indirect blocks. That would mean that if you had
> allocated new disk space to the file, fdatasync would not guarantee
> that that allocation was reflected on disk. This is the reason for
> preallocating the WAL log file (and doing a full fsync *at that time*).
> Then you know the inode block pointers and indirect blocks are down
> on disk, and so fdatasync is sufficient even if you have the cheesy
> version of fdatasync.

Actually, there is also a performance reason. Indeed, fdatasync would
not perform any better than fsync if the log file was not
preallocated: the file length would change each time a record is
appended, and therefore the inode would have to be updated.
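
The effect is easy to observe from user space. A small sketch with a hypothetical helper name, which reports whether a write changed the file length and therefore dirtied the size field in the inode:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes at the given offset and report whether the file length
 * changed. Returns 1 if the size changed, 0 if not, -1 on error.
 */
int write_changes_size(int fd, off_t offset, const char *buf, size_t len)
{
    struct stat before, after;

    assert(buf != NULL);
    if (fstat(fd, &before) != 0)
        return -1;
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;
    if (fstat(fd, &after) != 0)
        return -1;
    return after.st_size != before.st_size;
}
```

Appending to an empty file returns 1, while writing the same record into a file that was already extended to its full length returns 0.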

-- Jerome

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Larry Rosenman <ler(at)lerctr(dot)org>
Cc:	Adriaan Joubert <a(dot)joubert(at)albourne(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 17:01:25
Message-ID:	12285.982515685@sss.pgh.pa.us
Lists:	pgsql-hackers

Larry Rosenman <ler(at)lerctr(dot)org> writes:
> BTW, UnixWare 7.1.1 does *NOT* have fdatasync. What standard created
> this one?

HP's manpage quoth:

STANDARDS CONFORMANCE
fsync(): AES, SVID3, XPG3, XPG4, POSIX.4
fdatasync(): POSIX.4

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jerome Vouillon <vouillon(at)saul(dot)cis(dot)upenn(dot)edu>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL and commit_delay
Date:	2001-02-18 17:12:46
Message-ID:	12319.982516366@sss.pgh.pa.us
Lists:	pgsql-hackers

Jerome Vouillon <vouillon(at)saul(dot)cis(dot)upenn(dot)edu> writes:
> Actually, there is also a performance reason. Indeed, fdatasync would
> not perform any better than fsync if the log file was not
> preallocated: the file length would change each time a record is
> appended, and therefore the inode would have to be updated.

Good point, but seeking to the 16-meg position and writing one byte was
already sufficient to take care of that issue.

I think that there may be a performance advantage to pre-filling the
logfile even so, assuming that file allocation info is stored in a
Berkeley/McKusick-like fashion (note: I have no idea what ext2 or
reiserfs actually do). Namely, we'll only sync the file's indirect
blocks once, in the fsync() at the end of XLogFileInit. A correct
fdatasync implementation would have to sync the last indirect block each
time a new filesystem block is added to the logfile, so it would end up
doing a lot of seeks for that purpose even if it rarely touches the
inode itself. Another point is that if the logfile is pre-filled over a
short interval, its blocks are more likely to be allocated close to each
other than if it grows to full size over a longer interval. Not much
point in avoiding seeks outside the file data if the file data itself
is scattered all over the place :-(.

Basically we're trading more work in XLogFileInit (which we hope is not
time-critical) for less work in typical transaction commits.
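
The two initialisation strategies compared above can be sketched as follows. This is an illustration under stated assumptions, not the XLogFileInit code: the function names, the 8192-byte write unit, and the scratch-directory helper are invented here; only the 16 MB size comes from the message.

```python
import os
import tempfile

SEGMENT_SIZE = 16 * 1024 * 1024  # the "16-meg" figure from the message
BLOCK = b"\0" * 8192             # assumed write unit, chosen for illustration


def init_sparse(path, size=SEGMENT_SIZE):
    """Set the final length only: seek to the last byte and write it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.lseek(fd, size - 1, os.SEEK_SET)
        os.write(fd, b"\0")
        os.fsync(fd)
    finally:
        os.close(fd)


def init_prefilled(path, size=SEGMENT_SIZE):
    """Write zeros over the whole file so every block is allocated up front."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        remaining = size
        while remaining > 0:
            remaining -= os.write(fd, BLOCK[:remaining])
        # One full fsync here pushes data, inode and indirect blocks to disk.
        os.fsync(fd)
    finally:
        os.close(fd)


def compare(size=SEGMENT_SIZE):
    """Create both variants in a scratch directory; return (st_size, st_blocks) for each."""
    with tempfile.TemporaryDirectory() as d:
        sparse = os.path.join(d, "sparse")
        filled = os.path.join(d, "filled")
        init_sparse(sparse, size)
        init_prefilled(filled, size)
        s, f = os.stat(sparse), os.stat(filled)
        return (s.st_size, s.st_blocks), (f.st_size, f.st_blocks)
```

Both files report the same length. On filesystems that support holes, the sparse file will usually show far fewer allocated blocks, which means later writes still have to allocate blocks and update allocation metadata; the exact numbers depend on the filesystem.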

regards, tom lane

From:	ncm(at)zembu(dot)com (Nathan Myers)
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-18 20:08:02
Message-ID:	20010218120802.A31227@store.zembu.com
Lists:	pgsql-hackers

On Sun, Feb 18, 2001 at 11:51:50AM -0500, Tom Lane wrote:
> Adriaan Joubert <a(dot)joubert(at)albourne(dot)com> writes:
> > fdatasync() is available on Tru64 and according to the man-page behaves
> > as Tom expects. So it should be a win for us.
>
> Careful ... HPUX's man page also claims that fdatasync does something
> useful, but it doesn't. I'd recommend an experiment. Does today's
> snapshot run any faster for you (without -F) than before?

It's worth noting in documentation that systems that don't have
fdatasync(), or that have the phony implementation, can get the same
benefit by using a raw volume (partition) for the log file. This
applies even on Linux 2.0 and 2.2 without the "raw-i/o" patch. Using
raw volumes would have other performance benefits, even on systems
that do fully support fdatasync, through bypassing the buffer cache.

(The above assumes I understood correctly Vadim's postings about
changes he made to support putting logs on raw volumes.)

Nathan Myers
ncm(at)zembu(dot)com

From:	Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL and commit_delay
Date:	2001-02-19 13:29:00
Message-ID:	Pine.LNX.4.10.10102191239300.9444-300000@sphinx.mythic-beasts.com
Lists:	pgsql-hackers

On Sun, 18 Feb 2001, Tom Lane wrote:

> I think that there may be a performance advantage to pre-filling the
> logfile even so, assuming that file allocation info is stored in a
> Berkeley/McKusick-like fashion (note: I have no idea what ext2 or
> reiserfs actually do).

ext2 is a lot like [UF]FS. reiserfs is very different, but does
have similar hole semantics.

BTW, I have attached two patches which streamline log initialisation
a little. The first (xlog-sendfile.diff) adds support for Linux's
sendfile system call. FreeBSD and HP/UX have sendfile() too, but the
prototype is different. If it's interesting, someone will have to
come up with a configure test, as autoconf scares me.

The second removes a further three syscalls from the log init path.
There are a couple of things to note here:
 * I don't know why link/unlink is currently preferred over
   rename. POSIX offers strong guarantees on the semantics
   of the latter.
 * I have assumed that the close/rename/reopen stuff is only
   there for the benefit of Windows users, and ifdeffed it
   for everyone else.

Matthew.

Attachment	Content-Type	Size
xlog-sendfile.diff	text/plain	1.2 KB
xlog-streamline.diff	text/plain	645 bytes

From:	Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL and commit_delay
Date:	2001-02-19 14:06:42
Message-ID:	Pine.LNX.4.10.10102191404170.11164-100000@sphinx.mythic-beasts.com
Lists:	pgsql-hackers

On Mon, 19 Feb 2001, Matthew Kirkwood wrote:

> BTW, I have attached two patches which streamline log initialisation
> a little. The first (xlog-sendfile.diff) adds support for Linux's
> sendfile system call.

Whoops, don't use this. It looks like Linux won't sendfile()
from /dev/zero. I'll endeavour to get this fixed, but it
looks like it'll be rather harder to use sendfile for this.

Bah.

Matthew.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WAL and commit_delay
Date:	2001-02-19 15:48:55
Message-ID:	24974.982597735@sss.pgh.pa.us
Lists:	pgsql-hackers

Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org> writes:
> BTW, I have attached two patches which streamline log initialisation
> a little. The first (xlog-sendfile.diff) adds support for Linux's
> sendfile system call. FreeBSD and HP/UX have sendfile() too, but the
> prototype is different. If it's interesting, someone will have to
> come up with a configure test, as autoconf scares me.

I think we don't want to mess with something as unportable as that
at this late stage of the release cycle (quite aside from your later
note that it doesn't work ;-)).

> The second removes a further three syscalls from the log init path.
> There are a couple of things to note here:
> * I don't know why link/unlink is currently preferred over
>   rename. POSIX offers strong guarantees on the semantics
>   of the latter.
> * I have assumed that the close/rename/reopen stuff is only
>   there for the benefit of Windows users, and ifdeffed it
>   for everyone else.

The reason for avoiding rename() is that the POSIX guarantees are
the wrong ones: specifically, rename promises to overwrite an existing
destination, which is exactly what we *don't* want. In theory two
backends cannot be executing this code in parallel, but if they were,
we would not want to destroy a logfile that perhaps already contains
WAL entries by the time we finish preparing our own logfile. link()
will fail if the destination name exists, which is a lot safer.
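
The difference in semantics can be shown with a small sketch. It is not the code in xlog.c; the helper names and file contents are assumptions for illustration, and it relies on POSIX behaviour (rename() on Windows behaves differently).

```python
import os
import tempfile


def install_with_link(tmp_path, final_path):
    """Give tmp_path the name final_path; refuse if final_path already exists."""
    try:
        os.link(tmp_path, final_path)  # fails if final_path exists
    except FileExistsError:
        return False
    os.unlink(tmp_path)                # drop the temporary name
    return True


def install_with_rename(tmp_path, final_path):
    """Same goal via rename(); on POSIX this silently replaces final_path."""
    os.rename(tmp_path, final_path)
    return True


def demo():
    """Try both approaches against a destination that already holds data."""
    with tempfile.TemporaryDirectory() as d:
        final = os.path.join(d, "segment")
        with open(final, "wb") as f:
            f.write(b"existing WAL entries")

        tmp = os.path.join(d, "tmp1")
        with open(tmp, "wb") as f:
            f.write(b"\0" * 16)

        linked = install_with_link(tmp, final)
        with open(final, "rb") as f:
            after_link = f.read()

        install_with_rename(tmp, final)
        with open(final, "rb") as f:
            after_rename = f.read()
        return linked, after_link, after_rename
```

In demo(), the link attempt is refused and the existing contents survive, whereas the rename replaces them with the freshly zeroed file.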

I'm not sure about the close/reopen stuff; I agree it looks unnecessary.
But this function is going to be so I/O bound (particularly now that
it fills the file) that two more kernel calls are insignificant.

regards, tom lane

From:	Jan Wieck <janwieck(at)Yahoo(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Adriaan Joubert <a(dot)joubert(at)albourne(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: WAL and commit_delay
Date:	2001-02-19 19:47:39
Message-ID:	200102191947.OAA01924@jupiter.jw.home
Lists:	pgsql-hackers

Tom Lane wrote:
> Adriaan Joubert <a(dot)joubert(at)albourne(dot)com> writes:
> > fdatasync() is available on Tru64 and according to the man-page behaves
> > as Tom expects. So it should be a win for us.
>
> Careful ... HPUX's man page also claims that fdatasync does something
> useful, but it doesn't. I'd recommend an experiment. Does today's
> snapshot run any faster for you (without -F) than before?

IIRC your HPUX manpage states that fdatasync() updates only the
information required to find the data again. It sounded to me
as if HPUX distinguishes between irrelevant inode info (like
modtime) and important things (like blocks).

But maybe I'm confused by HP and they can still tell me an X
for a U.

Jan

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #
