Re: sync_file_range()

Lists: pgsql-hackers
From: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
To: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: sync_file_range()
Date: 2006-06-19 05:46:30
Message-ID: 44963A36.4090606@calorieking.com

http://lwn.net/Articles/178199/

Check out the article on sync_file_range():

----
long sync_file_range(int fd, loff_t offset, loff_t nbytes, int flags);

This call will synchronize a file's data to disk, starting at the given
offset and proceeding for nbytes bytes (or to the end of the file if
nbytes is zero). How the synchronization is done is controlled by flags:

* SYNC_FILE_RANGE_WAIT_BEFORE blocks the calling process until any
already in-progress writeout of pages (in the given range) completes.

* SYNC_FILE_RANGE_WRITE starts writeout of any dirty pages in the
given range which are not already under I/O.

* SYNC_FILE_RANGE_WAIT_AFTER blocks the calling process until the
newly-initiated writes complete.

An application which wants to initiate writeback of all dirty pages
should provide the first two flags. Providing all three flags guarantees
that those pages are actually on disk when the call returns.
----
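
A minimal sketch of how the flag combinations described above might be
used from C on Linux with glibc. The helper names and the demo function
are hypothetical, not part of any existing API. As far as I understand
the man page, the call covers file data only, not metadata, so it is not
a drop-in replacement for fsync().

```c
#define _GNU_SOURCE             /* glibc exposes sync_file_range() only with this */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* First two flags: wait for writeout already in flight, then start writeout
 * of the remaining dirty pages. Does not wait for the newly started I/O. */
int
start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE);
}

/* All three flags: additionally block until the newly started writes finish. */
int
write_range_and_wait(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Hypothetical demo: dirty 8 kB of a temp file, then sync it both ways.
 * Returns 0 on success, -1 on any failure. */
int
demo_sync_range(void)
{
    char path[] = "/tmp/sfr_demo_XXXXXX";
    char page[8192] = {0};
    int  fd = mkstemp(path);
    int  rc = -1;

    if (fd < 0)
        return -1;
    if (write(fd, page, sizeof page) == (ssize_t) sizeof page &&
        start_range_writeback(fd, 0, 0) == 0 &&     /* nbytes 0 = to end of file */
        write_range_and_wait(fd, 0, (off_t) sizeof page) == 0)
        rc = 0;
    close(fd);
    unlink(path);
    return rc;
}
```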

Is that at all useful for PostgreSQL's purposes?

Chris


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sync_file_range()
Date: 2006-06-19 05:56:11
Message-ID: 20060619144521.9EAB.ITAGAKI.TAKAHIRO@oss.ntt.co.jp

Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com> wrote:

> http://lwn.net/Articles/178199/
> Check out the article on sync_file_range():

> Is that at all useful for PostgreSQL's purposes?

I'm interested in it; with it we could improve responsiveness during
checkpoints. Although it is a Linux-specific system call, we could use
the combination of mmap() and msync() instead; I mean we can use
mmap only to flush dirty pages, not to read or write pages.

---
ITAGAKI Takahiro
NTT Open Source Software Center


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-19 07:32:40
Message-ID: e75juu$1ou4$1@news.hub.org


"ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote
>
>
> I'm interested in it; with it we could improve responsiveness during
> checkpoints. Although it is a Linux-specific system call, we could use
> the combination of mmap() and msync() instead; I mean we can use
> mmap only to flush dirty pages, not to read or write pages.
>

Can you give details? As the TODO item indicates, if we mmap a data file, a
serious problem is that we don't know when the data pages hit the disk,
so we may violate the WAL rule.

Regards,
Qingqing


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-19 10:33:44
Message-ID: 20060619184910.9EB3.ITAGAKI.TAKAHIRO@oss.ntt.co.jp

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> wrote:

> > I'm interested in it; with it we could improve responsiveness during
> > checkpoints. Although it is a Linux-specific system call, we could use
> > the combination of mmap() and msync() instead; I mean we can use
> > mmap only to flush dirty pages, not to read or write pages.
>
> Can you give details? As the TODO item indicates, if we mmap a data file, a
> serious problem is that we don't know when the data pages hit the disk,
> so we may violate the WAL rule.

I'm thinking about fuzzy checkpoints, where we write and flush buffers
only as needed. sync_file_range() would then let us control buffer
flushing at a finer granularity. We could stretch out a checkpoint to
avoid overloading storage in a burst, using sync_file_range() and a
cost-based delay, like vacuum.
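
One way the stretched-out flush could look, as a rough sketch only: start
writeback one chunk at a time and sleep between chunks. The function
name, the chunk size, and the fixed delay are all assumptions made for
illustration; a real cost-based delay would adapt to load, and a final
fsync() would still be needed before declaring the checkpoint complete.

```c
#define _GNU_SOURCE             /* for sync_file_range() in glibc */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: start writeback of [0, file_size) in chunks of
 * chunk_bytes, sleeping delay_usec microseconds between chunks.
 * Returns the number of chunks submitted, or -1 on error. */
int
flush_file_throttled(int fd, off_t file_size, off_t chunk_bytes,
                     unsigned int delay_usec)
{
    off_t offset;
    int   chunks = 0;

    if (chunk_bytes <= 0)
        return -1;

    for (offset = 0; offset < file_size; offset += chunk_bytes)
    {
        off_t len = file_size - offset;

        if (len > chunk_bytes)
            len = chunk_bytes;

        /* Only initiates writeback; it does not wait for completion. */
        if (sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE) != 0)
            return -1;
        chunks++;

        /* Spread the I/O out over time, except after the last chunk. */
        if (offset + len < file_size)
            usleep(delay_usec);
    }
    return chunks;
}
```

For example, flush_file_throttled(fd, size, 1024 * 1024, 10000) would
submit 1 MB at a time with a 10 ms pause in between.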

I did not mean to modify buffers through mmap; I only meant something
like the following pseudo-code. (I don't know whether it actually works...)

my_sync_file_range(fd, offset, nbytes, ...)
{
    void *p = mmap(NULL, nbytes, ..., fd, offset);
    msync(p, nbytes, MS_ASYNC);
    munmap(p, nbytes);
}
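
A compilable sketch of the pseudo-code above, filling in the elided
arguments with assumptions: a read-only shared mapping (PROT_READ,
MAP_SHARED) and an offset rounded down to a page boundary, since mmap()
requires page-aligned offsets. Whether msync() on such a mapping really
starts writeback of pages dirtied through write() is platform-dependent;
the Linux msync(2) man page describes MS_ASYNC as a no-op since 2.6.19.
The demo function is hypothetical.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the requested range, ask for an asynchronous flush, and unmap.
 * fd must be open for reading. Returns 0 on success, -1 on failure. */
int
my_sync_file_range(int fd, off_t offset, size_t nbytes)
{
    long   pagesize = sysconf(_SC_PAGESIZE);
    off_t  aligned;
    size_t maplen;
    void  *p;
    int    rc;

    if (pagesize <= 0 || offset < 0 || nbytes == 0)
        return -1;

    aligned = offset - (offset % pagesize);     /* round down to a page boundary */
    maplen = nbytes + (size_t) (offset - aligned);

    p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, aligned);
    if (p == MAP_FAILED)
        return -1;

    rc = msync(p, maplen, MS_ASYNC);            /* the pages are never touched */
    if (munmap(p, maplen) != 0)
        rc = -1;
    return rc;
}

/* Hypothetical demo: dirty a temp file via write(), then request a flush
 * of an unaligned sub-range. Returns 0 on success, -1 on failure. */
int
demo_msync_flush(void)
{
    char path[] = "/tmp/msync_demo_XXXXXX";
    char buf[10000] = {0};
    int  fd = mkstemp(path);
    int  rc = -1;

    if (fd < 0)
        return -1;
    if (write(fd, buf, sizeof buf) == (ssize_t) sizeof buf)
        rc = my_sync_file_range(fd, 100, 5000);
    close(fd);
    unlink(path);
    return rc;
}
```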

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-19 11:29:15
Message-ID: 1150716555.2691.1132.camel@localhost.localdomain

On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote:
> "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote
> >
> >
> > I'm interested in it; with it we could improve responsiveness during
> > checkpoints. Although it is a Linux-specific system call, we could use
> > the combination of mmap() and msync() instead; I mean we can use
> > mmap only to flush dirty pages, not to read or write pages.
> >
>
> Can you give details? As the TODO item indicates, if we mmap a data file, a
> serious problem is that we don't know when the data pages hit the disk,
> so we may violate the WAL rule.

Can't see where we'd use it.

We fsync the xlog at transaction commit, so only the leading edge needs
to be synced - would the call help there? Presumably the OS can already
locate all blocks associated with a particular file fairly quickly
without doing a full cache scan.

Other files are fsynced at checkpoint - always all dirty blocks in the
whole file.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-19 16:47:00
Message-ID: 87u06hxfcr.fsf@mid.deneb.enyo.de

* Simon Riggs:

> Other files are fsynced at checkpoint - always all dirty blocks in the
> whole file.

sync_file_range() can optionally return without blocking the calling
process, so it's very easy to start flushing all files at once, which
could in theory reduce seek overhead.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-19 19:04:39
Message-ID: 8764ix55mg.fsf@stark.xeocode.com

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

> On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote:
> > "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote
> > >
> > >
> > > I'm interested in it; with it we could improve responsiveness during
> > > checkpoints. Although it is a Linux-specific system call, we could use
> > > the combination of mmap() and msync() instead; I mean we can use
> > > mmap only to flush dirty pages, not to read or write pages.
> > >
> >
> > Can you give details? As the TODO item indicates, if we mmap a data file, a
> > serious problem is that we don't know when the data pages hit the disk,
> > so we may violate the WAL rule.
>
> Can't see where we'd use it.
>
> We fsync the xlog at transaction commit, so only the leading edge needs
> to be synced - would the call help there? Presumably the OS can already
> locate all blocks associated with a particular file fairly quickly
> without doing a full cache scan.

Well, in theory the transaction being committed isn't necessarily the "leading
edge"; there could be more work from other transactions since the last work
this transaction actually did. However, I can't see that actually helping
performance much, if at all. There can't be much such data, and when writing
it out it doesn't really matter how much is written -- what really matters is
rotational and seek latency anyway.

> Other files are fsynced at checkpoint - always all dirty blocks in the
> whole file.

Well, couldn't it be useful for checkpoints if there was some way to know
which buffers had been touched since the last checkpoint? There could be a lot
of buffers dirtied since the checkpoint began, and those don't really need to
be synced, do they?

Or it could be used to control the rate at which the files are checkpointed.

Come to think of it I wonder whether there's anything to be gained by using
smaller files for tables. Instead of 1G files maybe 256M files or something
like that to reduce the hit of fsyncing a file.

--
greg


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-19 19:53:32
Message-ID: 1150746813.2587.98.camel@localhost.localdomain

On Mon, 2006-06-19 at 15:04 -0400, Greg Stark wrote:

> > We fsync the xlog at transaction commit, so only the leading edge needs
> > to be synced - would the call help there? Presumably the OS can already
> > locate all blocks associated with a particular file fairly quickly
> > without doing a full cache scan.
>
> Well in theory the transaction being committed isn't necessarily the "leading
> edge", there could be more work from other transactions since the last work
> this transaction actually did.

Near enough.

> > Other files are fsynced at checkpoint - always all dirty blocks in the
> > whole file.
>
> Well couldn't it be useful for checkpoints if there was some way to know
> which buffers had been touched since the last checkpoint? There could be a lot
> of buffers dirtied since the checkpoint began and those don't really need to
> be synced do they?

Qingqing had a proposal for something like that, but it seemed not worth
it after analysis.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-20 01:35:30
Message-ID: 23179.1150767330@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Come to think of it I wonder whether there's anything to be gained by using
> smaller files for tables. Instead of 1G files maybe 256M files or something
> like that to reduce the hit of fsyncing a file.

Actually probably not. The weak part of our current approach is that we
tell the kernel "sync this file", then "sync that file", etc, in a more
or less random order. This leads to a probably non-optimal sequence of
disk accesses to complete a checkpoint. What we would really like is a
way to tell the kernel "sync all these files, and let me know when
you're done" --- then the kernel and hardware have some shot at
scheduling all the writes in an intelligent fashion.

sync_file_range() is not that exactly, but since it lets you request
syncing and then go back and wait for the syncs later, we could get the
desired effect with two passes over the file list. (If the file list
is longer than our allowed number of open files, though, the extra
opens/closes could hurt.)
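
A rough sketch of the two-pass idea, assuming the files are already open
and the descriptors are collected in an array. The function name is
hypothetical. Using fsync() for the waiting pass is my assumption rather
than something stated above: it waits for the writeback started in pass
one and also covers metadata, which sync_file_range() does not.

```c
#define _GNU_SOURCE             /* for sync_file_range() in glibc */
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Pass 1 asks the kernel to start writing every file without waiting, so
 * the I/O scheduler sees all the pending writes at once. Pass 2 goes back
 * and waits for each file. Returns 0 if every call succeeded, else -1. */
int
sync_files_two_pass(const int *fds, size_t nfds)
{
    size_t i;
    int    rc = 0;

    /* Pass 1: offset 0 with nbytes 0 means the whole file. */
    for (i = 0; i < nfds; i++)
        if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
            rc = -1;

    /* Pass 2: wait for completion; most data should already be in flight. */
    for (i = 0; i < nfds; i++)
        if (fsync(fds[i]) != 0)
            rc = -1;

    return rc;
}
```

If the list is longer than the open-file limit, the descriptors would
have to be closed and reopened between the passes, which is the extra
cost mentioned above.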

Smaller files would make the I/O scheduling problem worse not better.
Indeed, I've been wondering lately if we shouldn't resurrect
LET_OS_MANAGE_FILESIZE and make that the default on systems with
largefile support. If nothing else it would cut down on open/close
overhead on very large relations.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-20 08:44:52
Message-ID: 1150793092.2587.134.camel@localhost.localdomain

On Mon, 2006-06-19 at 21:35 -0400, Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Come to think of it I wonder whether there's anything to be gained by using
> > smaller files for tables. Instead of 1G files maybe 256M files or something
> > like that to reduce the hit of fsyncing a file.

> sync_file_range() is not that exactly, but since it lets you request
> syncing and then go back and wait for the syncs later, we could get the
> desired effect with two passes over the file list. (If the file list
> is longer than our allowed number of open files, though, the extra
> opens/closes could hurt.)

So we would use the async properties of sync, but not the file range
support? Sounds like it could help with multiple filesystems.

> Indeed, I've been wondering lately if we shouldn't resurrect
> LET_OS_MANAGE_FILESIZE and make that the default on systems with
> largefile support. If nothing else it would cut down on open/close
> overhead on very large relations.

Agreed.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync_file_range()
Date: 2006-06-20 13:52:24
Message-ID: 27505.1150811544@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> So we would use the async properties of sync, but not the file range
> support?

That's the part of it that looked potentially useful to me, anyway.
I don't see any value for us in syncing just part of a file, because
we don't have enough disk layout knowledge to make intelligent choices
of what to sync. I think the OP had some idea of having the bgwriter
write and then force-sync individual pages, but what good is that?
Once we've done the write() the page is exposed to the kernel's write
scheduler and should be written at an intelligent time. Trying to
force sync in advance of our own real need for it to be synced (ie
the next checkpoint) doesn't seem to me to offer any benefit.

regards, tom lane