Quick Links

Re: initdb and fsync

Lists:	pgsql-hackers

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	initdb and fsync
Date:	2012-01-28 00:19:41
Message-ID:	1327709981.9123.11.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

It looks like initdb doesn't fsync all the files it creates, e.g. the
PG_VERSION file.

While it's unlikely that it would cause any real data loss, it can be
inconvenient in some testing scenarios involving VMs.

Thoughts? Would a patch to add a few fsync calls to initdb be accepted?
Is a platform-independent fsync be available at initdb time?

Regards,
Jeff Davis

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 04:52:19
Message-ID:	20120128045219.GB29174@tornado.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 27, 2012 at 04:19:41PM -0800, Jeff Davis wrote:
> It looks like initdb doesn't fsync all the files it creates, e.g. the
> PG_VERSION file.
>
> While it's unlikely that it would cause any real data loss, it can be
> inconvenient in some testing scenarios involving VMs.
>
> Thoughts? Would a patch to add a few fsync calls to initdb be accepted?

+1. If I'm piloting "strace -f" right, initdb currently issues *no* syncs.

We'd probably, then, want a way to re-disable the fsyncs for hacker benefit.

> Is a platform-independent fsync be available at initdb time?

Not sure.

Thanks,
nm

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 15:31:57
Message-ID:	4F2414ED.40004@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/27/2012 11:52 PM, Noah Misch wrote:
>> Is a platform-independent fsync be available at initdb time?
> Not sure.
>

It's a macro on Windows that calls _commit(fd), so it should be portable
enough.

I'm curious what problem we're actually solving here, though. I've run
the buildfarm countless thousands of times on different VMs, and five of
my seven current animals run in VMs, and I don't think I've ever seen a
failure ascribable to inadequately synced files from initdb.

cheers

andrew

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 18:16:11
Message-ID:	1327774571.1734.32.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-01-28 at 10:31 -0500, Andrew Dunstan wrote:
> I'm curious what problem we're actually solving here, though. I've run
> the buildfarm countless thousands of times on different VMs, and five of
> my seven current animals run in VMs, and I don't think I've ever seen a
> failure ascribable to inadequately synced files from initdb.

I believe I have seen such a problem second hand in a situation where
the VM was known to be killed harshly (not sure if you do that
regularly).

It's a little difficult for me to _prove_ that this would have solved
the problem, and I think it was only observed once (though I could
probably reproduce it if I tried). The symptom was a log message
indicating that PG_VERSION was missing or corrupt on a system that was
previously started and online (albeit briefly for a test).

Regards,
Jeff Davis

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Noah Misch <noah(at)leadboat(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 18:18:00
Message-ID:	24164.1327774680@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> I'm curious what problem we're actually solving here, though. I've run
> the buildfarm countless thousands of times on different VMs, and five of
> my seven current animals run in VMs, and I don't think I've ever seen a
> failure ascribable to inadequately synced files from initdb.

Yeah. Personally I would be sad if initdb got noticeably slower, and
I've never seen or heard of a failure that this would fix.

I wonder whether it wouldn't be sufficient to call sync(2) at the end,
anyway, rather than cluttering the entire initdb codebase with fsync
calls.

regards, tom lane

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Noah Misch <noah(at)leadboat(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 18:27:14
Message-ID:	CAMkU=1yXVsN9m8MBcaouVRYwE0ZXeu6r10T64GS3i13Ftj=tgg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 28, 2012 at 7:31 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>
> On 01/27/2012 11:52 PM, Noah Misch wrote:
>>>
>>> Is a platform-independent fsync be available at initdb time?
>>
>> Not sure.
>>
>
> It's a macro on Windows that calls _commit(fd), so it should be portable
> enough.
>
> I'm curious what problem we're actually solving here, though. I've run the
> buildfarm countless thousands of times on different VMs, and five of my
> seven current animals run in VMs, and I don't think I've ever seen a failure
> ascribable to inadequately synced files from initdb.

I wouldn't expect you to ever see that problem on the buildfarm. If
the OS gets thunked during the middle of a regression test, when it
comes back up the code is not going to try to pick up where it left
off, it is just going to blow away the entire install and start over
from scratch. So any crash-recoverability problems will never be
detected. I would guess the original poster is doing a more stringent
kind of test.

Cheers,

Jeff

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 18:46:06
Message-ID:	1327776366.1734.39.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> > I'm curious what problem we're actually solving here, though. I've run
> > the buildfarm countless thousands of times on different VMs, and five of
> > my seven current animals run in VMs, and I don't think I've ever seen a
> > failure ascribable to inadequately synced files from initdb.
>
> Yeah. Personally I would be sad if initdb got noticeably slower, and
> I've never seen or heard of a failure that this would fix.
>
> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.

I can always add a "sync" call to the test, also (rather than modifying
initdb). Or, it could be an initdb option, which might be a good
compromise. I don't have a strong opinion here.

As machines get more memory and filesystems get more lazy, I wonder if
it will be a more frequent occurrence, however. On the other hand, if
filesystems are more lazy, that also increases the cost associated with
extra "sync" calls. I think there would be a surprise factor if
sometimes initdb had a long pause at the end and caused 10GB of data to
be written out.

Regards,
Jeff Davis

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 18:57:03
Message-ID:	CAMkU=1wSfA65kz6zwDCjNV4AyeBp=Hnv74g_Ok4PFQdagqyOvg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 28, 2012 at 10:18 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> I'm curious what problem we're actually solving here, though. I've run
>> the buildfarm countless thousands of times on different VMs, and five of
>> my seven current animals run in VMs, and I don't think I've ever seen a
>> failure ascribable to inadequately synced files from initdb.
>
> Yeah. Personally I would be sad if initdb got noticeably slower, and
> I've never seen or heard of a failure that this would fix.
>
> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.
>
> regards, tom lane

Does sync(2) behave like sync(8) and flush the entire system cache, or
does it just flush the files opened by the process which called it?

The man page didn't enlighten me on that.

sometimes sync(8) never returns. It doesn't just flush what was dirty
at the time it was called, it actually keeps running until there are
simultaneously no dirty pages anywhere in the system. On busy
systems, this condition might never be reached. And it can't be
interrupted, not even with kill -9.

Cheers,

Jeff

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-01-28 18:59:51
Message-ID:	4F2445A7.60407@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/28/2012 01:46 PM, Jeff Davis wrote:
> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
>> Andrew Dunstan<andrew(at)dunslane(dot)net> writes:
>>> I'm curious what problem we're actually solving here, though. I've run
>>> the buildfarm countless thousands of times on different VMs, and five of
>>> my seven current animals run in VMs, and I don't think I've ever seen a
>>> failure ascribable to inadequately synced files from initdb.
>> Yeah. Personally I would be sad if initdb got noticeably slower, and
>> I've never seen or heard of a failure that this would fix.
>>
>> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
>> anyway, rather than cluttering the entire initdb codebase with fsync
>> calls.
> I can always add a "sync" call to the test, also (rather than modifying
> initdb). Or, it could be an initdb option, which might be a good
> compromise. I don't have a strong opinion here.
>
> As machines get more memory and filesystems get more lazy, I wonder if
> it will be a more frequent occurrence, however. On the other hand, if
> filesystems are more lazy, that also increases the cost associated with
> extra "sync" calls. I think there would be a surprise factor if
> sometimes initdb had a long pause at the end and caused 10GB of data to
> be written out.
>

-1 for that. A very quick look at initdb.c suggests to me that there are
only two places where we'd need to put fsync(), right before we call
fclose() in write_file() and write_version_file(). If we're going to do
anything that seems to be the least painful and most portable way to go.

cheers

andrew

From:	Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-04 21:20:05
Message-ID:	87y5sinvne.fsf@mid.deneb.enyo.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Tom Lane:

> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.

We tried to do this in the Debian package mananger. It works as
expected on Linux systems, but it can cause a lot of data to hit the
disk, and there are kernel versions where sync(2) never completes if
the system is rather busy.

initdb is much faster with 9.1 than with 8.4. It's so fast that you
can use it in test suites, instead of reusing an existing cluster.
I think this is a rather desirable property.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-04 23:41:27
Message-ID:	1328398887.15224.17.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> Yeah. Personally I would be sad if initdb got noticeably slower, and
> I've never seen or heard of a failure that this would fix.

I worked up a patch, and it looks like it does about 6 file fsync's and
a 7th for the PGDATA directory. That degrades the time from about 1.1s
to 1.4s on my workstation.

pg_test_fsync says this about my workstation (one 8kB write):
open_datasync 117.495 ops/sec
fdatasync 117.949 ops/sec
fsync 25.530 ops/sec
fsync_writethrough n/a
open_sync 24.666 ops/sec

25 ops/sec means about 40ms per fsync, times 7 is about 280ms, so that
seems like about the right degradation for fsync. I tried with fdatasync
as well to see if it improved things, and I wasn't able to realize any
difference (not sure exactly why).

So, is it worth it? Should we make it an option that can be specified?

> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.

It looks like there are only a few places, so I don't think clutter is
really the problem with the simple patch at this point (unless there is
a portability problem with just calling fsync).

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync.patch.gz	application/x-gzip	757 bytes

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-05 01:18:49
Message-ID:	20120205011849.GC24073@tornado.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote:
> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> > Yeah. Personally I would be sad if initdb got noticeably slower, and
> > I've never seen or heard of a failure that this would fix.
>
> I worked up a patch, and it looks like it does about 6 file fsync's and
> a 7th for the PGDATA directory. That degrades the time from about 1.1s
> to 1.4s on my workstation.

> So, is it worth it? Should we make it an option that can be specified?

If we add fsync calls to the initdb process, they should cover the entire data
directory tree. This patch syncs files that initdb.c writes, but we ought to
also sync files that bootstrap-mode backends had written. An optimization
like the pg_flush_data() call in copy_file() may reduce the speed penalty.

initdb should do these syncs by default and offer an option to disable them.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-05 18:53:20
Message-ID:	1328468000.15224.32.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> If we add fsync calls to the initdb process, they should cover the entire data
> directory tree. This patch syncs files that initdb.c writes, but we ought to
> also sync files that bootstrap-mode backends had written.

It doesn't make sense for initdb to take responsibility to sync files
created by the backend. If there are important files that the backend
creates, it should be the backend's responsibility to fsync them (and
their parent directory, if needed). And if they are unimportant to the
backend, then there is no reason for initdb to fsync them.

> An optimization
> like the pg_flush_data() call in copy_file() may reduce the speed penalty.

That worked pretty well. It took it down about 100ms on my machine,
which closes the gap significantly.

> initdb should do these syncs by default and offer an option to disable them.

For test frameworks that run initdb often, that makes sense.

But for developers, it doesn't make sense to spend 0.5s typing an option
that saves you 0.3s. So, we'd need some more convenient way to choose
the no-fsync option, like an environment variable that developers can
set. Or maybe developers don't care about 0.3s?

Regards,
Jeff Davis

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-05 19:50:20
Message-ID:	18813.1328471420@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
>> If we add fsync calls to the initdb process, they should cover the entire data
>> directory tree. This patch syncs files that initdb.c writes, but we ought to
>> also sync files that bootstrap-mode backends had written.

> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend.

No, but the more interesting question is whether bootstrap mode troubles
to fsync its writes. I'm not too sure about that either way ...

regards, tom lane

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-05 22:56:57
Message-ID:	20120205225657.GA2911@tornado.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> > If we add fsync calls to the initdb process, they should cover the entire data
> > directory tree. This patch syncs files that initdb.c writes, but we ought to
> > also sync files that bootstrap-mode backends had written.
>
> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend. If there are important files that the backend
> creates, it should be the backend's responsibility to fsync them (and
> their parent directory, if needed). And if they are unimportant to the
> backend, then there is no reason for initdb to fsync them.

I meant primarily to illustrate the need to be comprehensive, not comment on
which executable should fsync a particular file. Bootstrap-mode backends do
not sync anything during an initdb run on my system. With your patch, we'll
fsync a small handful of files and leave nearly everything else vulnerable.

That being said, having each backend fsync its own writes will mean syncing
certain files several times within a single initdb run. If the penalty from
that proves high enough, we may do well to instead have initdb.c sync
everything just once.

> > initdb should do these syncs by default and offer an option to disable them.
>
> For test frameworks that run initdb often, that makes sense.
>
> But for developers, it doesn't make sense to spend 0.5s typing an option
> that saves you 0.3s. So, we'd need some more convenient way to choose
> the no-fsync option, like an environment variable that developers can
> set. Or maybe developers don't care about 0.3s?

Developers have shell aliases/functions/scripts and command history. I
wouldn't object to having an environment variable control it, but I would not
personally find that more convenient than a command-line switch.

Thanks,
nm

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-10 20:57:52
Message-ID:	1328907472.5373.4.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote:
> > initdb should do these syncs by default and offer an option to
> disable them.
>
> For test frameworks that run initdb often, that makes sense.
>
> But for developers, it doesn't make sense to spend 0.5s typing an
> option
> that saves you 0.3s. So, we'd need some more convenient way to choose
> the no-fsync option, like an environment variable that developers can
> set. Or maybe developers don't care about 0.3s?
>
You can use https://launchpad.net/libeatmydata for those cases.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-02-13 13:14:52
Message-ID:	CA+TgmoaxXzS5pC1OmyTiVzzDsBS7ygDCOSR7a1HTQbsRVEQNDw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Feb 10, 2012 at 3:57 PM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote:
>> > initdb should do these syncs by default and offer an option to
>> disable them.
>>
>> For test frameworks that run initdb often, that makes sense.
>>
>> But for developers, it doesn't make sense to spend 0.5s typing an
>> option
>> that saves you 0.3s. So, we'd need some more convenient way to choose
>> the no-fsync option, like an environment variable that developers can
>> set. Or maybe developers don't care about 0.3s?
>>
> You can use https://launchpad.net/libeatmydata for those cases.

That's hilarious.

But, a command-line option seems more convenient.

It also seems entirely sufficient. The comments above suggest that it
would take too long to type the option, but any PG developers who are
worried about the speed difference surely know how to create shell
aliases, shell functions, shell scripts, ... and if anyone's really
concerned about it, we can provide a short form for the option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-03-13 03:49:40
Message-ID:	1331610580.6425.98.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote:
> I meant primarily to illustrate the need to be comprehensive, not comment on
> which executable should fsync a particular file. Bootstrap-mode backends do
> not sync anything during an initdb run on my system. With your patch, we'll
> fsync a small handful of files and leave nearly everything else vulnerable.

Thank you for pointing that out. With that in mind, I have a new version
of the patch which just recursively fsync's the whole directory
(attached).

I also introduced a new option --nosync (-N) to disable this behavior.

The bad news is that it introduces a lot more time to initdb -- it goes
from about 1s to about 10s on my machine. I tried fsync'ing the whole
directory twice just to make sure that the second was a no-op, and
indeed it didn't make much difference (still about 10s).

That's pretty inefficient considering that

initdb -D data --nosync && sync

only takes a couple seconds. Clearly batching the operation is a big
help. Maybe there's some more efficient way to fsync a lot of
files/directories? Or maybe I can mitigate it by avoiding files that
don't really need to be fsync'd?

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120312.patch.gz	application/x-gzip	2.2 KB

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-13 08:42:03
Message-ID:	201203130942.03551.andres@anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tuesday, March 13, 2012 04:49:40 AM Jeff Davis wrote:
> On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote:
> > I meant primarily to illustrate the need to be comprehensive, not comment
> > on which executable should fsync a particular file. Bootstrap-mode
> > backends do not sync anything during an initdb run on my system. With
> > your patch, we'll fsync a small handful of files and leave nearly
> > everything else vulnerable.
>
> Thank you for pointing that out. With that in mind, I have a new version
> of the patch which just recursively fsync's the whole directory
> (attached).
>
> I also introduced a new option --nosync (-N) to disable this behavior.
>
> The bad news is that it introduces a lot more time to initdb -- it goes
> from about 1s to about 10s on my machine. I tried fsync'ing the whole
> directory twice just to make sure that the second was a no-op, and
> indeed it didn't make much difference (still about 10s).
I suggest you try making it two loops:

for recursively everything in dir:
posix_fadvise(fd, POSIX_FADV_DONTNEED);

for recursively everything in dir:
fsync(fd);

In my experience that gives way much better performance due to the fact that
it does not force its own metadata/journal commit/transaction for every file
but can be batched. copydir() does the same since some releases...

Obviously its not that nice to use _DONTNEED but I havent found something that
works equally well. You could try sync_file_range(fd, 0, 0,
SYNC_FILE_RANGE_WRITE) in the first loop but my experience with that hasn't
been that good.

Andres

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-14 04:23:03
Message-ID:	1331698983.6425.108.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> for recursively everything in dir:
> posix_fadvise(fd, POSIX_FADV_DONTNEED);
>
> for recursively everything in dir:
> fsync(fd);

Wow, that made a huge difference!

no sync: ~ 1.0s
sync: ~10.0s
fadvise+sync: ~ 1.3s

Patch attached.

Now I feel much better about it. Most people will either have fadvise, a
write cache (rightly or wrongly), or actually need the sync. Those that
have none of those can use -N.

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120313.patch.gz	application/x-gzip	2.5 KB

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-14 09:26:17
Message-ID:	201203141026.17799.andres@anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote:
> On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> > for recursively everything in dir:
> > posix_fadvise(fd, POSIX_FADV_DONTNEED);
> >
> > for recursively everything in dir:
> > fsync(fd);
>
> Wow, that made a huge difference!
>
> no sync: ~ 1.0s
> sync: ~10.0s
> fadvise+sync: ~ 1.3s
>
> Patch attached.
>
> Now I feel much better about it. Most people will either have fadvise, a
> write cache (rightly or wrongly), or actually need the sync. Those that
> have none of those can use -N.
Well, while the positive effect of this are rather large it also has the bad
effect of pushing the whole new database out of the cache. Which is not so nice
if you want to run tests on it seconds later.
How are the results with sync_file_range(fd, 0, 0,
SYNC_FILE_RANGE_WRITE)?

Andres

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-15 06:38:38
Message-ID:	1331793518.824.35.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote:
> On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote:
> > On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> > > for recursively everything in dir:
> > > posix_fadvise(fd, POSIX_FADV_DONTNEED);
> > >
> > > for recursively everything in dir:
> > > fsync(fd);
> >
> > Wow, that made a huge difference!
> >
> > no sync: ~ 1.0s
> > sync: ~10.0s
> > fadvise+sync: ~ 1.3s

I take that back. There was something wrong with my test -- fadvise
helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
hoped.

> Well, while the positive effect of this are rather large it also has the bad
> effect of pushing the whole new database out of the cache. Which is not so nice
> if you want to run tests on it seconds later.

I was unable to see a regression when it comes to starting it up after
the fadvise+fsync. My test just started the server, created a table,
then stopped the server. It was actually a hair faster with the
directory that had been fadvise'd and then fsync'd, but I assume that
was noise. Regardless, this doesn't look like an issue.

> How are the results with sync_file_range(fd, 0, 0,
> SYNC_FILE_RANGE_WRITE)?

That is much faster than using fadvise. It goes down to ~2s.

Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
the annoying side (and causes some disconcerting sounds to come from my
disk), especially when we _know_ it can be done in 2s.

Anyway, updated patch attached.

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120314.patch.gz	application/x-gzip	2.6 KB

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-16 10:25:01
Message-ID:	201203161125.02173.andres@anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thursday, March 15, 2012 07:38:38 AM Jeff Davis wrote:
> On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote:
> > On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote:
> > > On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> > > > for recursively everything in dir:
> > > > posix_fadvise(fd, POSIX_FADV_DONTNEED);
> > > >
> > > > for recursively everything in dir:
> > > > fsync(fd);
> > >
> > > Wow, that made a huge difference!
> > >
> > > no sync: ~ 1.0s
> > > sync: ~10.0s
> > > fadvise+sync: ~ 1.3s
>
> I take that back. There was something wrong with my test -- fadvise
> helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
> hoped.
Thats surprising. I wouldn't expect such a big difference between fadvise +
sync_file_range. Rather strange.

> > Well, while the positive effect of this are rather large it also has the
> > bad effect of pushing the whole new database out of the cache. Which is
> > not so nice if you want to run tests on it seconds later.
>
> I was unable to see a regression when it comes to starting it up after
> the fadvise+fsync. My test just started the server, created a table,
> then stopped the server. It was actually a hair faster with the
> directory that had been fadvise'd and then fsync'd, but I assume that
> was noise. Regardless, this doesn't look like an issue.
Hm. Ok.

> > How are the results with sync_file_range(fd, 0, 0,
> > SYNC_FILE_RANGE_WRITE)?
> That is much faster than using fadvise. It goes down to ~2s.

> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
> the annoying side (and causes some disconcerting sounds to come from my
> disk), especially when we _know_ it can be done in 2s.
Its not like posix_fadvise is actually portable. So I personally don't see a
problem with that, but...

Andres

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-16 15:47:06
Message-ID:	CA+TgmoYqLpagU3GSvvxbHPs9UhQq4QSYCvdD=qj6UiAyPFJTEw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Mar 16, 2012 at 6:25 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> > How are the results with sync_file_range(fd, 0, 0,
>> > SYNC_FILE_RANGE_WRITE)?
>> That is much faster than using fadvise. It goes down to ~2s.
>
>> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
>> the annoying side (and causes some disconcerting sounds to come from my
>> disk), especially when we _know_ it can be done in 2s.
> Its not like posix_fadvise is actually portable. So I personally don't see a
> problem with that, but...

Well, sync_file_range only works on Linux, and will probably never
work anywhere else. posix_fadvise() at least has a chance of being
supported on other platforms, being a standard and all that. Though I
see that my Mac has neither. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-16 15:51:04
Message-ID:	201203161651.05074.andres@anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Friday, March 16, 2012 04:47:06 PM Robert Haas wrote:
> On Fri, Mar 16, 2012 at 6:25 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> > How are the results with sync_file_range(fd, 0, 0,
> >> > SYNC_FILE_RANGE_WRITE)?
> >>
> >> That is much faster than using fadvise. It goes down to ~2s.
> >>
> >> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
> >> the annoying side (and causes some disconcerting sounds to come from my
> >> disk), especially when we _know_ it can be done in 2s.
> >
> > Its not like posix_fadvise is actually portable. So I personally don't
> > see a problem with that, but...
>
> Well, sync_file_range only works on Linux, and will probably never
> work anywhere else. posix_fadvise() at least has a chance of being
> supported on other platforms, being a standard and all that. Though I
> see that my Mac has neither. :-(
I would suggest adding a wrapper function like:
pg_hint_writeback_flush(fd, off, len);

which then is something like

#if HAVE_SYNC_FILE_RANGE
sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
#elseif HAVE_POSIX_FADVISE
posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
#else
#endif

To my knowledge posix_fadvise currently is only supported on linux btw...

Andres

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-16 17:14:49
Message-ID:	1331918089.824.51.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2012-03-16 at 11:25 +0100, Andres Freund wrote:
> > I take that back. There was something wrong with my test -- fadvise
> > helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
> > hoped.
> Thats surprising. I wouldn't expect such a big difference between fadvise +
> sync_file_range. Rather strange.

I discussed this with my colleague who knows linux internals, and he
pointed me directly at the problem. fadvise and sync_file_range in this
case are both trying to put the data in the io scheduler queue (still in
the kernel, not on the device). The difference is that fadvise doesn't
wait, and sync_file_range does (keep in mind, this is waiting to get in
a queue to go to the device, not waiting for the device to write it or
even receive it).

He indicated that 4096 is a normal number that one might use for the
queue size. But on my workstation at home (ubuntu 11.10), the queue is
only 128. I bumped it up to 256 and now fadvise is just as fast!

This won't be a problem on production systems, but that doesn't help us
a lot. People setting up a production system don't care about 6.5
seconds of set-up time anyway. Casual users and developers do (the
latter problem can be solved with the --nosync switch, but the former
problem is the one we're discussing).

So, it looks like fadvise is the "right" thing to do, but I expect we'll
get some widely varying results from actual users. Then again, maybe
casual users don't care much about ~10s for initdb anyway? It's a fairly
rare operation for everyone except developers.

Regards,
Jeff Davis

From:	Cédric Villemain <cedric(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-17 16:48:38
Message-ID:	201203171748.38738.cedric@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Le vendredi 16 mars 2012 16:51:04, Andres Freund a écrit :
> On Friday, March 16, 2012 04:47:06 PM Robert Haas wrote:
> > On Fri, Mar 16, 2012 at 6:25 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >> > How are the results with sync_file_range(fd, 0, 0,
> > >> > SYNC_FILE_RANGE_WRITE)?
> > >>
> > >> That is much faster than using fadvise. It goes down to ~2s.
> > >>
> > >> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
> > >> the annoying side (and causes some disconcerting sounds to come from
> > >> my disk), especially when we _know_ it can be done in 2s.
> > >
> > > Its not like posix_fadvise is actually portable. So I personally don't
> > > see a problem with that, but...
> >
> > Well, sync_file_range only works on Linux, and will probably never
> > work anywhere else. posix_fadvise() at least has a chance of being
> > supported on other platforms, being a standard and all that. Though I
> > see that my Mac has neither. :-(
>
> I would suggest adding a wrapper function like:
> pg_hint_writeback_flush(fd, off, len);
>
> which then is something like
>
> #if HAVE_SYNC_FILE_RANGE
> sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
> #elseif HAVE_POSIX_FADVISE
> posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> #else
> #endif
>
> To my knowledge posix_fadvise currently is only supported on linux btw...

I agree with Andres.

I believe we should use sync_file_range (_before?) with linux.
And we can use posix_fadvise_dontneed on other kernels.

FADVISE_DONTNEED does start a writeback (it may decide not to do it too), but
the primer objective of posix_fadvise_dontneed is not to make sync() faster.
We just have writeback and sync() calls challenged together and we can face
situation where linux does not handle that so well. (depends on linux 2.6.18
or 32 or 3.2 or ...)

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	cedric(at)2ndquadrant(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-03-26 02:59:36
Message-ID:	1332730776.8251.93.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
> I agree with Andres.
>
>
> I believe we should use sync_file_range (_before?) with linux.
>
> And we can use posix_fadvise_dontneed on other kernels.
>
OK, updated patch attached. sync_file_range() is preferred,
posix_fadvise() is a fallback.

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120325.patch.gz	application/x-gzip	4.3 KB

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-06-13 04:09:11
Message-ID:	1339560551.19955.1.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote:
> On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
> > I agree with Andres.
> >
> >
> > I believe we should use sync_file_range (_before?) with linux.
> >
> > And we can use posix_fadvise_dontneed on other kernels.
> >
> OK, updated patch attached. sync_file_range() is preferred,
> posix_fadvise() is a fallback.
>

Rebased patch attached. No other changes.

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120612.patch.gz	application/x-gzip	4.3 KB

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-06-13 10:53:03
Message-ID:	1339584783.11971.28.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On tis, 2012-06-12 at 21:09 -0700, Jeff Davis wrote:
> On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote:
> > On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
> > > I agree with Andres.
> > >
> > >
> > > I believe we should use sync_file_range (_before?) with linux.
> > >
> > > And we can use posix_fadvise_dontneed on other kernels.
> > >
> > OK, updated patch attached. sync_file_range() is preferred,
> > posix_fadvise() is a fallback.
> >
>
> Rebased patch attached. No other changes.

The --help output for the -N option was copy-and-pasted wrongly.

The message issued when using -N is also a bit content-free. Maybe
something like

"Running in nosync mode. The data directory might become corrupt if the
operating system crashes.\n"

Which leads to the question, how does one get out of this state? Is
running sync(1) enough? Is starting the postgres server enough?

There are no updates to the initdb man page included in your patch,
which would be a suitable place to discuss this briefly.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-06-13 16:53:17
Message-ID:	1339606397.23449.13.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote:
> The --help output for the -N option was copy-and-pasted wrongly.
>
> The message issued when using -N is also a bit content-free. Maybe
> something like
>
> "Running in nosync mode. The data directory might become corrupt if the
> operating system crashes.\n"

Thank you, fixed.

> Which leads to the question, how does one get out of this state? Is
> running sync(1) enough? Is starting the postgres server enough?

sync(1) calls sync(2), and the man page says:

"According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually wait. (This still
does not guarantee data integrity: modern disks have large caches.)"

So it looks like sync is enough if you are on linux *and* you have any
unprotected write cache disabled.

I don't think starting the postgres server is enough.

Before, I think we were safe because we could assume that the OS would
flush the buffers before you had time to store any important data. But
now, that window can be much larger.

> There are no updates to the initdb man page included in your patch,
> which would be a suitable place to discuss this briefly.

Thank you, I added that.

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120613.patch.gz	application/x-gzip	4.6 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-18 16:05:29
Message-ID:	201206181805.29195.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wednesday, June 13, 2012 06:53:17 PM Jeff Davis wrote:
> On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote:
> > The --help output for the -N option was copy-and-pasted wrongly.
> >
> > The message issued when using -N is also a bit content-free. Maybe
> > something like
> >
> > "Running in nosync mode. The data directory might become corrupt if the
> > operating system crashes.\n"
>
> Thank you, fixed.
>
> > Which leads to the question, how does one get out of this state? Is
> > running sync(1) enough? Is starting the postgres server enough?
>
> sync(1) calls sync(2), and the man page says:
>
> "According to the standard specification (e.g., POSIX.1-2001), sync()
> schedules the writes, but may return before the actual writing is done.
> However, since version 1.3.20 Linux does actually wait. (This still
> does not guarantee data integrity: modern disks have large caches.)"
>
> So it looks like sync is enough if you are on linux *and* you have any
> unprotected write cache disabled.
Protection can include write barries, it doesn't need to be a BBU....

> I don't think starting the postgres server is enough.
Agreed.

> Before, I think we were safe because we could assume that the OS would
> flush the buffers before you had time to store any important data. But
> now, that window can be much larger.
I think to a large degree we didn't see any problem because of the old ext3
"sync the whole world" type of behaviour which was common for a very long
time.
So on ext3 (with data=ordered, the default) any fsync (checkpoints, commit,
...) would lead to all the other files being synced as well which means
starting the server once would have been enough on such a system...

Quick review:
- defaulting to initdb -N in the regression suite is not a good imo, because
that way the buildfarm won't catch problems in that area...
- could the copydir.c and initdb.c versions of walkdir/sync_fname et al be
unified?
- I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname
instead of cluttering the main function with it

Looks good otherwise!

Thanks,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject:	Re: initdb and fsync
Date:	2012-06-18 18:34:30
Message-ID:	1340044470.26286.1.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> - defaulting to initdb -N in the regression suite is not a good imo,
> because that way the buildfarm won't catch problems in that area...
>
The regression test suite also starts postgres with fsync off. This is
good, and I don't want to have more overhead in the tests. It's not
like we already have awesome test coverage for OS interaction.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-18 18:39:47
Message-ID:	1340044787.19023.47.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> Quick review:
> - defaulting to initdb -N in the regression suite is not a good imo, because
> that way the buildfarm won't catch problems in that area...

I removed the -N as you suggest. How much does performance matter on the
buildfarm?

> - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be
> unified?

There's a lot of backend-specific code in the copydir versions, like
using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
unifying them before, and concluded that it wouldn't add to the
readability, so I just commented where they came from.

If you feel there will be a maintainability problem, I can give it
another shot.

> - I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname
> instead of cluttering the main function with it

Done.

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120618.patch.gz	application/x-gzip	4.4 KB

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-06-18 18:52:25
Message-ID:	1340045545.19023.50.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-06-18 at 21:34 +0300, Peter Eisentraut wrote:
> On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> > - defaulting to initdb -N in the regression suite is not a good imo,
> > because that way the buildfarm won't catch problems in that area...
> >
> The regression test suite also starts postgres with fsync off. This is
> good, and I don't want to have more overhead in the tests. It's not
> like we already have awesome test coverage for OS interaction.

OK, changing back to using -N during regression tests due to performance
concerns.

Regards,
Jeff Davis

Attachment	Content-Type	Size
initdb-fsync-20120618-2.patch.gz	application/x-gzip	4.7 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-18 18:57:51
Message-ID:	201206182057.51876.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Monday, June 18, 2012 08:39:47 PM Jeff Davis wrote:
> On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> > Quick review:
> > - defaulting to initdb -N in the regression suite is not a good imo,
> > because that way the buildfarm won't catch problems in that area...
> I removed the -N as you suggest. How much does performance matter on the
> buildfarm?
I don't think the difference in initdb cost is relevant when running the
regression tests. Should it prove to be we can re-add -N after a week or two
in the buildfarm machines. I just remember that there were several OS specific
regression when adding the pre-syncing for createdb.

> > - could the copydir.c and initdb.c versions of walkdir/sync_fname et al
> > be unified?
> There's a lot of backend-specific code in the copydir versions, like
> using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
> unifying them before, and concluded that it wouldn't add to the
> readability, so I just commented where they came from.
Ok. Sensible reasons. I dislike that we know have two files using different
logic (copydir.c only using fadvise, initdb using sync_file_range if
available). Maybe we should just move the fadvise and sync_file_range calls
into its own common function?

> If you feel there will be a maintainability problem, I can give it
> another shot.
Its not too bad yet I guess, so ...

> > - I personally would find it way nicer to put USE_PRE_SYNC into
> > pre_sync_fname instead of cluttering the main function with it
> Done.

Looks good to me.

Btw, I just want to have said this, although I don't think its particularly
relevant as it doesn't affect correctness: Its possible to have a system where
sync_file_range is in the system headers but the kernel during runtime doesn't
support it. It is relatively new (2.6.17). It would be possible to fallback to
posix_fadvise which has been around far longer in that case...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-18 19:32:25
Message-ID:	1340047945.19023.55.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
> > > - defaulting to initdb -N in the regression suite is not a good imo,
> > > because that way the buildfarm won't catch problems in that area...
> > I removed the -N as you suggest. How much does performance matter on the
> > buildfarm?
> I don't think the difference in initdb cost is relevant when running the
> regression tests. Should it prove to be we can re-add -N after a week or two
> in the buildfarm machines. I just remember that there were several OS specific
> regression when adding the pre-syncing for createdb.

That sounds reasonable to me. Both patches are out there, so we can
figure out what the consensus is.

> > > - could the copydir.c and initdb.c versions of walkdir/sync_fname et al
> > > be unified?
> > There's a lot of backend-specific code in the copydir versions, like
> > using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
> > unifying them before, and concluded that it wouldn't add to the
> > readability, so I just commented where they came from.
> Ok. Sensible reasons. I dislike that we know have two files using different
> logic (copydir.c only using fadvise, initdb using sync_file_range if
> available). Maybe we should just move the fadvise and sync_file_range calls
> into its own common function?

I don't see fadvise in copydir.c, it looks like it just uses fsync. It
might speed it up to use a pre-sync call there, too -- database creation
does take a while.

If that's in the scope of this patch, I'll do it.

> Btw, I just want to have said this, although I don't think its particularly
> relevant as it doesn't affect correctness: Its possible to have a system where
> sync_file_range is in the system headers but the kernel during runtime doesn't
> support it. It is relatively new (2.6.17). It would be possible to fallback to
> posix_fadvise which has been around far longer in that case...

Interesting point, but I'm not too worried about it.

Regards,
Jeff Davis

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-18 19:41:01
Message-ID:	201206182141.02039.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Monday, June 18, 2012 09:32:25 PM Jeff Davis wrote:
> > > > - could the copydir.c and initdb.c versions of walkdir/sync_fname et
> > > > al be unified?
> > >
> > > There's a lot of backend-specific code in the copydir versions, like
> > > using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
> > > unifying them before, and concluded that it wouldn't add to the
> > > readability, so I just commented where they came from.
> >
> > Ok. Sensible reasons. I dislike that we know have two files using
> > different logic (copydir.c only using fadvise, initdb using
> > sync_file_range if available). Maybe we should just move the fadvise and
> > sync_file_range calls into its own common function?
>
> I don't see fadvise in copydir.c, it looks like it just uses fsync. It
> might speed it up to use a pre-sync call there, too -- database creation
> does take a while.
>
> If that's in the scope of this patch, I'll do it.
It calls pg_flush_data inside of copy_file which does the posix_fadvise... So
maybe just put the sync_file_range in pg_flush_data?

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-18 19:43:52
Message-ID:	1340048374-sup-5221@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Excerpts from Jeff Davis's message of lun jun 18 15:32:25 -0400 2012:
> On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:

> > Btw, I just want to have said this, although I don't think its particularly
> > relevant as it doesn't affect correctness: Its possible to have a system where
> > sync_file_range is in the system headers but the kernel during runtime doesn't
> > support it. It is relatively new (2.6.17). It would be possible to fallback to
> > posix_fadvise which has been around far longer in that case...
>
> Interesting point, but I'm not too worried about it.

Yeah. 2.6.17 was released on June 2006. The latest stable release
prior to 2.6.17 was 2.6.16.62 in 2008 and it was abandoned at that time
in favor of 2.6.27 as the new stable branch.

I don't think this is a problem any sane person is going to face; and
telling them to upgrade to a newer kernel seems an acceptable answer.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-18 19:55:27
Message-ID:	1340049327.19023.59.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> It calls pg_flush_data inside of copy_file which does the posix_fadvise... So
> maybe just put the sync_file_range in pg_flush_data?

Oh, I didn't notice that, thank you.

In that case, it may be good to combine them if possible. I will look
into it. There may be performance implications when used one a larger
amount of data though. I can do some brief testing.

Regards,
Jeff Davis

From:	David Fetter <david(at)fetter(dot)org>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject:	Re: initdb and fsync
Date:	2012-06-19 15:33:30
Message-ID:	20120619153330.GA9655@fetter.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 18, 2012 at 09:34:30PM +0300, Peter Eisentraut wrote:
> On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> > - defaulting to initdb -N in the regression suite is not a good imo,
> > because that way the buildfarm won't catch problems in that area...
> >
> The regression test suite also starts postgres with fsync off. This is
> good, and I don't want to have more overhead in the tests. It's not
> like we already have awesome test coverage for OS interaction.

What sorts of coverage would we want?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-19 17:22:02
Message-ID:	1340126522.19023.78.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> It calls pg_flush_data inside of copy_file which does the posix_fadvise... So
> maybe just put the sync_file_range in pg_flush_data?

The functions in fd.c aren't linked to initdb, so it's a challenge to
share that code (I remember now: that's why I didn't use pg_flush_data
originally). I don't see a clean way to resolve that, do you?

Also, it seems that sync_file_range() doesn't help with creating a
database much (on my machine), so it doesn't seem important to use it
there.

Right now I'm inclined to leave the patch as-is.

Regards,
Jeff Davis

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-06-19 17:25:40
Message-ID:	201206191925.40486.andres@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote:
> On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> > It calls pg_flush_data inside of copy_file which does the
> > posix_fadvise... So maybe just put the sync_file_range in pg_flush_data?
>
> The functions in fd.c aren't linked to initdb, so it's a challenge to
> share that code (I remember now: that's why I didn't use pg_flush_data
> originally). I don't see a clean way to resolve that, do you?
>
> Also, it seems that sync_file_range() doesn't help with creating a
> database much (on my machine), so it doesn't seem important to use it
> there.
>
> Right now I'm inclined to leave the patch as-is.
Fine with that, I wanted to bring it up and see it documented.

I have marked it with ready for committer. That committer needs to decide on -
N in the regression tests or not, but that shouldn't be much of a problem ;)

Cool!

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-06-22 14:04:23
Message-ID:	CA+TgmoayOyfhwfZbcXZTpnfqFxjNzrQdwAeFEvhHV7caVr_p7g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Feb 4, 2012 at 8:18 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote:
>> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
>> > Yeah. Personally I would be sad if initdb got noticeably slower, and
>> > I've never seen or heard of a failure that this would fix.
>>
>> I worked up a patch, and it looks like it does about 6 file fsync's and
>> a 7th for the PGDATA directory. That degrades the time from about 1.1s
>> to 1.4s on my workstation.
>
>> So, is it worth it? Should we make it an option that can be specified?
>
> If we add fsync calls to the initdb process, they should cover the entire data
> directory tree. This patch syncs files that initdb.c writes, but we ought to
> also sync files that bootstrap-mode backends had written. An optimization
> like the pg_flush_data() call in copy_file() may reduce the speed penalty.
>
> initdb should do these syncs by default and offer an option to disable them.

This may be a stupid question, by why is it initdb's job to fsync the
files the server creates, rather than the server's job? Normally we
rely on the server to make its own writes persistent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-06-23 01:46:52
Message-ID:	20120623014652.GA13646@tornado.leadboat.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jun 22, 2012 at 10:04:23AM -0400, Robert Haas wrote:
> On Sat, Feb 4, 2012 at 8:18 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > If we add fsync calls to the initdb process, they should cover the entire data
> > directory tree. ?This patch syncs files that initdb.c writes, but we ought to
> > also sync files that bootstrap-mode backends had written. ?An optimization
> > like the pg_flush_data() call in copy_file() may reduce the speed penalty.
> >
> > initdb should do these syncs by default and offer an option to disable them.
>
> This may be a stupid question, by why is it initdb's job to fsync the
> files the server creates, rather than the server's job? Normally we
> rely on the server to make its own writes persistent.

Modularity would dictate having the server fsync its own work product, but I
expect that approach to perform materially worse. initdb runs many
single-user server instances, and each would fsync independently. When N
initdb steps change one file, it would see N fsyncs. Using sync_file_range to
queue all writes is best for the typical interactive or quasi-interactive
initdb user. It's not always a win for server fsyncs, so we would need to
either define special cases such that it's used during initdb or forgo the
optimization. On the other hand, the server could skip some files, like fsm
forks, in a principled manner.

Overall, I think it will hard to improve modularity while retaining the
performance Jeff's approach achieves through exploiting initdb's big-picture
perspective. So I favor how Jeff has implemented it.

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: initdb and fsync
Date:	2012-06-23 09:17:45
Message-ID:	1340443065.10342.23.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2012-06-22 at 10:04 -0400, Robert Haas wrote:
> This may be a stupid question, by why is it initdb's job to fsync the
> files the server creates, rather than the server's job? Normally we
> rely on the server to make its own writes persistent.

That was my first reaction as well:

http://archives.postgresql.org/message-id/1328468000.15224.32.camel@jdavis

However, from the discussion it seems like it will be harder to do it
that way (during normal operation, a checkpoint is what syncs the files;
but as Tom points out, bootstrap mode does not). Also, I would expect
the fsyncs to be in a less-controlled pattern, so it might perform more
poorly.

Regards,
Jeff Davis

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject:	Re: initdb and fsync
Date:	2012-06-23 21:34:27
Message-ID:	1340487267.14881.6.camel@vanquo.pezone.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On mån, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
> I don't think the difference in initdb cost is relevant when running
> the regression tests. Should it prove to be we can re-add -N after a
> week or two in the buildfarm machines.

Keep in mind that the regression tests are not only run on the
buildfarm, and that pg_regress is not only used for the main regression
tests. When you're dealing with things in contrib or src/pl or external
modules, then adding 3 seconds to a 5 second test is not so nice.

> I just remember that there were several OS specific regression when
> adding the pre-syncing for createdb.

Sure, it would be useful to test this, but then we should also turn on
fsync in the backend. Make it a separate slow mode (or a separate fast
mode).

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-07-13 18:19:44
Message-ID:	6387.1342203584@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote:
>> Right now I'm inclined to leave the patch as-is.

> Fine with that, I wanted to bring it up and see it documented.

> I have marked it with ready for committer. That committer needs to decide on -
> N in the regression tests or not, but that shouldn't be much of a problem ;)

I'm picking up this patch now. What I'm inclined to do about the -N
business is to commit without that, so that we get a round of testing
in the buildfarm and find out about any portability issues, but then
change to use -N after a week or so. I agree that in the long run
we don't want regression tests to run with fsyncs by default.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-07-13 19:21:27
Message-ID:	18497.1342207287@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
>> Ok. Sensible reasons. I dislike that we know have two files using different
>> logic (copydir.c only using fadvise, initdb using sync_file_range if
>> available). Maybe we should just move the fadvise and sync_file_range calls
>> into its own common function?

> I don't see fadvise in copydir.c, it looks like it just uses fsync. It
> might speed it up to use a pre-sync call there, too -- database creation
> does take a while.

No, that's incorrect: the fadvise is there, inside pg_flush_data() which
is done during the writing phase. It seems to me that if we think
sync_file_range is a win, we ought to be using it in pg_flush_data just
as much as in initdb.

However, I'm a bit confused because in
http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php
you said

> So, it looks like fadvise is the "right" thing to do, but I expect we'll

Was that backwards? If not, why are we bothering with taking any
portability risks by adding use of sync_file_range?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-07-13 21:35:06
Message-ID:	26056.1342215306@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> I'm picking up this patch now. What I'm inclined to do about the -N
> business is to commit without that, so that we get a round of testing
> in the buildfarm and find out about any portability issues, but then
> change to use -N after a week or so. I agree that in the long run
> we don't want regression tests to run with fsyncs by default.

Committed without the -N in pg_regress (for now). I also stuck
sync_file_range into the backend's pg_flush_data --- it would be
interesting to hear measurements of whether that makes a noticeable
difference for CREATE DATABASE.

regards, tom lane

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-07-13 21:41:41
Message-ID:	1342215701.7441.40.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2012-07-13 at 15:21 -0400, Tom Lane wrote:
> No, that's incorrect: the fadvise is there, inside pg_flush_data() which
> is done during the writing phase.

Yeah, Andres pointed that out, also.

> It seems to me that if we think
> sync_file_range is a win, we ought to be using it in pg_flush_data just
> as much as in initdb.

It was pretty clearly a win for initdb, but I wasn't convinced it was a
good idea for other use cases.

The mechanism is outlined in the email you linked below. To paraphrase,
fadvise tries to put the data in the io scheduler queue (still in the
kernel before going to the device), and gives up if there is no room;
sync_file_range waits for there to be room if none is available.

For the case of initdb, the data needing to be fsync'd is effectively
constant, and it's a lot of small files. If the requests don't make it
to the io scheduler queue before fsync, the kernel doesn't have an
opportunity to schedule them properly.

But for larger amounts of data copying (like ALTER DATABASE SET
TABLESPACE), it seemed like there was more risk that sync_file_range
would starve out other processes by continuously filling up the io
scheduler queue (I'm not sure if there are protections against that or
not). Also, if the files are larger, a single fsync represents more
data, and the kernel may be able to schedule it well enough anyway.

I'm not an authority in this area though, so if you are comfortable
extending sync_file_range to copydir() as well, that's fine with me.

> However, I'm a bit confused because in
> http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php
> you said
>
> > So, it looks like fadvise is the "right" thing to do, but I expect we'll
>
> Was that backwards? If not, why are we bothering with taking any
> portability risks by adding use of sync_file_range?

That part of the email was less than conclusive, and I can't remember
exactly what point I was trying to make. sync_file_range is a big
practical win for the reasons I mentioned above (and also avoids some
unpleasant noises coming from the disk drive).

Regards,
Jeff Davis

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-07-13 22:11:37
Message-ID:	26690.1342217497@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> For the case of initdb, the data needing to be fsync'd is effectively
> constant, and it's a lot of small files. If the requests don't make it
> to the io scheduler queue before fsync, the kernel doesn't have an
> opportunity to schedule them properly.

> But for larger amounts of data copying (like ALTER DATABASE SET
> TABLESPACE), it seemed like there was more risk that sync_file_range
> would starve out other processes by continuously filling up the io
> scheduler queue (I'm not sure if there are protections against that or
> not). Also, if the files are larger, a single fsync represents more
> data, and the kernel may be able to schedule it well enough anyway.

> I'm not an authority in this area though, so if you are comfortable
> extending sync_file_range to copydir() as well, that's fine with me.

It could use some performance testing, which I don't have the time
for (or proper equipment). Anyone?

Also, I note that copy_file is set up so that it does sync_file_range or
fadvise for each 64K chunk of data, which seems mighty small. I wonder
if it'd be better to take that out of the loop and do one whole-file
advise at the end of the copy loop. Or at least use some larger
granularity for those calls.

regards, tom lane

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-07-13 23:25:39
Message-ID:	1342221939.30380.29.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2012-07-13 at 17:35 -0400, Tom Lane wrote:
> I wrote:
> > I'm picking up this patch now. What I'm inclined to do about the -N
> > business is to commit without that, so that we get a round of testing
> > in the buildfarm and find out about any portability issues, but then
> > change to use -N after a week or so. I agree that in the long run
> > we don't want regression tests to run with fsyncs by default.
>
> Committed without the -N in pg_regress (for now). I also stuck
> sync_file_range into the backend's pg_flush_data --- it would be
> interesting to hear measurements of whether that makes a noticeable
> difference for CREATE DATABASE.

Thank you.

One point about the commit message: fadvise does not block to go into
the request queue, sync_file_range does. The problem with fadvise is
that, when the request queue is small, it fills up so fast that most of
the requests never make it in, and fadvise is essentially a no-op.
sync_file_range waits for room in the queue, which is (based on my
tests) enough to improve the scheduling a lot.

Regards,
Jeff Davis

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject:	Re: initdb and fsync
Date:	2012-07-13 23:43:14
Message-ID:	28922.1342222994@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> One point about the commit message: fadvise does not block to go into
> the request queue, sync_file_range does. The problem with fadvise is
> that, when the request queue is small, it fills up so fast that most of
> the requests never make it in, and fadvise is essentially a no-op.
> sync_file_range waits for room in the queue, which is (based on my
> tests) enough to improve the scheduling a lot.

I see. I misunderstood your previous message. In that case, it seems
quite likely that it might be helpful if copy_file were to aggregate
the fadvise/sync_file_range calls over larger pieces of the file.
(I'm assuming that the request queue isn't bright enough to aggregate
by itself, though that might be wrong.)

regards, tom lane