pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after

Lists: pgsql-committerspgsql-hackers
From: stark(at)postgresql(dot)org (Greg Stark)
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 00:50:57
Message-ID: 20100215005057.5AC8F7541C5@cvs.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Log Message:
-----------
Speed up CREATE DATABASE by deferring the fsyncs until after copying
all the data and using posix_fadvise to nudge the OS into flushing it
earlier. This also hopefully makes CREATE DATABASE avoid spamming the
cache.

Tests show a big speedup on Linux at least on some filesystems.

Idea and patch from Andres Freund.

Modified Files:
--------------
pgsql/src/backend/storage/file:
fd.c (r1.153 -> r1.154)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/storage/file/fd.c?r1=1.153&r2=1.154)
pgsql/src/include/storage:
fd.h (r1.66 -> r1.67)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/storage/fd.h?r1=1.66&r2=1.67)
pgsql/src/port:
copydir.c (r1.25 -> r1.26)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/port/copydir.c?r1=1.25&r2=1.26)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 06:57:19
Message-ID: 10713.1266217039@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

stark(at)postgresql(dot)org (Greg Stark) writes:
> Log Message:
> -----------
> Speed up CREATE DATABASE by deferring the fsyncs until after copying
> all the data and using posix_fadvise to nudge the OS into flushing it
> earlier. This also hopefully makes CREATE DATABASE avoid spamming the
> cache.

The buildfarm indicates that this patch has got some serious issues.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 07:13:32
Message-ID: 11772.1266218012@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

I wrote:
> The buildfarm indicates that this patch has got some serious issues.

Actually, looking closer, some of the Windows machines started failing
after the *earlier* patch to add directory fsyncs.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 08:36:31
Message-ID: 201002150936.35119.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Monday 15 February 2010 08:13:32 Tom Lane wrote:
> I wrote:
> > The buildfarm indicates that this patch has got some serious issues.
>
> Actually, looking closer, some of the Windows machines started failing
> after the *earlier* patch to add directory fsyncs.
And not only the windows machines. Seems sensible to add a configure check
whether directory-fsyncing works.
But at least I am not capable of writing good m4/configure.in/whatever without
strong supervision...

Will try if nobody else with more knowledge does and if somebody will look
over it afterwards.

Andres


From: marcin mank <marcin(dot)mank(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 09:36:40
Message-ID: b1b9fac61002150136x26be6083g780f9414045b63e1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Feb 15, 2010 at 9:36 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Monday 15 February 2010 08:13:32 Tom Lane wrote:
>> Actually, looking closer, some of the Windows machines started failing
>> after the *earlier* patch to add directory fsyncs.
> And not only the windows machines. Seems sensible to add a configure check
> whether directory-fsyncing works.

It looks like a thing that can be filesystem-dependent. Maybe a kind
of runtime check?

Greetings
Marcin Mańk


From: Andres Freund <andres(at)anarazel(dot)de>
To: marcin mank <marcin(dot)mank(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 10:02:50
Message-ID: 1266228170.2079.2.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Hi Marcin,

Sounds rather unlikely to me. Its likely handled at an upper layer (vfs in linux' case) and only overloaded when an optimized implementation is available.
Which os do you see implementing that only on a part of the filesystems?

A runtime check would be creating, fsyncing and deleting a directory for every directory youre fsyncing because they could be on a different fs...

Andres
--
Sent from a mobile phone - please excuse brevity and formatting.
----- Ursprüngliche Mitteilung -----
> On Mon, Feb 15, 2010 at 9:36 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Monday 15 February 2010 08:13:32 Tom Lane wrote:
> > > Actually, looking closer, some of the Windows machines started failing
> > > after the *earlier* patch to add directory fsyncs.
> > And not only the windows machines. Seems sensible to add a configure check
> > whether directory-fsyncing works.
>
> It looks like a thing that can be filesystem-dependent. Maybe a kind
> of runtime check?
>
> Greetings
> Marcin Mańk
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


From: Greg Stark <stark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: marcin mank <marcin(dot)mank(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 11:19:24
Message-ID: 407d949e1002150319j47d4c26dn33c5ea74f03ca709@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Feb 15, 2010 at 10:02 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi Marcin,
>
> Sounds rather unlikely to me. Its likely handled at an upper layer (vfs in linux' case) and only overloaded when an optimized implementation is available.
> Which os do you see implementing that only on a part of the filesystems?
>
> A runtime check would be creating, fsyncing and deleting a directory for every directory youre fsyncing because they could be on a different fs...

We could just not check the result code of the fsync. Or print a
warning the first time and stop trying subsequently.

When do we cut the alpha? If I look at it at about 10-11pm EST is that too late?

--
greg


From: marcin mank <marcin(dot)mank(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 11:34:44
Message-ID: b1b9fac61002150334k25a4977aj1f73e9dbcc2279a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Feb 15, 2010 at 11:02 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hi Marcin,
>
> Sounds rather unlikely to me. Its likely handled at an upper layer (vfs in linux' case) and only overloaded when an optimized implementation is available.
> Which os do you see implementing that only on a part of the filesystems?
>

I have a Windows XP dev machine, which runs virtualbox, which runs
ubuntu, which mounts a windows directory through vboxfs

fsync does error out on directories inside that mount.

btw: 8.4.2 initdb won`t work there too, So this is not a regression.
The error is:
DEBUG: creating and filling new WAL file
LOG: could not link file "pg_xlog/xlogtemp.2367" to
"pg_xlog/000000010000000000000000" (initialization of log file 0,
segment 0): Operation not permitted
FATAL: could not open file "pg_xlog/000000010000000000000000" (log
file 0, segment 0): No such file or directory

But I would not be that sure that eg. NFS or something like that won`t complain.

Ignoring the return code seems the right choice.

Greetings
Marcin Mańk


From: Greg Stark <stark(at)mit(dot)edu>
To: marcin mank <marcin(dot)mank(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 11:45:39
Message-ID: 407d949e1002150345v4ad4f665j9adbec94bf2ea996@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Feb 15, 2010 at 11:34 AM, marcin mank <marcin(dot)mank(at)gmail(dot)com> wrote:
> LOG:  could not link file "pg_xlog/xlogtemp.2367" to
> "pg_xlog/000000010000000000000000" (initialization of log file 0,
>

This is not related -- it seems your filesystem doesn't support hard
links. I thought we used "junctions" on versions of Windows that
support them which I would have expected would include XP but my
knowledge of Windows is thin and obsolete.

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: marcin mank <marcin(dot)mank(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 11:46:51
Message-ID: 201002151246.52392.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Monday 15 February 2010 12:34:44 marcin mank wrote:
> On Mon, Feb 15, 2010 at 11:02 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Hi Marcin,
> >
> > Sounds rather unlikely to me. Its likely handled at an upper layer (vfs
> > in linux' case) and only overloaded when an optimized implementation is
> > available. Which os do you see implementing that only on a part of the
> > filesystems?
>
> I have a Windows XP dev machine, which runs virtualbox, which runs
> ubuntu, which mounts a windows directory through vboxfs

> btw: 8.4.2 initdb won`t work there too, So this is not a regression.
> The error is:
> DEBUG: creating and filling new WAL file
> LOG: could not link file "pg_xlog/xlogtemp.2367" to
> "pg_xlog/000000010000000000000000" (initialization of log file 0,
> segment 0): Operation not permitted
> FATAL: could not open file "pg_xlog/000000010000000000000000" (log
> file 0, segment 0): No such file or directory
That does seem to be a different issue. Currently there are no fsyncs on
directories at all, so likely your setup is hosed anyway ;-)

> But I would not be that sure that eg. NFS or something like that won`t
> complain.
It does not.

> Ignoring the return code seems the right choice.
And the error hiding one as well. With delayed allocation you theoretically
could error out on fsync with -ENOSPC ...

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <stark(at)mit(dot)edu>, marcin mank <marcin(dot)mank(at)gmail(dot)com>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 11:50:49
Message-ID: 201002151250.51041.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Monday 15 February 2010 12:45:39 Greg Stark wrote:
> On Mon, Feb 15, 2010 at 11:34 AM, marcin mank <marcin(dot)mank(at)gmail(dot)com> wrote:
> > LOG: could not link file "pg_xlog/xlogtemp.2367" to
> > "pg_xlog/000000010000000000000000" (initialization of log file 0,
>
> This is not related -- it seems your filesystem doesn't support hard
> links. I thought we used "junctions" on versions of Windows that
> support them which I would have expected would include XP but my
> knowledge of Windows is thin and obsolete.
If I understood him correctly marcin seems to mount a windows share on linux
via some vbox-proprietary pseudo filesystem. That wont get detected and thus
no junctions will be used... (I have doubts you even can create them via
vboxfs (or even smb)).
I would consider that a unsupported setup. Agreed?

Andres


From: Greg Stark <stark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, marcin mank <marcin(dot)mank(at)gmail(dot)com>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 11:55:36
Message-ID: 407d949e1002150355k27e1f7f0v9faf3a32941e2244@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Feb 15, 2010 at 11:50 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> If I understood him correctly marcin seems to mount a windows share on linux
> via some vbox-proprietary pseudo filesystem. That wont get detected and thus
> no junctions will be used... (I have doubts you even can create them via
> vboxfs (or even smb)).
> I would consider that a unsupported setup. Agreed?

I'm not sure which versions of Windows we support in general. But on
further thought I thought we only used hard links for xlog files on
systems where we knew they worked and just did a rename() on systems
without them. So I'm puzzled why we're trying to hard link on this
system. Perhaps we need to make this a run-time check instead of just
making it depend on the system.

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <stark(at)mit(dot)edu>, marcin mank <marcin(dot)mank(at)gmail(dot)com>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 12:01:46
Message-ID: 201002151301.47920.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Monday 15 February 2010 12:55:36 Greg Stark wrote:
> On Mon, Feb 15, 2010 at 11:50 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > If I understood him correctly marcin seems to mount a windows share on
> > linux via some vbox-proprietary pseudo filesystem. That wont get
> > detected and thus no junctions will be used... (I have doubts you even
> > can create them via vboxfs (or even smb)).
> > I would consider that a unsupported setup. Agreed?
>
> I'm not sure which versions of Windows we support in general. But on
> further thought I thought we only used hard links for xlog files on
> systems where we knew they worked and just did a rename() on systems
> without them. So I'm puzzled why we're trying to hard link on this
> system. Perhaps we need to make this a run-time check instead of just
> making it depend on the system.
Well, I guess linux is normally a system where hardlinking is considered safe.
And I dont really see a problem with that - for example we require ntfs on
windows as well...
In the end its only some strange filesystem whats causing the issue here...

Andres


From: marcin mank <marcin(dot)mank(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 13:50:03
Message-ID: b1b9fac61002150550u66077335qe1b971338193e5cf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Yes, the issue with initdb failing is unrelated (and I have no problem
about the fs being unsupported). But fsync still DOES fail on
directories from the mount.

>> But I would not be that sure that eg. NFS or something like that won`t
>> complain.
> It does not.
>

What if someone mounts a NFS share from a system that does not support
directory fsync (per buildfarm: unixware, AIX) on Linux? I agree that
this is asking for trouble, but...

Greetings
Marcin Mańk


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: marcin mank <marcin(dot)mank(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 14:02:35
Message-ID: 201002151502.37654.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Monday 15 February 2010 14:50:03 marcin mank wrote:
> Yes, the issue with initdb failing is unrelated (and I have no problem
> about the fs being unsupported). But fsync still DOES fail on
> directories from the mount.
>
> >> But I would not be that sure that eg. NFS or something like that won`t
> >> complain.
> >
> > It does not.
>
> What if someone mounts a NFS share from a system that does not support
> directory fsync (per buildfarm: unixware, AIX) on Linux? I agree that
> this is asking for trouble, but...
Then nothing. The fsync via nfs or such is a local operation. There is nothing
like a "fsync" command transported - i.e. the fsync controls the local cache
not the remote one...

Andres


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: marcin mank <marcin(dot)mank(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-15 14:15:40
Message-ID: 9837222c1002150615v4c7584e8gac67834ae9b5305c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

2010/2/15 Greg Stark <stark(at)mit(dot)edu>:
> On Mon, Feb 15, 2010 at 11:34 AM, marcin mank <marcin(dot)mank(at)gmail(dot)com> wrote:
>> LOG:  could not link file "pg_xlog/xlogtemp.2367" to
>> "pg_xlog/000000010000000000000000" (initialization of log file 0,
>>
>
> This is not related -- it seems your filesystem doesn't support hard
> links. I thought we used "junctions" on versions of Windows that
> support them which I would have expected would include XP but my
> knowledge of Windows is thin and obsolete.

Junctions are for symbolic links, and only valid for directories. NTFS
has "real" hardlinks though CreateLink(). No idea if that works on
remote filesystems though.

But AFAIK, we don't use that on Windows. But the rest of the thread
has indicated why this shows up anyway :)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-committers(at)postgresql(dot)org
Cc: Greg Stark <stark(at)postgresql(dot)org>
Subject: Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-21 23:43:26
Message-ID: 201002220043.29420.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Monday 15 February 2010 01:50:57 Greg Stark wrote:
> Log Message:
> -----------
> Speed up CREATE DATABASE by deferring the fsyncs until after copying
> all the data and using posix_fadvise to nudge the OS into flushing it
> earlier. This also hopefully makes CREATE DATABASE avoid spamming the
> cache.
>
> Tests show a big speedup on Linux at least on some filesystems.
>
> Idea and patch from Andres Freund.
I just found a relatively big problem with one of your modifications on the
patch - you removed the
FreeDir(xldir);
xldir = AllocateDir(fromdir);
pair - unfortunately its crucial because otherwise the DIR does not get
rewound - that resulted in *no* files getting fsync()ed (otherwise the loop
above wouldn't have finished yet...).
I think that was also causing the problems I pointed out in " Directory fsync
and other fun"...

You removed it because you didn't want to open the directory twice? I think
doing that is simpler than using rewinddir - I have no idea how usable that
one is on windows for example

Could you add it back?

Andres


From: Greg Stark <stark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-committers(at)postgresql(dot)org, Greg Stark <stark(at)postgresql(dot)org>
Subject: Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-22 00:11:49
Message-ID: 407d949e1002211611v8196cc4mfdb2b0d205cc7d10@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Sun, Feb 21, 2010 at 11:43 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Could you add it back?
>

Oops, sorry. Sigh. Done.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-committers(at)postgresql(dot)org, Greg Stark <stark(at)postgresql(dot)org>
Subject: Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-22 02:54:40
Message-ID: 3943.1266807280@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> I just found a relatively big problem with one of your modifications on the
> patch - you removed the
> FreeDir(xldir);
> xldir = AllocateDir(fromdir);
> pair - unfortunately its crucial because otherwise the DIR does not get
> rewound - that resulted in *no* files getting fsync()ed (otherwise the loop
> above wouldn't have finished yet...).
> I think that was also causing the problems I pointed out in " Directory fsync
> and other fun"...

Actually, that code had *multiple* problems including stat'ing the wrong
file entirely, not to mention that this last commit failed to even
compile. I also think it should scan the todir not the fromdir, just on
general principles to avoid any possibility of race conditions.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-committers(at)postgresql(dot)org, Greg Stark <stark(at)postgresql(dot)org>
Subject: Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-22 10:15:18
Message-ID: 407d949e1002220215l6fe8816al6703a5dc79e24894@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Feb 22, 2010 at 2:54 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Actually, that code had *multiple* problems including stat'ing the wrong
> file entirely, not to mention that this last commit failed to even
> compile.  I also think it should scan the todir not the fromdir, just on
> general principles to avoid any possibility of race conditions.

Argh. I'll be less careless in the future, I promise.

I had concluded that scanning the original directory was odd but
better because it served to double-check that all the original files
actually made it and also because if there were any unrelated files
present there was no need to fsync them. But I agree it's odd and not
very general for copydir if we decide to use it elsewhere other than
create database.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-committers(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-22 14:53:59
Message-ID: 24404.1266850439@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Mon, Feb 22, 2010 at 2:54 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I also think it should scan the todir not the fromdir, just on
>> general principles to avoid any possibility of race conditions.

> I had concluded that scanning the original directory was odd but
> better because it served to double-check that all the original files
> actually made it and also because if there were any unrelated files
> present there was no need to fsync them.

Well, just for the record: if that was actually intentional then both of
you erred seriously by not including a comment that explained that the
coding was intentional (and giving the reasoning). Any reader of the
code would have assumed that it was a copy-and-paste error, as I did.

> But I agree it's odd and not
> very general for copydir if we decide to use it elsewhere other than
> create database.

Yeah, to me it seems more likely to cause problems down the road than
to catch anything. If the system is missing directory entries during
ReadDir then we have problems far beyond what copydir can deal with.
The point of the fsync loop is just to force the copy results down to
the platter, not to cross-check that the source directory isn't
changing. (And, what's more, I don't believe that the source directory
can't change during CREATE DATABASE. Consider delayed cleanup of
deleted relations during checkpoints.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-22 15:42:08
Message-ID: 25289.1266853328@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

BTW, I notice that after allegedly fixing things, we are now seeing
fsync failures during CREATE DATABASE in the installcheck phase of
buildfarm runs on (apparently) all the Windows critters, plus a
couple of other platforms too. This mystifies me. I could believe
that there was something still wrong with copydir.c, but then how
come these machines are getting through the earlier "make check"
phase?

I made a couple of code tweaks just now to try to get more information
--- the reported EBADF error numbers seem fairly implausible in
themselves, so I wondered if that's *really* what fsync is reporting.
I don't have a lot of hope for that though.

Any theories about what is happening?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-23 05:37:26
Message-ID: 10494.1266903446@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

I wrote:
> Any theories about what is happening?

Hah --- the AIX failures, at least, are explained at
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.basetechref/doc/basetrf1/fsync.htm
which says

Error Codes

The fsync or fsync_range subroutine is unsuccessful if one or more of the following are true:

EIO An I/O error occurred while reading from or writing to the file system.
EBADF The FileDescriptor parameter is not a valid file descriptor open for writing.
EINVAL The file is not a regular file.
EINTR The subroutine was interrupted by a signal.

So the problem is that fsync_fname is trying to fsync a file it's opened
O_RDONLY. I don't know whether Windows is similarly picky, but we'll
soon find out.

Now, this doesn't mean that all is fine and dandy. I believe that a
majority of Unixen will reject attempts to open directories for writing,
so this solution puts us even further away from being able to fsync the
directories. I would bet however that the platforms that reject this
are ones that don't need fsync on directories. Maybe we just have to
have two different code paths depending on platform :-(

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-23 07:43:21
Message-ID: 201002230843.32394.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Hi Tom,

On Tuesday 23 February 2010 06:37:26 Tom Lane wrote:
> I wrote:
> > Any theories about what is happening?
> Now, this doesn't mean that all is fine and dandy. I believe that a
> majority of Unixen will reject attempts to open directories for writing,
> so this solution puts us even further away from being able to fsync the
> directories. I would bet however that the platforms that reject this
> are ones that don't need fsync on directories. Maybe we just have to
> have two different code paths depending on platform :-(
Cool.
You can't open a directory for writing under linux as well though - so that
wont be the decisive argument. Do you have a better idea than a configure
test?

Andres


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-23 09:12:02
Message-ID: 407d949e1002230112o569eea1w97fd1aacc5a8bcec@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Tue, Feb 23, 2010 at 5:37 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> So the problem is that fsync_fname is trying to fsync a file it's opened
> O_RDONLY.  I don't know whether Windows is similarly picky, but we'll
> soon find out.
>

Argh, now I feel silly. I had actually found that in my searches after
the first batch of problems. But somehow i didn't connect that to the
current problems. Sorry.

There are other similarly confused OSes that don't allow fsync on
read-only file descriptors:
http://svn.haxx.se/dev/archive-2006-02/0488.shtml
(I wonder if some of them are doing fsync wrong and only syncing
changes written to this file descriptor and not any file descriptor
for this file?)

The plan I was thinking of was to pass a flag indicating if it's a
directory to fsync_fname() and open it RD_ONLY if it's a directory and
RDRW if it's a file. Then ignore any error returns (or at least EBADF
and EINVAL) iff the flag indicating it was a directory was true.

I don't like using configure tests for this because I fear someone
could compile Postgres on a system with one set of behaviour and then
switch to a different kernel version with a different set of
behaviour. In the worst case it could be filesystem dependent whether
you can open directories or whether they accept fsyncs.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-23 15:38:53
Message-ID: 21981.1266939533@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> I don't like using configure tests for this because I fear someone
> could compile Postgres on a system with one set of behaviour and then
> switch to a different kernel version with a different set of
> behaviour. In the worst case it could be filesystem dependent whether
> you can open directories or whether they accept fsyncs.

Yeah, and there's also the problem of cross-compilation. I don't want
a configure test either if we can avoid it.

> The plan I was thinking of was to pass a flag indicating if it's a
> directory to fsync_fname() and open it RD_ONLY if it's a directory and
> RDRW if it's a file. Then ignore any error returns (or at least EBADF
> and EINVAL) iff the flag indicating it was a directory was true.

Works for me, but let's first try just ignoring EBADF, which is the only
value we saw in the recent buildfarm failures. If we got past the fd<0
test then EBADF could only indicate a rdonly failure, whereas it's less
clear what EINVAL might cover.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-23 16:13:06
Message-ID: 22615.1266941586@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

I wrote:
> BTW, I notice that after allegedly fixing things, we are now seeing
> fsync failures during CREATE DATABASE in the installcheck phase of
> buildfarm runs on (apparently) all the Windows critters, plus a
> couple of other platforms too. This mystifies me. I could believe
> that there was something still wrong with copydir.c, but then how
> come these machines are getting through the earlier "make check"
> phase?

BTW, although things seem to be going green with the RDONLY->RDWR
change, I'm still mystified why these machines didn't fail at
"make check". Is it possible that "make check" runs the postmaster
with fsync disabled? I don't see that in the code anywhere...

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-28 18:00:07
Message-ID: 407d949e1002281000q648770eegff28a44fd0cb1751@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Tue, Feb 23, 2010 at 3:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The plan I was thinking of was to pass a flag indicating if it's a
>> directory to fsync_fname() and open it RD_ONLY if it's a directory and
>> RDRW if it's a file. Then ignore any error returns (or at least EBADF
>> and EINVAL) iff the flag indicating it was a directory was true.
>
> Works for me, but let's first try just ignoring EBADF, which is the only
> value we saw in the recent buildfarm failures.  If we got past the fd<0
> test then EBADF could only indicate a rdonly failure, whereas it's less
> clear what EINVAL might cover.

So I'm thinking of something like this.
Ignore ESDIR when opening a directory (and return immediately)
and ignore EBADF when trying to fsync a directory.

--
greg

Attachment Content-Type Size
create-database-fsync-dirs-context.diff text/x-patch 2.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] Re: pgsql: Speed up CREATE DATABASE by deferring the fsyncs until after
Date: 2010-02-28 19:31:00
Message-ID: 12428.1267385460@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> So I'm thinking of something like this.
> Ignore ESDIR when opening a directory (and return immediately)
> and ignore EBADF when trying to fsync a directory.

Seems reasonable, but get rid of the comment "However we can't do this
just yet, it has portability issues"; and you've got a double semicolon
in one place. It might also be worth commenting the BasicOpenFile calls
along the lines of "Many OSs don't let us open directories RDWR, while
some reject fsync on files opened RDONLY, so we need two cases."

regards, tom lane