fsync, ext2 on Linux

Lists: pgsql-hackers
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: pgsql-hackers(at)postgresql(dot)org
Subject: fsync, ext2 on Linux
Date: 2004-10-31 11:18:20
Message-ID: Pine.OSF.4.61.0410311246390.238375@kosh.hut.fi

The Linux fsync man page says:

"It does not necessarily ensure that the entry in the directory
containing the file has also reached disk. For that an explicit fsync on
the file descriptor of the directory is also needed."

AFAIK, we don't care about it at the moment. The actual behaviour depends
on the filesystem: reiserfs and other journaling filesystems probably
don't need the explicit fsync on the parent directory, but at least ext2
does.

I've experimented with a user-mode-linux installation, crashing it at
specific points. It seems that on ext2, it's possible to get the database
into an inconsistent state.

In particular:

1. start transaction
2. do a lot of updates, so that a new xlog file is created
3. commit
4. crash

Sometimes the creation of the new xlog file is lost, and the already
committed transaction is lost with it.

I also got into this situation after one crash test:

template1=# SELECT * FROM foo;
ERROR: could not access status of transaction 1768515945
DETAIL: could not open file "/home/hlinnaka/pgsql/data_broken/pg_clog/0696": No such file or directory

I haven't tried to debug it more deeply.

Should we fix this by fsyncing the parent directory of new files? We could
also declare ext2 broken, but other filesystems could behave the same way.
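
For illustration, here is a minimal C sketch of what fsyncing the parent
directory of a new file could look like. The helper names
fsync_parent_dir() and create_file_durably() are made up for this example
and are not PostgreSQL functions. Error handling is kept to a bare
minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync the directory that contains "path". */
int fsync_parent_dir(const char *path)
{
    char *copy = strdup(path);      /* dirname() may modify its argument */
    if (copy == NULL)
        return -1;

    /* The directory is opened read-only; fsync() only needs a descriptor. */
    int dfd = open(dirname(copy), O_RDONLY | O_DIRECTORY);
    free(copy);
    if (dfd < 0)
        return -1;

    int rc = fsync(dfd);
    int saved_errno = errno;
    close(dfd);
    errno = saved_errno;
    return rc;
}

/* Hypothetical helper: create a new file so that both its contents and
 * its directory entry have been flushed before returning 0. */
int create_file_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* A production version would loop on short writes. */
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* The step the man page asks for: flush the directory entry too. */
    return fsync_parent_dir(path);
}
```

The extra cost is one open/fsync/close on the directory per newly created
file, which would only be paid when a new xlog or clog segment appears.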

- Heikki


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync, ext2 on Linux
Date: 2004-10-31 12:12:40
Message-ID: 4184D6B8.3080209@opencloud.com

Heikki Linnakangas wrote:
> The Linux fsync man page says:
>
> "It does not necessarily ensure that the entry in the directory
> containing the file has also reached disk. For that an explicit fsync on
> the file descriptor of the directory is also needed."
>
> AFAIK, we don't care about it at the moment. The actual behaviour
> depends on the filesystem, reiserfs and other journaling filesystems
> probably don't need the explicit fsync on the parent directory, but at
> least ext2 does.
>
> I've experimented with a user-mode-linux installation, crashing it at
> specific points. It seems that on ext2, it's possible to get the
> database in non-consistent state.

Have you experimented with mounting the filesystem with the dirsync
option ('-o dirsync') or marking the log directory as synchronous with
'chattr +D'? (No, it's not a real fix, just another data point.)
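
As a rough illustration of what 'chattr +D' amounts to, the sketch below
sets the dirsync attribute on a directory from C through the
FS_IOC_GETFLAGS and FS_IOC_SETFLAGS ioctls. set_dirsync() is a
hypothetical name used only here. It assumes the filesystem supports the
flag and that the caller owns the directory or is privileged; otherwise
the ioctl fails and the function returns -1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: mark "dirpath" so that directory updates inside it
 * are written synchronously. Returns 0 on success, -1 on failure. */
int set_dirsync(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int flags = 0;
    int rc = ioctl(fd, FS_IOC_GETFLAGS, &flags);   /* read current attributes */
    if (rc == 0) {
        flags |= FS_DIRSYNC_FL;                    /* add the dirsync bit */
        rc = ioctl(fd, FS_IOC_SETFLAGS, &flags);   /* write them back */
    }

    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Like the mount option, this only changes how the filesystem treats the
directory; the application itself still does nothing about the directory
entry.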

-O


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync, ext2 on Linux
Date: 2004-10-31 15:15:01
Message-ID: 27660.1099235701@sss.pgh.pa.us

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> The Linux [ext2] fsync man page says:
> "It does not necessarily ensure that the entry in the directory
> containing the file has also reached disk. For that an explicit fsync on
> the file descriptor of the directory is also needed."

This seems so broken as to defy belief. A process creating a file
doesn't normally *have* a file descriptor for the parent directory,
and I don't think the concept of an FD for a directory is even
portable (opendir() certainly doesn't return an FD). One might also
ask if we are expected to fsync everything up to the root in order
to be sure that the file remains accessible, and how exactly we should
do that on directories we don't have write access for.
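
For reference, a small sketch of one way to get at a directory's file
descriptor on Linux: dirfd() exposes the descriptor behind a DIR * from
opendir(). It was a BSD/glibc extension and was later standardized in
POSIX.1-2008, so treat its availability on other platforms as an
assumption. fsync_dir_via_opendir() is a hypothetical name for this
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <unistd.h>

/* Hypothetical helper: fsync a directory through its DIR * stream.
 * Returns 0 on success, -1 on failure. */
int fsync_dir_via_opendir(const char *dirpath)
{
    DIR *dir = opendir(dirpath);
    if (dir == NULL)
        return -1;

    /* dirfd() returns the descriptor owned by the stream; it must not be
     * closed separately, closedir() takes care of it. */
    int rc = fsync(dirfd(dir));

    closedir(dir);
    return rc;
}
```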

In general we expect the filesystem to take care of its own metadata.
Run ext3 in journaling mode, or something like that.

(It occurs to me that the admin guide really ought to have a few words
about recommended and non-recommended filesystems ...)

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync, ext2 on Linux
Date: 2004-10-31 15:31:24
Message-ID: 4185054C.9050500@dunslane.net

Tom Lane wrote:

> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> The Linux [ext2] fsync man page says:
>> "It does not necessarily ensure that the entry in the directory
>> containing the file has also reached disk. For that an explicit fsync on
>> the file descriptor of the directory is also needed."
>
> This seems so broken as to defy belief. A process creating a file
> doesn't normally *have* a file descriptor for the parent directory,
> and I don't think the concept of an FD for a directory is even
> portable (opendir() certainly doesn't return an FD). One might also
> ask if we are expected to fsync everything up to the root in order
> to be sure that the file remains accessible, and how exactly we should
> do that on directories we don't have write access for.

The notes say this:

    When an ext2 file system is mounted with the sync option, directory
    entries are also implicitly synced by fsync.

cheers

andrew


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync, ext2 on Linux
Date: 2004-10-31 15:39:36
Message-ID: 41850738.4020807@commandprompt.com


> In general we expect the filesystem to take care of its own metadata.
> Run ext3 in journaling mode, or something like that.
>
> (It occurs to me that the admin guide really ought to have a few words
> about recommended and non-recommended filesystems ...)

Well, I am not their admin, but I don't suggest any of the ext filesystems.
Although ext3 is reasonably stable, it is very slow.

Stick with XFS, JFS or even Reiser.

Sincerely,

Joshua D. Drake

> regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync, ext2 on Linux
Date: 2004-10-31 18:02:09
Message-ID: Pine.OSF.4.61.0410311936360.256140@kosh.hut.fi

On Mon, 1 Nov 2004, Oliver Jowett wrote:

> Heikki Linnakangas wrote:
>> The Linux fsync man page says:
>>
>> "It does not necessarily ensure that the entry in the directory containing
>> the file has also reached disk. For that an explicit fsync on the file
>> descriptor of the directory is also needed."
>>
>> AFAIK, we don't care about it at the moment. The actual behaviour depends
>> on the filesystem, reiserfs and other journaling filesystems probably don't
>> need the explicit fsync on the parent directory, but at least ext2 does.
>>
>> I've experimented with a user-mode-linux installation, crashing it at
>> specific points. It seems that on ext2, it's possible to get the database
>> in non-consistent state.
>
> Have you experimented with mounting the filesystem with the dirsync option
> ('-o dirsync') or marking the log directory as synchronous with 'chattr +D'?
> (no, it's not a real fix, just another data point..)

A quick experiment shows that both seem to fix it, as expected.

"chattr +D" might not be such a bad idea. A warning would be nice if you
start the postmaster on a filesystem that requires it. Few admins would
know about it or remember it otherwise.

- Heikki


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync, ext2 on Linux
Date: 2004-10-31 18:19:35
Message-ID: Pine.OSF.4.61.0410312005580.256140@kosh.hut.fi

On Sun, 31 Oct 2004, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
>> The Linux [ext2] fsync man page says:
>> "It does not necessarily ensure that the entry in the directory
>> containing the file has also reached disk. For that an explicit fsync on
>> the file descriptor of the directory is also needed."
>
> This seems so broken as to defy belief. A process creating a file
> doesn't normally *have* a file descriptor for the parent directory,
> and I don't think the concept of an FD for a directory is even
> portable (opendir() certainly doesn't return an FD). One might also
> ask if we are expected to fsync everything up to the root in order
> to be sure that the file remains accessible, and how exactly we should
> do that on directories we don't have write access for.

I agree on the brokenness. Linux is the only OS I know of that's broken
in this way, so it doesn't really matter whether the fix is portable: we
would only do it on Linux anyway.

Surely it's not necessary to crawl up to the root. Just fsync the
parent of every new file and directory.

> In general we expect the filesystem to take care of its own metadata.
> Run ext3 in journaling mode, or something like that.

I normally run reiserfs, I set up the ext2 filesystem just to test it.

> (It occurs to me that the admin guide really ought to have a few words
> about recommended and non-recommended filesystems ...)

That's the least we can do. I wonder if we could check the filesystem at
runtime and issue a warning if it's not in the list of recommended
filesystems.
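
A rough sketch of such a runtime check on Linux, using statfs() and the
f_type magic numbers listed in the statfs(2) man page. The function names
and the choice of which filesystems count as "recommended" are
assumptions made up for this example. Note one limitation: ext2 and ext3
report the same magic number, so this check alone cannot tell them apart.

```c
#include <stdio.h>
#include <sys/vfs.h>

/* f_type magic numbers as listed in statfs(2). */
#define MAGIC_EXT      0xEF53UL      /* shared by ext2, ext3 and ext4 */
#define MAGIC_REISERFS 0x52654973UL
#define MAGIC_XFS      0x58465342UL
#define MAGIC_JFS      0x3153464aUL

/* Map a magic number to a name, or NULL if it is not in this small table. */
const char *fs_family(unsigned long magic)
{
    switch (magic) {
    case MAGIC_EXT:      return "ext2/ext3/ext4";
    case MAGIC_REISERFS: return "reiserfs";
    case MAGIC_XFS:      return "xfs";
    case MAGIC_JFS:      return "jfs";
    default:             return NULL;
    }
}

/* Hypothetical startup check. Returns 0 if no warning was needed,
 * 1 if a warning was printed, -1 if statfs() failed. */
int warn_about_filesystem(const char *datadir)
{
    struct statfs sfs;

    if (statfs(datadir, &sfs) != 0)
        return -1;

    unsigned long magic = (unsigned long) sfs.f_type;
    const char *family = fs_family(magic);

    if (family == NULL) {
        fprintf(stderr,
                "WARNING: \"%s\" is on an unrecognized filesystem (type 0x%lx)\n",
                datadir, magic);
        return 1;
    }
    if (magic == MAGIC_EXT) {
        /* Same magic for ext2 and ext3, so journaling cannot be confirmed. */
        fprintf(stderr,
                "WARNING: \"%s\" is on %s; cannot confirm that it is journaled\n",
                datadir, family);
        return 1;
    }
    return 0;
}
```

Telling ext2 from ext3 would need something beyond statfs(), for example
looking at the mount table, which is left out here.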

- Heikki