Re: New Linux Filesystem: NILFS

Lists: pgsql-hackers
From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: New Linux Filesystem: NILFS
Date: 2006-09-05 22:24:28
Message-ID: 60fyf6vtnn.fsf_-_@dba2.int.libertyrms.com

Recently seen in ACM Operating Systems Review (this is the first time
I've found as many as 1 interesting article in it in a while, and
there were 3 things I found worthwhile...):

NTT (of the recent "NTT Power Hour") have created a new filesystem:
<http://www.nilfs.org/en/>

NILFS is a log-structured file system developed for Linux.

In effect, it provides the "moral equivalent" of MVCC for filesystems:
overwrites are equivalent to delete/insert, and it requires a "Cleaner"
process to reclaim formerly-used space.
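
As a rough illustration of the MVCC analogy, here is a toy append-only
block store in Python. It is only a sketch under stated assumptions: the
ToyLog class and its methods are hypothetical names invented for this
example, and nothing here reflects NILFS's actual on-disk format or API.

```python
class ToyLog:
    """Toy append-only block store: an overwrite appends a new version."""

    def __init__(self):
        self.log = []    # list of (block_id, data); entries are never modified in place
        self.index = {}  # block_id -> position of the live version in the log

    def write(self, block_id, data):
        # An overwrite is "delete/insert": the old entry stays in the log as dead space.
        self.log.append((block_id, data))
        self.index[block_id] = len(self.log) - 1

    def read(self, block_id):
        return self.log[self.index[block_id]][1]

    def dead_entries(self):
        # Entries that no longer hold the live version of any block.
        return len(self.log) - len(self.index)

    def clean(self):
        # The "Cleaner": copy live versions into a fresh log, dropping dead ones.
        live = sorted(self.index.items(), key=lambda kv: kv[1])
        self.log = [(bid, self.log[pos][1]) for bid, pos in live]
        self.index = {bid: i for i, (bid, _) in enumerate(self.log)}


fs = ToyLog()
fs.write("a", "v1")
fs.write("b", "v1")
fs.write("a", "v2")  # the old ("a", "v1") entry is now dead space
```

Until clean() runs, the dead entry keeps occupying space, which is the
same shape of problem that VACUUM addresses in PostgreSQL.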

It ought to have two merits over journalling filesystems:

1. It doesn't need to write data twice, which should improve
performance

2. It doesn't need to repetitively overwrite metadata, which should
improve crash safety.

On the latter, per the paper:

"... These journaling filesystems enable fast and consistent recovery
of the file system after unexpected system freezes or power
failures. However, they still allow the fatal destruction of the file
system due to the characteristic that recovery is realized by
overwriting meta data with their copies saved in a journal file. This
recovery is guaranteed to work properly only if the write order of the
on-disk data blocks and meta data blocks is physically conserved on
the disk platters. Unfortunately, this constraint is often violated by
the write optimizations performed by the block I/O subsystem and disk
controllers."
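
The write-ordering hazard described in that quotation can be sketched
with a toy model in Python. This is an assumption-laden illustration, not
how any real journal is laid out: the "journal", "commit" and "home" keys
and both helper functions are hypothetical names made up for the example.

```python
def recover(disk):
    """Replay the journal onto the home location if a commit marker is present."""
    if disk.get("commit"):
        # Recovery trusts that the journal payload became durable before the marker.
        disk["home"] = disk.get("journal")
    return disk.get("home")


def crash_after(writes, n):
    """Return the disk state if power fails after the first n writes reach the platter."""
    disk = {"home": "old"}
    for key, value in writes[:n]:
        disk[key] = value
    return disk


# Order the filesystem asked for: payload first, then the commit marker.
intended = [("journal", "new"), ("commit", True)]
# Order a reordering controller might actually use (assumed for illustration).
reordered = [("commit", True), ("journal", "new")]

ordered_result = recover(crash_after(intended, 1))
reordered_result = recover(crash_after(reordered, 1))
```

With the intended order, a crash after one write leaves the old metadata
intact. With the reordered writes, recovery sees a commit marker with no
payload behind it and overwrites good metadata with nothing.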

It's still at a somewhat early stage, as they haven't completed coding
the Cleaner. (Probably should call it the Reaper... :-))

By the way, the Google SOC 2005 also produced one:
<http://logfs.sourceforge.net/>

NetBSD used to have a LFS; has that gone anywhere? Or been
essentially dropped?
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://linuxdatabases.info/info/emacs.html
"I develop for Linux for a living, I used to develop for DOS. Going
from DOS to Linux is like trading a glider for an F117."
-- <entropy(at)world(dot)std(dot)com> Lawrence Foard


From: Douglas McNaught <doug(at)mcnaught(dot)org>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-05 22:36:15
Message-ID: 87r6yqndpc.fsf@suzuka.mcnaught.org

Chris Browne <cbbrowne(at)acm(dot)org> writes:

> NetBSD used to have a LFS; has that gone anywhere? Or been
> essentially dropped?

My reading over the last few years has indicated that LFSs tend to
suffer bad performance degradation as data and metadata for a given
file get scattered all over the disk. This tends to cancel out the
performance gain from being able to cluster writes in a single area.
For a heavily write-intensive workload, it might be a win, but no one
seems to have demonstrated an advantage for "normal", mostly
read-heavy usage.
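
A small simulation can illustrate the scattering effect. This is a
sketch with made-up parameters (block count, number of overwrites, seed),
and it ignores any cleaner or read cache; the function names are
hypothetical.

```python
import random


def count_seeks(layout):
    """Count jumps to a non-adjacent physical block during a sequential read."""
    return sum(1 for prev, cur in zip(layout, layout[1:]) if cur != prev + 1)


def simulate(num_blocks=1000, overwrites=300, seed=0):
    rng = random.Random(seed)
    # layout[i] is the physical position of logical block i of one file.
    in_place = list(range(num_blocks))
    log_structured = list(range(num_blocks))
    log_tail = num_blocks  # next free physical position at the end of the log
    for _ in range(overwrites):
        block = rng.randrange(num_blocks)
        # Update-in-place: the physical position does not change.
        # Log-structured: the new version is appended at the tail of the log.
        log_structured[block] = log_tail
        log_tail += 1
    return count_seeks(in_place), count_seeks(log_structured)


in_place_seeks, lfs_seeks = simulate()
```

The in-place layout stays contiguous, while every randomly overwritten
block in the log-structured layout ends up far away from its logical
neighbours, so a later sequential read has to jump around.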

-Doug


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-06 00:54:50
Message-ID: 1157504090.20589.55.camel@dogma.v10.wvs

On Tue, 2006-09-05 at 18:24 -0400, Chris Browne wrote:
> Recently seen in ACM Operating Systems Review (this is the first time
> I've found as many as 1 interesting article in it in a while, and
> there were 3 things I found worthwhile...):
>
> NTT (of the recent "NTT Power Hour") have created a new filesystem:
> <http://www.nilfs.org/en/>
>
> NILFS is a log-structured file system developed for Linux.
>

As I understand LFSs, they are not ideal for a database system. An LFS
is optimized so that it writes sequentially. However, PostgreSQL already
writes transactions sequentially in the WAL, and tries to optimize the
cleaning of dirty data pages with the background writer. So I don't see
the advantage of an LFS for a database.

Also, LFSs assume a very effective read cache. Databases often hold much
more than can fit in read cache, and so frequently require disk access
for reads. An LFS scatters the data all over the disk, which destroys
the sequential access that PostgreSQL depends on for efficient index and
table scans.

Do you see an advantage in using LFS for PostgreSQL?

Did the quotation refer to people leaving write cache enabled on a
journaling filesystem?

Regards,
Jeff Davis


From: mark(at)mark(dot)mielke(dot)cc
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-06 03:28:06
Message-ID: 20060906032806.GA8298@mark.mielke.cc

On Tue, Sep 05, 2006 at 05:54:50PM -0700, Jeff Davis wrote:
> On Tue, 2006-09-05 at 18:24 -0400, Chris Browne wrote:
> > Recently seen in ACM Operating Systems Review (this is the first time
> > I've found as many as 1 interesting article in it in a while, and
> > there were 3 things I found worthwhile...):
> > ...
> > NILFS is a log-structured file system developed for Linux.
> As I understand LFSs, they are not ideal for a database system. An LFS
> is optimized so that it writes sequentially. However, PostgreSQL already
> ...
> Do you see an advantage in using LFS for PostgreSQL?

Hey guys - I think the original poster only meant to suggest that it
was *interesting*... :-)

To me, applying database concepts to file systems is interesting, and
practical. It's not a perfected science by any means, but the idea that
a file system is a hierarchical database isn't new. :-)

Applying any database on top of another database seems inefficient to me.
That's one reason why I argue the opposite - PostgreSQL *should* have its
own on-disk layout, and not be laid out on top of another generic
system designed for purposes other than database storage. The reason it
isn't pursued at present, and perhaps should not be pursued at present,
is that PostgreSQL has other more important priorities in the short term.

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com
Neighbourhood Coder
Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-06 17:22:05
Message-ID: 1157563325.20589.119.camel@dogma.v10.wvs

On Tue, 2006-09-05 at 23:28 -0400, mark(at)mark(dot)mielke(dot)cc wrote:
> On Tue, Sep 05, 2006 at 05:54:50PM -0700, Jeff Davis wrote:
> > On Tue, 2006-09-05 at 18:24 -0400, Chris Browne wrote:
> > > Recently seen in ACM Operating Systems Review (this is the first time
> > > I've found as many as 1 interesting article in it in a while, and
> > > there were 3 things I found worthwhile...):
> > > ...
> > > NILFS is a log-structured file system developed for Linux.
> > As I understand LFSs, they are not ideal for a database system. An LFS
> > is optimized so that it writes sequentially. However, PostgreSQL already
> > ...
> > Do you see an advantage in using LFS for PostgreSQL?
>
> Hey guys - I think the original poster only meant to suggest that it
> was *interesting*... :-)
>

I see, my mistake.

> Applying any database on top of another database seems inefficient to me.
> That's one reason why I argue the opposite - PostgreSQL *should* have its
> own on-disk layout, and not be laid out on top of another generic
> system designed for purposes other than database storage. The reason it
> isn't pursued at present, and perhaps should not be pursued at present,
> is that PostgreSQL has other more important priorities in the short term.
>

I think that it would be a higher priority if someone showed a
substantial performance improvement. Some filesystems don't really cause
much overhead that isn't needed by PostgreSQL.

If someone did show a substantial improvement, I would be interested to
see it.

And if there is an improvement, shouldn't that be a project for
something like Linux, where other databases could also benefit? It could
just be implemented as a database-specific filesystem.

Regards,
Jeff Davis


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-06 22:55:24
Message-ID: 60ac5csizn.fsf@dba2.int.libertyrms.com

pgsql(at)j-davis(dot)com (Jeff Davis) writes:
> On Tue, 2006-09-05 at 23:28 -0400, mark(at)mark(dot)mielke(dot)cc wrote:
>> On Tue, Sep 05, 2006 at 05:54:50PM -0700, Jeff Davis wrote:
>> > On Tue, 2006-09-05 at 18:24 -0400, Chris Browne wrote:
>> > > Recently seen in ACM Operating Systems Review (this is the first time
>> > > I've found as many as 1 interesting article in it in a while, and
>> > > there were 3 things I found worthwhile...):
>> > > ...
>> > > NILFS is a log-structured file system developed for Linux.
>> > As I understand LFSs, they are not ideal for a database system. An LFS
>> > is optimized so that it writes sequentially. However, PostgreSQL already
>> > ...
>> > Do you see an advantage in using LFS for PostgreSQL?
>>
>> Hey guys - I think the original poster only meant to suggest that it
>> was *interesting*... :-)
>>
>
> I see, my mistake.

From a reliability perspective, I can see some value to it...

I have seen far too many databases corrupted by journalling gone bad
in the past year... :-(

>> Applying any database on top of another database seems inefficient
>> to me. That's one reason why I argue the opposite - PostgreSQL
>> *should* have its own on-disk layout, and not be laid out on top
>> of another generic system designed for purposes other than database
>> storage. The reason it isn't pursued at present, and perhaps should
>> not be pursued at present, is that PostgreSQL has other more
>> important priorities in the short term.
>
> I think that it would be a higher priority if someone showed a
> substantial performance improvement. Some filesystems don't really
> cause much overhead that isn't needed by PostgreSQL.
>
> If someone did show a substantial improvement, I would be interested
> to see it.
>
> And if there is an improvement, shouldn't that be a project for
> something like Linux, where other databases could also benefit? It
> could just be implemented as a database-specific filesystem.

The classic problem with log structured filesystems is that sequential
reads tend to be less efficient than in overwriting systems; perhaps
if they can get "vacuuming" to be done frequently enough, that might
change the shape of things.

That would be a relevant lesson that _we_ have discovered that is
potentially applicable to filesystem implementors.

And I don't consider this purely of academic interest; the ability to:
a) Avoid the double writing of journalling, and
b) Avoid the risks of failures due to misordered writes
are both genuinely valuable.
--
output = reverse("ofni.sesabatadxunil" "@" "enworbbc")
http://cbbrowne.com/info/lisp.html
All ITS machines now have hardware for a new machine instruction --
PFLT Prove Fermat's Last Theorem.
Please update your programs.


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-06 23:58:36
Message-ID: 1157587116.20589.136.camel@dogma.v10.wvs

On Wed, 2006-09-06 at 18:55 -0400, Chris Browne wrote:
> pgsql(at)j-davis(dot)com (Jeff Davis) writes:
> >> > Do you see an advantage in using LFS for PostgreSQL?
> >>
> >> Hey guys - I think the original poster only meant to suggest that it
> >> was *interesting*... :-)
> >>
> >
> > I see, my mistake.
>
> From a reliability perspective, I can see some value to it...
>
> I have seen far too many databases corrupted by journalling gone bad
> in the past year... :-(
>

Can you elaborate a little? Which filesystems have been problematic?
Which filesystems are you more confident in?

> >
> > And if there is an improvement, shouldn't that be a project for
> > something like Linux, where other databases could also benefit? It
> > could just be implemented as a database-specific filesystem.
>
> The classic problem with log structured filesystems is that sequential
> reads tend to be less efficient than in overwriting systems; perhaps
> if they can get "vacuuming" to be done frequently enough, that might
> change the shape of things.
>
> That would be a relevant lesson that _we_ have discovered that is
> potentially applicable to filesystem implementors.
>
> And I don't consider this purely of academic interest; the ability to:
> a) Avoid the double writing of journalling, and
> b) Avoid the risks of failures due to misordered writes
> are both genuinely valuable.

Right, LFS is promising in a number of ways. I've read about it in the
past, and it would be nice if this NILFS implementation sparks some new
research in the area.

Regards,
Jeff Davis


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-07 02:12:15
Message-ID: 87ac5czaps.fsf@wolfe.cbbrowne.com

pgsql(at)j-davis(dot)com (Jeff Davis) wrote:
> On Wed, 2006-09-06 at 18:55 -0400, Chris Browne wrote:
>> pgsql(at)j-davis(dot)com (Jeff Davis) writes:
>> >> > Do you see an advantage in using LFS for PostgreSQL?
>> >>
>> >> Hey guys - I think the original poster only meant to suggest that it
>> >> was *interesting*... :-)
>> >>
>> >
>> > I see, my mistake.
>>
>> From a reliability perspective, I can see some value to it...
>>
>> I have seen far too many databases corrupted by journalling gone bad
>> in the past year... :-(
>
> Can you elaborate a little? Which filesystems have been problematic?
> Which filesystems are you more confident in?

Well, more or less *all* of them, on AMD-64/Linux.

The "pulling the fibrechannel cable" test blew them all. XFS, ext3,
JFS. ReiserFS was, if I recall correctly, marginally better, but only
marginally.

On AIX, we have seen JFS2 falling over when there were enough levels
of buffering in the way on disk arrays.

>> > And if there is an improvement, shouldn't that be a project for
>> > something like Linux, where other databases could also benefit?
>> > It could just be implemented as a database-specific filesystem.
>>
>> The classic problem with log structured filesystems is that
>> sequential reads tend to be less efficient than in overwriting
>> systems; perhaps if they can get "vacuuming" to be done frequently
>> enough, that might change the shape of things.
>>
>> That would be a relevant lesson that _we_ have discovered that is
>> potentially applicable to filesystem implementors.
>>
>> And I don't consider this purely of academic interest; the ability to:
>> a) Avoid the double writing of journalling, and
>> b) Avoid the risks of failures due to misordered writes
>> are both genuinely valuable.
>
> Right, LFS is promising in a number of ways. I've read about it in
> the past, and it would be nice if this NILFS implementation sparks
> some new research in the area.

Indeed.

I don't see it being a "production-ready" answer yet, but yeah, I'd
certainly like to see the research continue. A vital problem is in
the area of vacuuming; there may be things to be learned in both
directions.
--
output = reverse("moc.liamg" "@" "enworbbc")
http://linuxdatabases.info/info/fs.html
Health is merely the slowest possible rate at which one can die.


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-07 17:16:40
Message-ID: 1157649400.20589.162.camel@dogma.v10.wvs

On Wed, 2006-09-06 at 22:12 -0400, Christopher Browne wrote:

> > Can you elaborate a little? Which filesystems have been problematic?
> > Which filesystems are you more confident in?
>
> Well, more or less *all* of them, on AMD-64/Linux.
>
> The "pulling the fibrechannel cable" test blew them all. XFS, ext3,
> JFS. ReiserFS was, if I recall correctly, marginally better, but only
> marginally.
>
> On AIX, we have seen JFS2 falling over when there were enough levels
> of buffering in the way on disk arrays.
>

Well, that's interesting. I suppose I can't count on the filesystem as
much as I thought. Are you implying that the filesystems aren't ready on
64-bit? Is it more of a hardware issue (a controller lying about the
security of the write)? Any comments on FreeBSD/UFS+SU? I would expect
UFS+SU to have similar issues, since it depends on write ordering also.

What do you do for better data security (aside from the obvious "don't
pull cables")?

Regards,
Jeff Davis


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New Linux Filesystem: NILFS
Date: 2006-09-07 22:32:34
Message-ID: 6064fzs3y5.fsf@dba2.int.libertyrms.com

pgsql(at)j-davis(dot)com (Jeff Davis) writes:
> On Wed, 2006-09-06 at 22:12 -0400, Christopher Browne wrote:
>
>> > Can you elaborate a little? Which filesystems have been problematic?
>> > Which filesystems are you more confident in?
>>
>> Well, more or less *all* of them, on AMD-64/Linux.
>>
>> The "pulling the fibrechannel cable" test blew them all. XFS, ext3,
>> JFS. ReiserFS was, if I recall correctly, marginally better, but only
>> marginally.
>>
>> On AIX, we have seen JFS2 falling over when there were enough levels
>> of buffering in the way on disk arrays.
>
> Well, that's interesting. I suppose I can't count on the filesystem
> as much as I thought. Are you implying that the filesystems aren't
> ready on 64-bit?

I don't think this is necessarily a 64-bit issue; it's more that with
the more esoteric, expensive disk array hardware, there are fewer people
with the ability to test it, because you need $200K worth of hardware
around to do the testing.

> Is it more of a hardware issue (a controller lying about the
> security of the write)? Any comments on FreeBSD/UFS+SU? I would
> expect UFS+SU to have similar issues, since it depends on write
> ordering also.
>
> What do you do for better data security (aside from the obvious
> "don't pull cables")?

The last time we looked, FreeBSD wasn't an option at all, because
there wasn't any suitable FibreChannel support. That may have
changed; haven't researched lately.

The problem that the NILFS people pointed out seems a troublesome one,
namely that the more levels of caching there are (even if battery-backed),
the less certain you can be that the hardware isn't lying about write
ordering.

I haven't got an answer...
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://linuxdatabases.info/info/multiplexor.html
Jury -- Twelve people who determine which client has the better
lawyer.