Re: PG on NFS may be just a bad idea

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org, pgsql-novice(at)postgreSQL(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: PG on NFS may be just a bad idea
Date: 2007-09-29 03:58:22
Message-ID: 25517.1191038302@sss.pgh.pa.us
Lists: pgsql-docs pgsql-hackers pgsql-novice

I spent a bit of time tonight poking at the issue reported here:
http://archives.postgresql.org/pgsql-novice/2007-08/msg00123.php

It turns out to be quite easy to reproduce, at least for me: start CVS
HEAD on an NFS-mounted $PGDATA directory, and run the contrib regression
tests ("make installcheck" in contrib/). I see more than half of the
DROP DATABASE commands complaining in exactly the way Mija describes.
This failure rate might be an artifact of the particular environment
(I tested NFS client = Fedora Core 6, server = HPUX 10.20 on a much
slower machine) but the problem is clearly real.

In the earlier thread I cited suggestions that this behavior comes from
client programs holding files open longer than they should. However,
strace'ing this behavior shows no evidence at all that that is happening
in Postgres. I have an strace that shows conclusively that the bgwriter
never opened any file in the target database at all, and all earlier
backends exited before the one doing the DROP DATABASE began its dirty
work, and yet:

[pid 19211] 22:50:30.517077 rmdir("base/18193") = -1 ENOTEMPTY (Directory not empty)
[pid 19211] 22:50:30.517863 write(2, "WARNING: could not remove file "..., 79WARNING: could not remove file or directory "base/18193": Directory not empty
) = 79
[pid 19211] 22:50:30.517974 sendto(7, "N\0\0\0rSWARNING\0C01000\0Mcould not "..., 115, 0, NULL, 0) = 115

After some googling I think that the damage may actually be getting done
at the kernel level. According to
http://www.time-travellers.org/shane/papers/NFS_considered_harmful.html
it is fairly common for NFS clients to cache writes, meaning that the
kernel itself may be holding an old write and not sending it to the NFS
server until after the file deletion command has been sent.

(I don't have the network-fu needed to prove that this is happening by
sniffing the network traffic; anyone want to try?)
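For anyone who wants to look at the failure from the client side rather than on the wire, here is a minimal sketch, not what Postgres itself does. It unlinks the regular files in a directory, tries rmdir, and if the kernel reports the directory as non-empty it returns whatever entries the client can still see. The function name drop_directory is hypothetical, and the sketch only mimics the outline of the DROP DATABASE cleanup.

```python
import errno
import os


def drop_directory(path):
    """Unlink every regular file in `path`, then rmdir it.

    Returns the entries still visible if rmdir reports a non-empty
    directory, or an empty list if the directory was removed.
    """
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            os.unlink(full)
    try:
        os.rmdir(path)
    except OSError as exc:
        # POSIX allows either ENOTEMPTY or EEXIST for a non-empty directory.
        if exc.errno not in (errno.ENOTEMPTY, errno.EEXIST):
            raise
        # Report what this client still sees, for diagnosis.
        return sorted(os.listdir(path))
    return []
```

Running this against a scratch directory on the NFS mount and printing the returned list should show which names survive the unlink calls, which is a cheaper first step than capturing network traffic.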

If this is what's happening I'd claim it is a kernel bug, but seeing
that I see it on FC6 and Mija sees it on Solaris 10, it would be a bug
widespread enough that we'd not be likely to get it killed off soon.

Maybe we need to actively discourage people from running Postgres
against NFS-mounted data directories. Shane Kerr's paper cited above
mentions some other rather scary properties, including O_EXCL file
creation not really working properly.
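For readers unfamiliar with the O_EXCL idiom, a minimal sketch of exclusive file creation follows. The helper name create_exclusive is hypothetical. On a local filesystem the kernel guarantees that at most one caller creates the file; the concern raised in the paper is that some NFS setups may not give the same guarantee, and this sketch cannot detect that by itself.

```python
import os


def create_exclusive(path):
    """Create `path` only if it does not already exist.

    Returns True if this call created the file, False if it was
    already there.
    """
    try:
        # O_CREAT | O_EXCL asks the kernel to fail if the file exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

A lock-file scheme built on this is only as reliable as the filesystem's handling of O_EXCL, which is why the property matters for a data directory.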

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-novice(at)postgresql(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-09-29 14:23:59
Message-ID: 46FE5FFF.6070606@sun.com

Tom Lane wrote:

>
> If this is what's happening I'd claim it is a kernel bug, but seeing
> that I see it on FC6 and Mija sees it on Solaris 10, it would be a bug
> widespread enough that we'd not be likely to get it killed off soon.
>

I think my colleague was solving a similar issue in JavaDB. IIRC the
problem is in how NFS works, and the conclusion was not to use JavaDB
(Derby) on NFS. I forwarded this issue to our NFS gurus and I will send
updated information.

Zdenek


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, pgsql-novice(at)postgreSQL(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: [HACKERS] PG on NFS may be just a bad idea
Date: 2007-10-01 17:13:41
Message-ID: 47012AC5.9040808@agliodbs.com

Tom,

> Maybe we need to actively discourage people from running Postgres
> against NFS-mounted data directories. Shane Kerr's paper cited above
> mentions some other rather scary properties, including O_EXCL file
> creation not really working properly.

Wouldn't you be describing a Linux-specific issue, though? And possibly
kernel-specific?

It's hard to reconcile this with the real-world performance of
PostgreSQL on NFS, which is happening all over the place. Most notably,
Joe Conway's 20,000 txn/sec.

I *do* think it's an accurate statement that if you're going to use
Postgres, or any other OLTP database, on NFS you'd better have access to
a NAS expert. But to say that it's a bad idea even if you have expert
help is probably going too far.

--Josh Berkus


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org, pgsql-novice(at)postgreSQL(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: [HACKERS] PG on NFS may be just a bad idea
Date: 2007-10-01 17:24:41
Message-ID: 18214.1191259481@sss.pgh.pa.us

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Maybe we need to actively discourage people from running Postgres
>> against NFS-mounted data directories.

> It's hard to reconcile this with the real-world performance of
> PostgreSQL on NFS, which is happening all over the place. Most notably,
> Joe Conway's 20,000 txn/sec.

This is not a question of performance, it is a question of whether you
are willing to tolerate corner-case misbehaviors.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org, pgsql-novice(at)postgreSQL(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-01 18:36:54
Message-ID: 1191263814.4260.57.camel@ebony.site

On Mon, 2007-10-01 at 10:13 -0700, Josh Berkus wrote:

> > Maybe we need to actively discourage people from running Postgres
> > against NFS-mounted data directories. Shane Kerr's paper cited above
> > mentions some other rather scary properties, including O_EXCL file
> > creation not really working properly.
>
> Wouldn't you be describing a Linux-specific issue, though? And possibly
> kernel-specific?

Possibly, though if you have any specific refutations of the Kerr paper
then it would be a good idea to air them. It isn't enough to just hint
that some exist.

> It's hard to reconcile this with the real-world performance of
> PostgreSQL on NFS, which is happening all over the place. Most notably,
> Joe Conway's 20,000 txn/sec.
>
> I *do* think it's an accurate statement that if you're going to use
> Postgres, or any other OLTP database, on NFS you'd better have access to
> a NAS expert. But to say that it's a bad idea even if you have expert
> help is probably going too far.

I can see many papers on database performance on NFS, but I don't see
any discussion of potential reliability concerns. If anybody sits near
an NAS expert, it would be great to have that discussion.

I have found some comments that other databases require "specific
configuration settings to ensure efficient and correct usage" of NFS "to
access NAS storage devices".

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-01 18:54:48
Message-ID: 1191264888.4260.69.camel@ebony.site

On Mon, 2007-10-01 at 19:36 +0100, Simon Riggs wrote:

> > I *do* think it's an accurate statement that if you're going to use
> > Postgres, or any other OLTP database, on NFS you'd better have access to
> > a NAS expert. But to say that it's a bad idea even if you have expert
> > help is probably going too far.
>
> I can see many papers on database performance on NFS, but I don't see
> any discussion of potential reliability concerns. If anybody sits near
> an NAS expert, it would be great to have that discussion.

http://blogs.netapp.com/dave/2007/08/oracle-optimize.html

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-01 23:25:16
Message-ID: 24184.1191281116@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> http://blogs.netapp.com/dave/2007/08/oracle-optimize.html

Not a whole lot of technical content there, but pretty interesting
nonetheless. I *think* that the issues we're seeing are largely in the
NFS client-side kernel code, so bypassing that stack as Oracle is doing
might eliminate the problem. Of course, there's a sizable amount of
code to be written to do that ...

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-01 23:46:52
Message-ID: 20071001234651.GD9430@alvh.no-ip.org

Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > http://blogs.netapp.com/dave/2007/08/oracle-optimize.html
>
> Not a whole lot of technical content there, but pretty interesting
> nonetheless. I *think* that the issues we're seeing are largely in the
> NFS client-side kernel code, so bypassing that stack as Oracle is doing
> might eliminate the problem. Of course, there's a sizable amount of
> code to be written to do that ...

Yeah. Next step we will be writing our own malloc.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-08 20:46:26
Message-ID: 200710082046.l98KkQj28574@momjian.us

Alvaro Herrera wrote:
> Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > http://blogs.netapp.com/dave/2007/08/oracle-optimize.html
> >
> > Not a whole lot of technical content there, but pretty interesting
> > nonetheless. I *think* that the issues we're seeing are largely in the
> > NFS client-side kernel code, so bypassing that stack as Oracle is doing
> > might eliminate the problem. Of course, there's a sizable amount of
> > code to be written to do that ...
>
> Yeah. Next step we will be writing our own malloc.

I assume there should be a ;-) in there because we already have our own
malloc (palloc).

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-08 20:50:35
Message-ID: 20071008205035.GF6000@alvh.no-ip.org

Bruce Momjian wrote:
> Alvaro Herrera wrote:
> > Tom Lane wrote:
> > > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > > http://blogs.netapp.com/dave/2007/08/oracle-optimize.html
> > >
> > > Not a whole lot of technical content there, but pretty interesting
> > > nonetheless. I *think* that the issues we're seeing are largely in the
> > > NFS client-side kernel code, so bypassing that stack as Oracle is doing
> > > might eliminate the problem. Of course, there's a sizable amount of
> > > code to be written to do that ...
> >
> > Yeah. Next step we will be writing our own malloc.
>
> I assume there should be a ;-) in there because we already have our own
> malloc (palloc).

Yeah, some sort of smiley should be there. But what I'm talking about
is rewriting the underlying memory allocation mechanism, just like we
would be rewriting the NFS client.

palloc uses malloc underneath. My thought is to replace that with
sbrk, mmap or something like that. Not very portable though, a lot of
work, and most likely not nearly enough benefits.

--
Alvaro Herrera http://www.PlanetPostgreSQL.org/
"Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)


From: Neil Conway <neilc(at)samurai(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-08 21:22:17
Message-ID: 1191878537.26227.4.camel@dell.linuxdev.us.dell.com

On Mon, 2007-10-08 at 16:50 -0400, Alvaro Herrera wrote:
> palloc uses malloc underneath. My thought is to replace that with
> sbrk, mmap or something like that. Not very portable though, a lot of
> work, and most likely not nearly enough benefits.

Yeah, I agree this isn't likely to be a win in the general case.
However, it would be interesting to explore a specialized allocator for
short-lived memory contexts, where we don't care about having an
effective pfree(). If the context is going to be reset or deleted
shortly anyway, we could probably optimize and simplify palloc() by
skipping free space accounting and then make pfree() a no-op. I recall
Tom mentioning something to this effect a few months back...
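To make the idea concrete, here is a rough sketch of such a context, written in Python purely for illustration. It is not PostgreSQL's allocator, and the class and method names (BumpContext, alloc, free, reset) are hypothetical. Allocation just advances an offset, free does nothing, and all space comes back in one step at reset.

```python
class BumpContext:
    """Hypothetical short-lived context: allocate by bumping an offset."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity)
        self._used = 0

    @property
    def used(self):
        """Number of bytes handed out since the last reset."""
        return self._used

    def alloc(self, size):
        """Return a writable view of `size` bytes; no per-chunk accounting."""
        if size < 0 or self._used + size > len(self._buf):
            raise MemoryError("context exhausted")
        start = self._used
        self._used += size
        return memoryview(self._buf)[start:start + size]

    def free(self, chunk):
        """Deliberately a no-op: space is reclaimed only by reset()."""

    def reset(self):
        """Release everything at once by rewinding the offset."""
        self._used = 0
```

The trade-off is visible in the sketch: alloc is a bounds check and an addition, but memory handed out before a reset can never be reused, so it only suits contexts that really are short-lived.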

-Neil


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Mija Lee <mija(at)scharp(dot)org>
Subject: Re: PG on NFS may be just a bad idea
Date: 2007-10-09 07:44:21
Message-ID: 1191915861.4223.601.camel@ebony.site

On Mon, 2007-10-01 at 19:25 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > http://blogs.netapp.com/dave/2007/08/oracle-optimize.html
>
> Not a whole lot of technical content there, but pretty interesting
> nonetheless. I *think* that the issues we're seeing are largely in the
> NFS client-side kernel code, so bypassing that stack as Oracle is doing
> might eliminate the problem. Of course, there's a sizable amount of
> code to be written to do that ...

Yeh, that would take a while.

I thought of another reason to do that also.

If you put a tablespace on an NFS mount and the remote server crashes,
it sounds like there could be a window of potential data loss. We could
guard against that by recovering the tablespace, but we don't do that
unless the local server crashes.

So having your own NFS client would allow you to tell that the link had
dropped and needed to be recovered.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-documentation <pgsql-docs(at)postgresql(dot)org>
Subject: Re: [HACKERS] PG on NFS may be just a bad idea
Date: 2007-11-04 21:51:38
Message-ID: 200711042151.lA4LpcP29113@momjian.us


Based on this analysis, I have added an NFS section to the tablespaces
portion of the documentation, and linked to it from 'Creating a database
cluster'. Patch attached.

---------------------------------------------------------------------------

Tom Lane wrote:
> I spent a bit of time tonight poking at the issue reported here:
> http://archives.postgresql.org/pgsql-novice/2007-08/msg00123.php
>
> It turns out to be quite easy to reproduce, at least for me: start CVS
> HEAD on an NFS-mounted $PGDATA directory, and run the contrib regression
> tests ("make installcheck" in contrib/). I see more than half of the
> DROP DATABASE commands complaining in exactly the way Mija describes.
> This failure rate might be an artifact of the particular environment
> (I tested NFS client = Fedora Core 6, server = HPUX 10.20 on a much
> slower machine) but the problem is clearly real.
>
> In the earlier thread I cited suggestions that this behavior comes from
> client programs holding files open longer than they should. However,
> strace'ing this behavior shows no evidence at all that that is happening
> in Postgres. I have an strace that shows conclusively that the bgwriter
> never opened any file in the target database at all, and all earlier
> backends exited before the one doing the DROP DATABASE began its dirty
> work, and yet:
>
> [pid 19211] 22:50:30.517077 rmdir("base/18193") = -1 ENOTEMPTY (Directory not empty)
> [pid 19211] 22:50:30.517863 write(2, "WARNING: could not remove file "..., 79WARNING: could not remove file or directory "base/18193": Directory not empty
> ) = 79
> [pid 19211] 22:50:30.517974 sendto(7, "N\0\0\0rSWARNING\0C01000\0Mcould not "..., 115, 0, NULL, 0) = 115
>
> After some googling I think that the damage may actually be getting done
> at the kernel level. According to
> http://www.time-travellers.org/shane/papers/NFS_considered_harmful.html
> it is fairly common for NFS clients to cache writes, meaning that the
> kernel itself may be holding an old write and not sending it to the NFS
> server until after the file deletion command has been sent.
>
> (I don't have the network-fu needed to prove that this is happening by
> sniffing the network traffic; anyone want to try?)
>
> If this is what's happening I'd claim it is a kernel bug, but seeing
> that I see it on FC6 and Mija sees it on Solaris 10, it would be a bug
> widespread enough that we'd not be likely to get it killed off soon.
>
> Maybe we need to actively discourage people from running Postgres
> against NFS-mounted data directories. Shane Kerr's paper cited above
> mentions some other rather scary properties, including O_EXCL file
> creation not really working properly.
>
> regards, tom lane

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachment: /rtmp/diff (text/x-diff, 2.4 KB)