Too-many-files errors on OS X

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Olivier Hubaut <olivier(at)scmbb(dot)ulb(dot)ac(dot)be>
Subject: Too-many-files errors on OS X
Date: 2004-02-21 05:19:47
Message-ID: 9355.1077340787@sss.pgh.pa.us
Lists: pgsql-hackers

I've looked into Olivier Hubaut's recent reports of 'Too many open
files' errors on OS X. What I find is that on Darwin, where we are
using Posix semaphores rather than SysV semaphores, each Posix semaphore
is treated as an open file --- it shows up in "lsof" output, and more to
the point it appears to count against a process's ulimit -n limit.
This means that if you are running with, say, max_connections = 100,
that's 100+ open files in the postmaster and every active backend.
And it's 100+ open files that aren't accounted for in fd.c's estimate
of how many files it can open. Since the ulimit -n setting is by
default only 256 on this platform, it doesn't take much at all for us to
be bumping up against the ulimit -n limit. fd.c copes fine, since it
automatically closes other open files any time it gets an EMFILE error.
But code outside fd.c is likely to fail hard ... which is exactly the
symptom we saw in Olivier's report.

I plan to apply some band-aid fixes to make that code more robust;
for instance we can push all calls to opendir() into fd.c so that
EMFILE can be handled by closing other open files. (And why does
MoveOfflineLogs PANIC on this anyway? It's not critical code...)
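
Roughly the sort of wrapper I have in mind (a sketch only; the function and
helper names here are illustrative, not necessarily what fd.c will end up
with):

#include <dirent.h>
#include <errno.h>

/* fd.c helper assumed here: close one least-recently-used virtual FD,
 * returning nonzero if it managed to free up a kernel descriptor. */
extern int ReleaseLruFile(void);

DIR *
AllocateDir(const char *dirname)
{
    for (;;)
    {
        DIR *dir = opendir(dirname);

        if (dir != NULL)
            return dir;
        if (errno != EMFILE && errno != ENFILE)
            return NULL;        /* let the caller report the error */
        if (!ReleaseLruFile())
            return NULL;        /* nothing left for fd.c to close */
    }
}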

However, it seems that the real problem here is that we are so far off
base about how many files we can open. I wonder whether we should stop
relying on sysconf() and instead try to make some direct probe of the
number of files we can open. I'm imagining repeatedly open() until
failure at some point during postmaster startup, and then save that
result as the number-of-openable-files limit.
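
A first cut at such a probe might look about like this (just a sketch; the
real thing would live in fd.c, cap the probe at some sane maximum, and leave
a bit of headroom for code outside fd.c):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Probe how many files this process can really open, by opening /dev/null
 * until open() fails (normally with EMFILE), then closing them all again. */
static int
count_openable_files(int max_probe)
{
    int    *fds = malloc(max_probe * sizeof(int));
    int     used = 0;
    int     result;

    if (fds == NULL)
        return -1;
    while (used < max_probe)
    {
        int fd = open("/dev/null", O_RDONLY, 0);

        if (fd < 0)
            break;              /* typically EMFILE at this point */
        fds[used++] = fd;
    }
    result = used;
    while (used > 0)
        close(fds[--used]);
    free(fds);
    return result;
}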

I also notice that OS X 10.3 seems to have working SysV semaphore
support. I am tempted to change template/darwin to use SysV where
available, instead of Posix semaphores. I wonder whether inheriting
100-or-so open file descriptors every time we launch a backend isn't
in itself a nasty performance hit, quite aside from its effect on how
many normal files we can open.

Comments anyone? There are a lot of unknowns here...

regards, tom lane


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-22 11:57:36
Message-ID: 20040222115736.GJ2608@filer
Lists: pgsql-hackers

Tom Lane wrote:
> However, it seems that the real problem here is that we are so far off
> base about how many files we can open. I wonder whether we should stop
> relying on sysconf() and instead try to make some direct probe of the
> number of files we can open. I'm imagining repeatedly open() until
> failure at some point during postmaster startup, and then save that
> result as the number-of-openable-files limit.

I strongly favor this method. In particular, the probe should probably
be done after all shared libraries have been loaded and initialized.

I originally thought that each shared library that was loaded would eat
a file descriptor (since I thought it would be implemented via mmap())
but that doesn't seem to be the case, at least under Linux (for those
who are curious, you can close the underlying file after you perform
the mmap() and the mapped region still works). If it's true under any
OS then it would certainly be prudent to measure the available file
descriptors after the shared libs have been loaded (another reason is
that the init function of a library might itself open a file and keep
it open, but this isn't likely to happen very often).
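
A quick way to see this (on Linux, at least) is something like the following;
the file name is arbitrary, any readable non-empty file will do:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only, close its descriptor, and show that the mapping
 * is still readable afterwards. */
int main(void)
{
    struct stat st;
    char       *p;
    int         fd = open("/etc/hosts", O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;
    p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    close(fd);                          /* the descriptor is gone... */
    printf("first byte: %c\n", p[0]);   /* ...but the mapping still works */
    munmap(p, st.st_size);
    return 0;
}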

> I also notice that OS X 10.3 seems to have working SysV semaphore
> support. I am tempted to change template/darwin to use SysV where
> available, instead of Posix semaphores. I wonder whether inheriting
> 100-or-so open file descriptors every time we launch a backend isn't
> in itself a nasty performance hit, quite aside from its effect on how
> many normal files we can open.

I imagine this could easily be tested. I rather doubt that the
performance hit would be terribly large, but we certainly shouldn't rule
it out without testing it first.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-22 17:18:47
Message-ID: 17636.1077470327@sss.pgh.pa.us
Lists: pgsql-hackers

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Tom Lane wrote:
>> I'm imagining repeatedly open() until
>> failure at some point during postmaster startup, and then save that
>> result as the number-of-openable-files limit.

> I strongly favor this method. In particular, the probe should probably
> be done after all shared libraries have been loaded and initialized.

Don't think we can postpone it that long; fd.c is a pretty basic
facility. In any case, what of shared libraries opened after postmaster
startup?

I was thinking a bit yesterday about how to account for open files
chewed up by shared libraries. A simplistic solution is just to
decrease the fd.c limit by one each time we LOAD a new shlib. (The
subroutine to do this could presumably be called by the shlib, too,
if it had its own requirements for permanently-open files.) However,
that doesn't account for shared libraries that get pulled in indirectly.
For instance, loading pltcl.so probably pulls in libtcl.so. How could
we detect that? Is it worth repeating the open-till-fail exercise every
time we load a shlib? (Seems like not, and anyway there are stability
issues if multiple backends decide to do that in parallel.)
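
The subroutine itself could hardly be simpler; maybe something like this (the
name is invented for illustration, and max_safe_fds stands in for whatever
variable fd.c uses to track how many descriptors it may consume):

/* Tell fd.c that one more file descriptor is permanently spoken for
 * outside its control, e.g. by a just-loaded shared library. */
extern int max_safe_fds;

void
ReserveExternalFD(void)
{
    /* never let fd.c's own allowance drop below some minimum */
    if (max_safe_fds > 10)
        max_safe_fds--;
}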

> I originally thought that each shared library that was loaded would eat
> a file descriptor (since I thought it would be implemented via mmap())
> but that doesn't seem to be the case, at least under Linux

Hmm. This may be OS-specific. The shlibs certainly show up in the
output of lsof in every variant I've checked, but do they count against
your open-file limit?

regards, tom lane


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 00:59:34
Message-ID: 20040223005933.GK2608@filer
Lists: pgsql-hackers

Tom Lane wrote:
> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > I originally thought that each shared library that was loaded would eat
> > a file descriptor (since I thought it would be implemented via mmap())
> > but that doesn't seem to be the case, at least under Linux
>
> Hmm. This may be OS-specific. The shlibs certainly show up in the
> output of lsof in every variant I've checked, but do they count against
> your open-file limit?

It seems not, for both shared libraries that are linked in at startup
time by the dynamic linker and shared libraries that are explicitly
opened via dlopen(). This seems to be true for Linux and Solaris (I
wasn't able to test on HP-UX, and AIX yields a strange "bad file number"
error that I've yet to track down).

Attached is the test program I used. It takes as its arguments a list
of files to hand to dlopen(), and will show how many files it was able
to open before and after running a batch of dlopen() commands.

--
Kevin Brown kevin(at)sysexperts(dot)com

Attachment: eatfds.c (text/x-csrc, 1.1 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 03:41:40
Message-ID: 20512.1077507700@sss.pgh.pa.us
Lists: pgsql-hackers

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Tom Lane wrote:
>> Hmm. This may be OS-specific. The shlibs certainly show up in the
>> output of lsof in every variant I've checked, but do they count against
>> your open-file limit?

> It seems not, for both shared libraries that are linked in at startup
> time by the dynamic linker and shared libraries that are explicitly
> opened via dlopen().

It would certainly make life a lot easier if we could assume that dlopen
doesn't reduce your open-files limit.

> Attached is the test program I used.

Can folks please try this on other platforms?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 04:00:31
Message-ID: 20640.1077508831@sss.pgh.pa.us
Lists: pgsql-hackers

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> I wasn't able to test on HP-UX

I get the same result on HPUX, after whacking the test program around
a bit: no change in the number of files we can open. Confirmations on
other platforms please, anyone?

For anyone else who has problems getting it to compile, try copying
the relevant version of pg_dlopen from src/backend/port/dynloader/.
I attach the code I actually ran on HPUX.

regards, tom lane

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
//#include <dlfcn.h>
// these seem to be needed on HPUX:
#include <a.out.h>
#include <dl.h>

int *fd;
int size = 1024;

void *
pg_dlopen(char *filename)
{
    /*
     * Use BIND_IMMEDIATE so that undefined symbols cause a failure return
     * from shl_load(), rather than an abort() later on when we attempt to
     * call the library!
     */
    shl_t handle = shl_load(filename,
                            BIND_IMMEDIATE | BIND_VERBOSE | DYNAMIC_PATH,
                            0L);

    return (void *) handle;
}

int eatallfds(void) {
    int i = 0;
    int j, myfd;

    while (1) {
        myfd = dup(0);
        if (myfd < 0) {
            fprintf (stderr, "dup() failed: %s\n", strerror(errno));
            break;
        }
        fd[i++] = myfd;
        if (i >= size) {
            size *= 2;
            fd = realloc(fd, size);
            if (fd == NULL) {
                fprintf (stderr, "Can't allocate: %s\n", strerror(errno));
                fprintf (stderr, "Had used %d descriptors\n", i);
                exit(1);
            }
        }
    }
    for (j = 0 ; j < i ; ++j) {
        close(fd[j]);
    }
    return i;
}

int main (int argc, char *argv[]) {
    int n, na;
    int i;
    void *addr;

    size = 1024;
    fd = malloc(size * sizeof(*fd));
    if (fd == NULL) {
        fprintf (stderr, "Can't allocate: %s\n", strerror(errno));
        return 1;
    }
    n = eatallfds();
    printf ("Was able to use %d file descriptors\n", n);

    na = 0;
    for (i = 1 ; i < argc ; ++i) {
        addr = pg_dlopen(argv[i]);
        if (addr != NULL) na++;
    }
    n = eatallfds();
    printf ("Was able to use %d file descriptors after opening %d shared libs\n", n, na);
    return 0;
}


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 05:28:02
Message-ID: 40398F62.8060304@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Confirmations on other platforms please, anyone?
>
> For anyone else who has problems getting it to compile, try copying
> the relevant version of pg_dlopen from src/backend/port/dynloader/. I
> attach the code I actually ran on HPUX.

FWIW:

RH9
-------------------
# ./eatallfds libperl.so libR.so libtcl.so
dup() failed: Too many open files
Was able to use 1021 file descriptors
dup() failed: Too many open files
Was able to use 1021 file descriptors after opening 3 shared libs

Fedora
-------------------
# ./eatallfds libR.so libtcl.so libperl.so
dup() failed: Too many open files
Was able to use 1021 file descriptors
dup() failed: Too many open files
Was able to use 1021 file descriptors after opening 3 shared libs

Joe


From: Larry Rosenman <ler(at)lerctr(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 10:49:02
Message-ID: 29880000.1077533342@lerlaptop.lerctr.org
Lists: pgsql-hackers

--On Sunday, February 22, 2004 23:00:31 -0500 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
wrote:

> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
>> I wasn't able to test on HP-UX
>
> I get the same result on HPUX, after whacking the test program around
> a bit: no change in the number of files we can open. Confirmations on
> other platforms please, anyone?
>
> For anyone else who has problems getting it to compile, try copying
> the relevant version of pg_dlopen from src/backend/port/dynloader/.
> I attach the code I actually ran on HPUX.
>
> regards, tom lane
>
On FreeBSD 5:

$ ./eatfds3 /usr/local/lib/libpq.so /usr/lib/libm.so
dup() failed: Too many open files
Was able to use 7146 file descriptors
dup() failed: Too many open files
Was able to use 7146 file descriptors after opening 2 shared libs
$

On UnixWare 7.1.4:
$ ./eatfds3 /usr/lib/libpq.so.3 /usr/lib/libm.so.1
dup() failed: Too many open files
Was able to use 2045 file descriptors
dup() failed: Too many open files
Was able to use 2045 file descriptors after opening 2 shared libs
$

I had to hack on the code some more for FreeBSD:
(the realloc call needed the multiplication). I ran this same code
on UnixWare.

$ cat eatfds3.c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <dlfcn.h>
// these seem to be needed on HPUX:
//#include <a.out.h>
//#include <dl.h>

int *fd;
int size = 3072;

void *
pg_dlopen(char *filename)
{
    /*
     * Use BIND_IMMEDIATE so that undefined symbols cause a failure return
     * from shl_load(), rather than an abort() later on when we attempt to
     * call the library!
     */
    caddr_t handle = dlopen(filename, RTLD_LAZY);

    return (void *) handle;
}

int eatallfds(void) {
    int i = 0;
    int j, myfd;

    while (1) {
        myfd = dup(0);
        if (myfd < 0) {
            fprintf (stderr, "dup() failed: %s\n", strerror(errno));
            break;
        }
        if (i >= size) {
            size *= 2;
            fd = realloc(fd, size * sizeof(*fd));
            if (fd == NULL) {
                fprintf (stderr, "Can't allocate: %s\n", strerror(errno));
                fprintf (stderr, "Had used %d descriptors\n", i);
                exit(1);
            }
        }
        fd[i++] = myfd;
    }
    for (j = 0 ; j < i ; ++j) {
        close(fd[j]);
    }
    return i;
}

int main (int argc, char *argv[]) {
    int n, na;
    int i;
    void *addr;

    size = 3072;
    fd = malloc((size + 1) * sizeof(*fd));
    if (fd == NULL) {
        fprintf (stderr, "Can't allocate: %s\n", strerror(errno));
        return 1;
    }
    n = eatallfds();
    printf ("Was able to use %d file descriptors\n", n);

    na = 0;
    for (i = 1 ; i < argc ; ++i) {
        addr = pg_dlopen(argv[i]);
        if (addr != NULL) na++;
    }
    n = eatallfds();
    printf ("Was able to use %d file descriptors after opening %d shared libs\n", n, na);
    return 0;
}

$

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler(at)lerctr(dot)org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 12:52:09
Message-ID: 20040223125208.GL2608@filer
Lists: pgsql-hackers

Larry Rosenman wrote:
> I had to hack on the code some more for FreeBSD:
> (the realloc call needed the multiplication). I ran this same code
> on UnixWare.

I feel like a moron, having missed that. Probably explains the "bad
file number" error I was getting on AIX, too...

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 13:02:35
Message-ID: 20040223130235.GM2608@filer
Lists: pgsql-hackers

I wrote:
> Larry Rosenman wrote:
> > I had to hack on the code some more for FreeBSD:
> > (the realloc call needed the multiplication). I ran this same code
> > on UnixWare.
>
> I feel like a moron, having missed that. Probably explains the "bad
> file number" error I was getting on AIX, too...

And sure enough, that was it. Got the same results on AIX 5 as on other
systems:

kbrown(at)m048:~$ ./eatfds /usr/lib/librpm.so.0 /usr/lib/librpmbuild.so.0
dup() failed: Too many open files
Was able to use 1997 file descriptors
dup() failed: Too many open files
Was able to use 1997 file descriptors after opening 2 shared libs
kbrown(at)m048:~$ uname -a
AIX m048 1 5 0001063A4C00

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Andrew Rawnsley <ronz(at)ravensfield(dot)com>
To: Hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 13:12:31
Message-ID: F02442DD-6601-11D8-84E2-000393A47FCC@ravensfield.com
Lists: pgsql-hackers


On Slackware 8.1:
ronz(at)steelhead:~/src$ ./eatallfds libm.so libtcl.so libjpeg.so
dup() failed: Too many open files
Was able to use 1021 file descriptors
dup() failed: Too many open files
Was able to use 1021 file descriptors after opening 3 shared libs

On OpenBSD 3.1:
grayling# ./eatallfds libcrypto.so.10.0 libkrb5.so.13.0 libncurses.so.9.0
dup() failed: Too many open files
Was able to use 125 file descriptors
dup() failed: Too many open files
Was able to use 125 file descriptors after opening 3 shared libs

On Feb 22, 2004, at 10:41 PM, Tom Lane wrote:

> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
>> Tom Lane wrote:
>>> Hmm. This may be OS-specific. The shlibs certainly show up in the
>>> output of lsof in every variant I've checked, but do they count
>>> against
>>> your open-file limit?
>
>> It seems not, for both shared libraries that are linked in at startup
>> time by the dynamic linker and shared libraries that are explicitly
>> opened via dlopen().
>
> It would certainly make life a lot easier if we could assume that
> dlopen
> doesn't reduce your open-files limit.
>
>> Attached is the test program I used.
>
> Can folks please try this on other platforms?
>
> regards, tom lane
--------------------

Andrew Rawnsley
President
The Ravensfield Digital Resource Group, Ltd.
(740) 587-0114
www.ravensfield.com


From: Larry Rosenman <ler(at)lerctr(dot)org>
To: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-23 13:31:23
Message-ID: 19600000.1077543083@lerlaptop.lerctr.org
Lists: pgsql-hackers

--On Monday, February 23, 2004 04:52:09 -0800 Kevin Brown
<kevin(at)sysexperts(dot)com> wrote:

> Larry Rosenman wrote:
>> I had to hack on the code some more for FreeBSD:
>> (the realloc call needed the multiplication). I ran this same code
>> on UnixWare.
>
> I feel like a moron, having missed that. Probably explains the "bad
> file number" error I was getting on AIX, too...
>
It was a coredump for me, which is why I had to look at it,
and it took a while :-)

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler(at)lerctr(dot)org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-24 01:02:33
Message-ID: 15098.1077584553@sss.pgh.pa.us
Lists: pgsql-hackers

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Tom Lane wrote:
>> However, it seems that the real problem here is that we are so far off
>> base about how many files we can open. I wonder whether we should stop
>> relying on sysconf() and instead try to make some direct probe of the
>> number of files we can open. I'm imagining repeatedly open() until
>> failure at some point during postmaster startup, and then save that
>> result as the number-of-openable-files limit.

> I strongly favor this method. In particular, the probe should probably
> be done after all shared libraries have been loaded and initialized.

I've now committed changes in the 7.4 and HEAD branches to do this. Per the
recent tests, the code does not worry about tracking dlopen() calls, but
assumes that loading a shared library has no long-term impact on the
available number of FDs.

>> I also notice that OS X 10.3 seems to have working SysV semaphore
>> support. I am tempted to change template/darwin to use SysV where
>> available, instead of Posix semaphores. I wonder whether inheriting
>> 100-or-so open file descriptors every time we launch a backend isn't
>> in itself a nasty performance hit, quite aside from its effect on how
>> many normal files we can open.

> I imagine this could easily be tested.

In some simplistic tests, I couldn't find any clear difference in
backend startup time on Darwin with max_connections set to 5 vs 100.
So the idea that the extra FDs hurt us on backend startup seems wrong.
I am still a bit concerned about the possible impact of having an
unreasonably small number of available FDs, but against that we also
would have to determine whether Posix semaphores might be faster than
SysV semaphores on Darwin. I think I'll leave well enough alone unless
someone feels like running some benchmarks.

regards, tom lane


From: abe(at)purdue(dot)edu (Vic Abell)
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Too-many-files errors on OS X
Date: 2004-02-24 13:44:20
Message-ID: 825ee26b.0402240544.46ff0e91@posting.google.com
Lists: pgsql-hackers

tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) wrote in message (in part)
> ...
> Hmm. This may be OS-specific. The shlibs certainly show up in the
> output of lsof in every variant I've checked, but do they count against
> your open-file limit?

From the lsof FAQ:

> 5.2 Why doesn't Apple Darwin lsof report text file information?
>
> At the first port of lsof to Apple Darwin, revision 4.53,
> insufficient information was available -- logic and header
> files -- to permit the installation of VM space scanning
> for text files. As of lsof 4.70 it is still not available.
> Text file support will be added to Apple Darwin lsof after
> the necessary information becomes available.

Lsof calls the executable and shared libraries "text files." The
lsof FAQ may be found at:

ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/FAQ

I have developed a hack which will be released at lsof revision
4.71. A pre-release source distribution of 4.71 only for Darwin is
available at:

ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/NEW/lsof_4.71C.darwin.tar.bz2

Note that you must build the lsof executable from that distribution
and building lsof requires that you download the XNU headers from
www.opensource.apple.com/darwinsource/. Downloading the XNU headers
requires an Apple ID and password.

Vic Abell, lsof author