Mac OS X: system shutdown prevents checkpoint

Lists: pgsql-general pgsql-hackers
From: Francois Suter <dba(at)paragraf(dot)ch>
To: pgsql-general(at)postgresql(dot)org
Subject: pid gets overwritten in OSX
Date: 2002-04-25 07:47:49
Message-ID: v04210101b8ed639c5d35@[192.168.1.34]

Hi,

I'm running Postgres on Mac OSX (10.1.4). Every once in a while, I
get the following problem: for some reason the postmaster seems to
stop running postgres. When I look at the pid attributed to postgres
(in postmaster.pid) and check it against ps -aux, I see that either
the process doesn't exist anymore or that it has been overwritten by
some other program (e.g. MySQL). It's not a big problem since it is
enough to restart for the pids to get sorted (just once the problem
happened twice in a row), but does anyone have an idea how I could
avoid this?

Thanks.

--------
François

Home page: http://www.monpetitcoin.com/
"A fox is a wolf who sends flowers"


From: Gregory Seidman <gss+pg(at)cs(dot)brown(dot)edu>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-25 12:41:54
Message-ID: 20020425084154.A26384@jamaica.cs.brown.edu

Francois Suter sez:
} I'm running Postgres on Mac OSX (10.1.4). Every once in a while, I
} get the following problem: for some reason the postmaster seems to
} stop running postgres. When I look at the pid attributed to postgres
} (in postmaster.pid) and check it against ps -aux, I see that either
} the process doesn't exist anymore or that it has been overwritten by
} some other program (e.g. MySQL). It's not a big problem since it is
} enough to restart for the pids to get sorted (just once the problem
} happened twice in a row), but does anyone have an idea how I could
} avoid this?

You'll have to provide more information. I am running OSX 10.1.4 and both
PostgreSQL 7.1.2 and MySQL and I have never seen any such behavior. The
only way I could even imagine them interacting is if you are trying to use
the same directory for both, and even then it shouldn't happen since MySQL
and PostgreSQL use different naming schemes for their pid files.

Is it possible that PostgreSQL isn't coming up after a reboot and the pid
file just happens to have an old pid from the last boot?

} Thanks.
} François
--Greg


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Francois Suter <dba(at)paragraf(dot)ch>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-25 15:44:49
Message-ID: 200204251544.g3PFinp17743@candle.pha.pa.us

Francois Suter wrote:
> Hi,
>
> I'm running Postgres on Mac OSX (10.1.4). Every once in a while, I
> get the following problem: for some reason the postmaster seems to
> stop running postgres. When I look at the pid attributed to postgres
> (in postmaster.pid) and check it against ps -aux, I see that either
> the process doesn't exist anymore or that it has been overwritten by
> some other program (e.g. MySQL). It's not a big problem since it is
> enough to restart for the pids to get sorted (just once the problem
> happened twice in a row), but does anyone have an idea how I could
> avoid this?

That is strange. The odds that a pid would get reused by another
long-running program, and that it would be another database, are very
small.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Francois Suter <dba(at)paragraf(dot)ch>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-26 06:52:33
Message-ID: v04210100b8eea88562f7@[192.168.1.34]

>You'll have to provide more information. I am running OSX 10.1.4 and both
>PostgreSQL 7.1.2 and MySQL and I have never seen any such behavior. The
>only way I could even imagine them interacting is if you are trying to use
>the same directory for both, and even then it shouldn't happen since MySQL
>and PostgreSQL use different naming schemes for their pid files.

No, I'm definitely not using the same directory for both. As for more
info, I'm using Postgres 7.2.

>Is it possible that PostgreSQL isn't coming up after a reboot and the pid
>file just happens to have an old pid from the last boot?

It could be. I have been thinking along this line. I could imagine
the following scenario: Postgres starts after quite a few other
processes, tries to start with the pid stored in the postmaster.pid
file and actually doesn't start because the pid is already in use. Is
there an error log somewhere where such an error might appear?

Thanks.

--------
François

Home page: http://www.monpetitcoin.com/
"A fox is a wolf who sends flowers"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Francois Suter <dba(at)paragraf(dot)ch>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-26 14:46:24
Message-ID: 26046.1019832384@sss.pgh.pa.us

Francois Suter <dba(at)paragraf(dot)ch> writes:
> the following scenario: Postgres starts after quite a few other
> processes, tries to start with the pid stored in the postmaster.pid
> file and actually doesn't start because the pid is already in use.

Postgres does not "try to start with the stored pid"; that's entirely
impossible under any flavor of Unix. You get the PID the kernel assigns
you, and that's that. This could well be a problem of failure to start
up, but you're barking up the wrong tree as to why.

What is needed at this point is more observation. You need to determine
whether the postmaster is in fact starting (and later dying) or
failing to start at all --- ie, is the postmaster.pid file left over
from a previous system boot cycle? Checking the mod date of the pid
file might be enough to tell.

> Is there an error log somewhere where such an error might appear?

What are you doing with the postmaster's stderr? If your start script
for the postmaster is routing it to /dev/null, send it someplace more
helpful.

regards, tom lane
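
The mod-date check suggested above can be sketched in Python. This is only an illustration: the function name is hypothetical, and the boot time is assumed to be supplied by the caller as a Unix timestamp (on BSD-derived systems `sysctl kern.boottime` is one possible source).

```python
import os


def pidfile_predates_boot(pidfile_path, boot_time):
    """Return True if the pid file was last modified before the system booted.

    boot_time is a Unix timestamp supplied by the caller (an assumption of
    this sketch). A True result suggests the file is left over from a
    previous boot cycle, i.e. the postmaster never started this time.
    """
    return os.path.getmtime(pidfile_path) < boot_time
```

A False result would instead point at a postmaster that did start during this boot and died later.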


From: Francois Suter <dba(at)paragraf(dot)ch>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-27 10:26:09
Message-ID: v04210104b8f02ceb8918@[192.168.1.34]

Thanks for the leads. I will investigate for a while and keep you
posted if I find anything that might be of interest to everybody.

>What is needed at this point is more observation. You need to determine
>whether the postmaster is in fact starting (and later dying) or
>failing to start at all --- ie, is the postmaster.pid file left over
>from a previous system boot cycle? Checking the mod date of the pid
>file might be enough to tell.
>
>What are you doing with the postmaster's stderr? If your start script
>for the postmaster is routing it to /dev/null, send it someplace more
>helpful.

--------
François

Home page: http://www.monpetitcoin.com/
"A fox is a wolf who sends flowers"


From: Francois Suter <dba(at)paragraf(dot)ch>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-29 08:46:49
Message-ID: v04210102b8f2b789102f@[192.168.1.34]

>>What is needed at this point is more observation. You need to determine
>>whether the postmaster is in fact starting (and later dying) or
>>failing to start at all --- ie, is the postmaster.pid file left over
>>from a previous system boot cycle? Checking the mod date of the pid
>>file might be enough to tell.

The error happened again during the week-end and I was able to
collect the following from Postgres' logfile:

Lock file "/usr/local/pgsql/data/postmaster.pid" already exists.
Is another postmaster (pid 217) running in "/usr/local/pgsql/data"?

So it seems that the problem is that the postmaster.pid file can't be
overwritten. I checked the last mod date and it is indeed left over
from last startup. Any idea what could be causing this problem?

--------
François

Home page: http://www.monpetitcoin.com/
"A fox is a wolf who sends flowers"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Francois Suter <dba(at)paragraf(dot)ch>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-29 14:28:41
Message-ID: 23554.1020090521@sss.pgh.pa.us

Francois Suter <dba(at)paragraf(dot)ch> writes:
> The error happened again during the week-end and I was able to
> collect the following from Postgres' logfile:

> Lock file "/usr/local/pgsql/data/postmaster.pid" already exists.
> Is another postmaster (pid 217) running in "/usr/local/pgsql/data"?

> So it seems that the problem is that the postmaster.pid file can't be
> overwritten. I checked the last mod date and it is indeed left over
> from last startup. Any idea what could be causing this problem?

Well, it *could* be overwritten, but Postgres won't do it if it sees
that there is a process of that PID in the system.

What I think is happening is that there's some small variation in the
number or ordering of processes launched during system boot. Maybe one
time Postgres is PID 217, the next time it is PID 218 and some other
daemon happens to get 217. But if 217 is what is in the lockfile, and
we see *any* other existent process with PID 217, we cravenly refuse
to overwrite the lockfile.

I have seen this sort of thing before with other daemons --- on my
system, sendmail occasionally refuses to start after a power failure &
reboot because it has the same sort of lockfile checking behavior.

We could perhaps avoid this scenario by being a little tighter about
what we will believe is a conflicting process --- for example, if PID
217 exists but isn't our same userID, don't assume it's the old
postmaster still running. But I could easily see that cure being worse
than the disease. If it ever let us start two conflicting postmasters
in the same data directory, data corruption would be the certain result.
That's exactly what the lockfile is there to prevent.

The real problem is that the old postmaster was evidently not allowed
to shut down cleanly (else it'd have removed its lockfile). How are
you powering down the system, anyway?

regards, tom lane
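
The conservative lockfile check described in this message can be sketched roughly as follows. This is a Python illustration, not PostgreSQL's actual C code; the function names are hypothetical, and the sketch assumes the PID is stored on the first line of the lockfile.

```python
import os


def pid_in_use(pid):
    """Return True if any process with this PID exists, whoever owns it."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is sent
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def may_overwrite_lockfile(lockfile_path):
    """Allow overwriting only if the recorded PID is not alive at all."""
    try:
        with open(lockfile_path) as f:
            recorded_pid = int(f.readline().strip())
    except FileNotFoundError:
        return True      # no lockfile, nothing to conflict with
    return not pid_in_use(recorded_pid)
```

Note that `pid_in_use` cannot tell an old postmaster from an unrelated daemon that happens to have received the same PID after a reboot, which is exactly the false positive reported in this thread.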


From: Francois Suter <dba(at)paragraf(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-29 15:05:27
Message-ID: v04210106b8f30f6cb65a@[192.168.1.34]

>The real problem is that the old postmaster was evidently not allowed
>to shut down cleanly (else it'd have removed its lockfile). How are
>you powering down the system, anyway?

I'm shutting down normally (ok, I mean most of the time I press the
power-up button and choose "Shut down" rather than going via the
Apple menu). I haven't had a system crash in ages! The only
difference I can see (and I would have to test if it makes any
difference) is that sometimes I'm working stand-alone at home and
sometimes on the network in my office (I'm using a PowerBook G4), but
I'm pretty sure I don't have this problem popping up every time I go
back to the office after having used my machine at home.

Maybe there's some operation missing at shutdown. I installed
PostgreSQL using Mark Liyanage's package. Could there be something
missing? Is Postgres taking care of the removal of the postmaster.pid
file or do you have to do it yourself in some shutdown script?

Best regards.

--------
François

Home page: http://www.monpetitcoin.com/
"A fox is a wolf who sends flowers"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Francois Suter <dba(at)paragraf(dot)ch>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-29 15:25:11
Message-ID: 23934.1020093911@sss.pgh.pa.us

Francois Suter <dba(at)paragraf(dot)ch> writes:
> Maybe there's some operation missing at shutdown. I installed
> PostgreSQL using Mark Liyanage's package. Could there be something
> missing? Is Postgres taking care of the removal of the postmaster.pid
> file or do you have to do it yourself in some shutdown script?

No, you shouldn't need to do it yourself. The approved way to shut down
Pg is to send the postmaster a SIGTERM signal --- which I believe all
Unixen will do automatically during the shutdown sequence. What may be
happening is that the system is not giving the postmaster a long enough
grace period between SIGTERM and hard kill. We need a minimum of about
three seconds I believe (there's a 2-second sleep() in the checkpoint
sync code, which maybe should not be there, but it's there at the
moment). Traditionally systems have allowed 10 seconds or more to
respond to SIGTERM, but perhaps Apple thought they could shave some
time there?

regards, tom lane
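
The "SIGTERM, grace period, hard kill" sequence mentioned above can be sketched from the supervising side. This is a Python illustration under stated assumptions: `terminate_with_grace` is a hypothetical helper, it only works on the caller's own child processes, and it does not claim to reproduce what Darwin's init actually does.

```python
import os
import signal
import time


def terminate_with_grace(pid, grace_seconds, poll_interval=0.05):
    """Send SIGTERM to child `pid`, then SIGKILL if it outlives the grace period.

    Returns the raw wait status of the child.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return status          # child finished within the grace period
        time.sleep(poll_interval)
    os.kill(pid, signal.SIGKILL)   # grace period expired: hard kill
    _, status = os.waitpid(pid, 0)
    return status
```

A process that needs about three seconds to shut down cleanly survives only if `grace_seconds` is comfortably larger than that.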


From: tony <tony(at)animaproductions(dot)com>
To: Francois Suter <dba(at)paragraf(dot)ch>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: pid gets overwritten in OSX
Date: 2002-04-29 16:42:13
Message-ID: 1020098534.1767.60.camel@vaio

On Mon, 2002-04-29 at 17:05, Francois Suter wrote:
> >The real problem is that the old postmaster was evidently not allowed
> >to shut down cleanly (else it'd have removed its lockfile). How are
> >you powering down the system, anyway?
>
> I'm shutting down normally (ok, I mean most of the time I press the
> power-up button and choose "Shut down" rather than going via the
> Apple menu). I haven't had a system crash in ages! The only
> difference I can see (and I would have to test if it makes any
> difference) is that sometimes I'm working stand-alone at home and
> sometimes on the network in my office (I'm using a PowerBook G4), but
> I'm pretty sure I don't have this problem popping up every time I go
> back to the office after having used my machine at home.
>
> Maybe there's some operation missing at shutdown. I installed
> PostgreSQL using Mark Liyanage's package. Could there be something
> missing? Is Postgres taking care of the removal of the postmaster.pid
> file or do you have to do it yourself in some shutdown script?

François

I would definitely quit postgres before shutting down. And Mac OS X does
not in my experience like working in "offline" mode. I had all sorts of
problems getting networking set up right in that mode. All my problems
disappeared when the machine was plugged in to the ADSL router...

I would say that DNS could be an issue here.

Cheers

Tony Grant

--
RedHat Linux on Sony Vaio C1XD/S
http://www.animaproductions.com/linux2.html
Macromedia UltraDev with PostgreSQL
http://www.animaproductions.com/ultra.html


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Francois Suter <dba(at)paragraf(dot)ch>
Subject: Mac OS X: system shutdown prevents checkpoint
Date: 2002-04-30 05:26:26
Message-ID: 17395.1020144386@sss.pgh.pa.us

I've been looking into Francois Suter's recent reports of Postgres not
shutting down cleanly on Mac OS X 10.1. I find that it's quite
reproducible. If you tell the system to shut down in the normal
fashion (eg, pick "Shut Down" from the Apple menu), the postmaster
does not terminate, leading to WAL recovery upon restart --- or
even worse, failure to restart if the postmaster PID recorded in the
lockfile happens to get assigned to some other daemon.

Observe the normal trace of postmaster shutdown (running with -d4,
logging of timestamps and PIDs enabled):

2002-04-30 00:08:30 [315] DEBUG: pmdie 15
2002-04-30 00:08:30 [315] DEBUG: smart shutdown request
2002-04-30 00:08:30 [331] DEBUG: shutting down
2002-04-30 00:08:32 [331] DEBUG: database system is shut down
2002-04-30 00:08:32 [331] DEBUG: proc_exit(0)
2002-04-30 00:08:32 [331] DEBUG: shmem_exit(0)
2002-04-30 00:08:32 [331] DEBUG: exit(0)
2002-04-30 00:08:32 [315] DEBUG: reaping dead processes
2002-04-30 00:08:32 [315] DEBUG: proc_exit(0)
2002-04-30 00:08:32 [315] DEBUG: shmem_exit(0)
2002-04-30 00:08:32 [315] DEBUG: exit(0)

The postmaster (here PID 315) forks a subprocess to flush shared buffers
and checkpoint the WAL log. When the subprocess exits, the postmaster
removes its lockfile and shuts down. The subprocess takes a minimum of
2 seconds because there's a sleep(2) in the checkpoint fsync code.

Now here's what I see in the case of shutting down the OS X system:

2002-04-30 00:25:35 [376] DEBUG: pmdie 15
2002-04-30 00:25:35 [376] DEBUG: smart shutdown request

... and nothing more. Actual system shutdown (power down) occurred at
approximately 00:26:06 by my watch, over thirty seconds later than the
postmaster received SIGTERM. So there was plenty of time to do the
checkpoint subprocess. (Indeed, I believe that thirty seconds is the
grace period Darwin's init process allows SIGTERM'd processes before
giving up and hard-killing them. So the system was actually sitting and
waiting for the postmaster.)

What we appear to have here is that the kernel is not allowing the
postmaster to fork a checkpoint subprocess. But there's no indication
that the postmaster got a fork() error return, either. Seems like it's
just hung.

Does this ring a bell with anyone? Is it an OSX bug, or a "feature";
and if the latter, how can we work around it?

regards, tom lane
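
The normal shutdown sequence in the first trace (fork a subprocess, wait for it, then remove the lockfile) might look like this in outline. It is a Python sketch, not PostgreSQL's C code; `smart_shutdown` and the `checkpoint` callable are hypothetical names.

```python
import os


def smart_shutdown(lockfile_path, checkpoint):
    """Run `checkpoint` in a forked child; remove the lockfile only on success."""
    pid = os.fork()
    if pid == 0:
        # Child: do the checkpoint work and report the outcome via exit code.
        code = 0
        try:
            checkpoint()
        except BaseException:
            code = 1
        os._exit(code)
    # Parent: block until the child is reaped.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
        os.remove(lockfile_path)
        return True
    return False  # lockfile is left behind, as after an unclean shutdown
```

In the failing OS X trace the parent never gets past the equivalent of `os.waitpid`, so the lockfile is never removed.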


From: "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Cc: "Francois Suter" <dba(at)paragraf(dot)ch>
Subject: Re: Mac OS X: system shutdown prevents checkpoint
Date: 2002-04-30 06:30:19
Message-ID: GNELIHDDFBOCMGBFGEFOAEFICCAA.chriskl@familyhealth.com.au

I showed this to my friend who's a FreeBSD committer (Adrian Chadd) and he's
actually setting up a MacOS/X box at the moment and will look into it -
assuming you don't discover the problem first...

Chris



From: Peter Bierman <bierman(at)apple(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mac OS X: system shutdown prevents checkpoint
Date: 2002-05-01 21:51:52
Message-ID: v03130303b8f611e5d925@[17.202.21.231]

At 1:26 AM -0400 4/30/02, Tom Lane wrote:
>I've been looking into Francois Suter's recent reports of Postgres not
>shutting down cleanly on Mac OS X 10.1.
>
>Now here's what I see in the case of shutting down the OS X system:
>
>2002-04-30 00:25:35 [376] DEBUG: pmdie 15
>2002-04-30 00:25:35 [376] DEBUG: smart shutdown request
>
>... and nothing more. Actual system shutdown (power down) occurred at
>approximately 00:26:06 by my watch, over thirty seconds later than the
>postmaster received SIGTERM. So there was plenty of time to do the
>checkpoint subprocess. (Indeed, I believe that thirty seconds is the
>grace period Darwin's init process allows SIGTERM'd processes before
>giving up and hard-killing them. So the system was actually sitting and
>waiting for the postmaster.)
>
>What we appear to have here is that the kernel is not allowing the
>postmaster to fork a checkpoint subprocess. But there's no indication
>that the postmaster got a fork() error return, either. Seems like it's
>just hung.

Unfortunately, I don't have time right now to look into this myself, and
because I just moved, I don't have a machine I can give someone an account
on to try it themselves (PacBell says 20 days for DSL xfer). But I asked
around, and got a pair of tips from the Mac OS X Core OS group. If you want
to converse with either of the people named below, they're both active on
the darwin-development mailing list.
(http://lists.apple.com/mailman/listinfo/darwin-development)

-pmb

At 1:52 PM -0700 5/1/02, Jim Magee wrote:
>On Wednesday, May 1, 2002, at 01:34 PM, Peter Bierman wrote:
>
>> Is fork() disallowed after shutdown starts?
>
>No, it's allowed. But, depending upon timing, the new process may be
>hammered with a SIGTERM right away (maybe even before main()). It is
>always very tricky to fork() as the result of a daemon getting a signal.
>They are often process group leader, and so their children may get the
>same signal they just got.
>
>POSIX is very ambiguous on whether a new process in the group should also
>get the signal while we're still delivering them, or whether it shouldn't
>(because it wasn't in the group at the time the signal was first
>delivered). Both choices have their problems, and so developers have to
>deal with either case. Do you have signals masked off correctly before
>the fork()/exec()?
>
>Is fork really returning a PID in the parent, and it just looks like the
>child didn't make it to returning from its fork() call? There are some
>preparation things that happen in dyld and libc as part of returning from
>fork in the child, and these run before we make it look like fork()
>returned in the child. If they encounter an error (maybe because the
>services they need to talk to are no longer available), they have nothing
>else to do but call _exit() - making it look like the child never returned
>from fork().
>
>But in either the dyld/libc exit case, or the signal case, the parent
>should get a wait result indicating why the child went away so
>prematurely. If it was an exit(), maybe using vfork() will yield better
>results, as there is no need for child-side setup in the vfork() case.
>
>--Jim

At 2:01 PM -0700 5/1/02, Matt Watson wrote:
>
>It could be that the child has blocked trying to contact a dead lookupd.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Bierman <bierman(at)apple(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mac OS X: system shutdown prevents checkpoint
Date: 2002-05-02 04:45:19
Message-ID: 4752.1020314719@sss.pgh.pa.us

Peter Bierman <bierman(at)apple(dot)com> writes:
> Is fork() disallowed after shutdown starts?
>>
>> No, it's allowed. But, depending upon timing, the new process may be
>> hammered with a SIGTERM right away (maybe even before main()).

Good point. The fork is executed with SIGTERM blocked --- but the
checkpoint child process currently will enable SIGTERM shortly after
being forked. On reflection that seems like a bad idea; probably the
checkpoint process should ignore SIGTERM so that it won't get killed
prematurely during system shutdown.

However, that doesn't explain our OS X problem. I added some debug
printouts, and can now report positively that (a) the fork() call
returns normally in the parent process, providing an apparently-correct
child PID value; but (b) the fork never returns in the child. It
doesn't ever get as far as trying to enable SIGTERM.

>> Is fork really returning a PID in the parent, and it just looks like the
>> child didn't make it to returning from its fork() call? There are some
>> preparation things that happen in dyld and libc as part of returning from
>> fork in the child, and these run before we make it look like fork()
>> returned in the child. If they encounter an error (maybe because the
>> services they need to talk to are no longer available), they have nothing
>> else to do but call _exit() - making it look like the child never returned
>> from fork().

Hmmm ... that seems very close to what I'm seeing.

>> But in either the dyld/libc exit case, or the signal case, the parent
>> should get a wait result indicating why the child went away so
>> prematurely.

The parent is not getting any wait() result indicating that its child died.
(If it were, we'd not have the problem being complained of.)

Is it possible that something in the child's fork() processing will wait
around for a response from a service that's already died? Why is fork()
dependent on any outside service whatever --- isn't that a certain
recipe for system failures?

regards, tom lane
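
The idea floated above (fork with SIGTERM blocked, then have the child ignore SIGTERM rather than re-enable it) can be sketched as follows. This is a Python illustration, not the PostgreSQL change itself; `fork_shielded_from_sigterm` and `child_work` are hypothetical names.

```python
import os
import signal


def fork_shielded_from_sigterm(child_work):
    """Fork a child that ignores SIGTERM; return the child's PID to the parent."""
    # Block SIGTERM so it cannot be delivered to either process mid-fork.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    pid = os.fork()
    if pid == 0:
        # Child: ignore SIGTERM *before* unblocking, so a SIGTERM sent
        # during system shutdown cannot kill the work in progress.
        signal.signal(signal.SIGTERM, signal.SIG_IGN)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        code = 0
        try:
            child_work()
        except BaseException:
            code = 1
        os._exit(code)
    # Parent: restore the original signal mask and carry on.
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return pid
```

As noted in the message, this would not cure the OS X hang, since there the child never returns from fork() at all.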


From: sugita(at)sra(dot)co(dot)jp
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: bierman(at)apple(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mac OS X: system shutdown prevents checkpoint
Date: 2002-07-16 01:22:14
Message-ID: 20020716.102214.28791430.sugita@sra.co.jp

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Thu, 02 May 2002 00:45:19 -0400

;;; However, that doesn't explain our OS X problem. I added some debug
;;; printouts, and can now report positively that (a) the fork() call
;;; returns normally in the parent process, providing an apparently-correct
;;; child PID value; but (b) the fork never returns in the child. It
;;; doesn't ever get as far as trying to enable SIGTERM.
;;;
;;; Is it possible that something in the child's fork() processing will wait
;;; around for a response from a service that's already died? Why is fork()
;;; dependent on any outside service whatever --- isn't that a certain
;;; recipe for system failures?

I asked Apple about this issue. It is a bug in Mac OS X. The problem has been
registered in their bug database for the appropriate engineers to investigate.

Kenji Sugita