9.4 HEAD: select() failed in postmaster

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 9.4 HEAD: select() failed in postmaster
Date: 2013-09-13 00:13:59
Message-ID: CAMkU=1zqrj-r4u0EMWUzUbrAbnRBwi-SHsVf=xU7VvZzUu4zyg@mail.gmail.com
Lists: pgsql-hackers

On Wednesday, September 11, 2013, Alvaro Herrera wrote:

> Noah Misch wrote:
> > On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:
>
> > > I think the problem is here, where there should be a Max rather than a Min:
> > >
> > > commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
> > > Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> > > Date: Fri Jun 28 17:20:53 2013 -0400
> > >
> > > Send SIGKILL to children if they don't die quickly in immediate shutdown
> > >
> > > ...
> > >
> > > + /* remaining time, but at least 1 second */
> > > + timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
> > > + (time(NULL) - AbortStartTime), 1);
> >
> > Agreed; good catch.
>
> Yeah, thanks. Should be a Max(). The current coding presumably makes
> it use one second most of the time, instead of whatever the remaining
> time is ... until the abort time is past, in which case it causes the
> whole thing to break down as reported.
>
> It might very well be that I used Max() there initially and changed to
> Min() at the last minute before commit in a moment of brain fade.
>

I've made the Min-to-Max change and done some more testing. Now I see a
different but related problem (which I also saw before, but less often than
the select() one): the 5-second clock doesn't get turned off. So after all
processes have exited and a new startup process is launched, if that startup
process doesn't report back to the postmaster soon enough, it gets SIGKILLed.
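
For reference, the Min/Max difference discussed above can be sketched in isolation. This is only an illustration, not the postmaster code: the helper names are made up, the macros are written out locally, and the value 5 for SIGKILL_CHILDREN_AFTER_SECS is assumed from the "5 second clock" mentioned here.

```c
#include <time.h>

/* Local stand-ins for the usual Min/Max macros (assumed definitions). */
#define Min(x, y) ((x) < (y) ? (x) : (y))
#define Max(x, y) ((x) > (y) ? (x) : (y))

/* Assumed value, based on the 5-second clock described in this thread. */
#define SIGKILL_CHILDREN_AFTER_SECS 5

/*
 * Hypothetical helper mirroring the committed expression: the result is
 * capped at 1 second and goes negative once the deadline has passed.
 */
static long
timeout_with_min(time_t now, time_t abort_start)
{
    return Min(SIGKILL_CHILDREN_AFTER_SECS - (long) (now - abort_start), 1);
}

/*
 * Hypothetical helper with the proposed fix: the remaining time, but
 * never less than 1 second, as the original comment intended.
 */
static long
timeout_with_max(time_t now, time_t abort_start)
{
    return Max(SIGKILL_CHILDREN_AFTER_SECS - (long) (now - abort_start), 1);
}
```

With the Min version, the value handed to tv_sec becomes negative as soon as more than 5 seconds have elapsed, and a negative tv_sec is not a valid timeout for select(), which would fit the reported failure.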

In postmaster.c, near line 1681:

    if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
        now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs an additional AND-ed test of pmState, but I don't
really know which states to test.

I've added "&& (pmState > PM_RUN)" and have not had any more failures, so
I think this is on the right path, but testing an enum with an ordering
comparison feels wrong.
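
One way to avoid the ordering comparison would be to name the states explicitly. The following is only a sketch under assumptions: the enum is an abbreviated stand-in for the real PMState in postmaster.c, both helper functions are hypothetical, and which states belong in the list is exactly the open question.

```c
#include <stdbool.h>

/* Abbreviated, assumed state list; the real PMState enum has more members. */
typedef enum
{
    PM_INIT,
    PM_STARTUP,
    PM_RECOVERY,
    PM_RUN,
    PM_WAIT_BACKENDS,
    PM_SHUTDOWN,
    PM_WAIT_DEAD_END,
    PM_NO_CHILDREN
} PMState;

/* Assumed value, based on the 5-second clock described in this thread. */
#define SIGKILL_CHILDREN_AFTER_SECS 5

/*
 * Hypothetical helper: true only for states in which children are
 * expected to be exiting. The choice of states here is a guess.
 */
static bool
state_allows_sigkill(PMState state)
{
    switch (state)
    {
        case PM_WAIT_BACKENDS:
        case PM_SHUTDOWN:
        case PM_WAIT_DEAD_END:
        case PM_NO_CHILDREN:
            return true;
        default:
            return false;
    }
}

/*
 * Hypothetical stand-in for the condition near line 1681, with the
 * Shutdown/FatalError part collapsed into one flag for brevity.
 */
static bool
should_sigkill_children(bool shutting_down_or_fatal, long elapsed_secs,
                        PMState state)
{
    return shutting_down_or_fatal &&
        elapsed_secs >= SIGKILL_CHILDREN_AFTER_SECS &&
        state_allows_sigkill(state);
}
```

Written this way, a freshly launched startup process (PM_STARTUP in this sketch) would not be killed by a stale clock, and the test no longer depends on the order in which the enum members happen to be declared.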

Alternatively perhaps FatalError can get cleared when startup is launched,
rather than when WAL replay begins. But I assume it was done the way it is
for a reason, even though I don't know that reason.

Cheers,

Jeff
