9.4 HEAD: select() failed in postmaster

Lists: pgsql-hackers
From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 9.4 HEAD: select() failed in postmaster
Date: 2013-09-11 00:18:21
Message-ID: CAMkU=1xd3=wFqZwwuXPWe4BQs3h1seYo8LV9JtSjW5RodoPxMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been getting some failures after an immediate shutdown or crash,
during severe IO stress, with the message:

LOG: XX000: select() failed in postmaster: Invalid argument
LOCATION: ServerLoop, postmaster.c:1560

It is trying to sleep for -1 seconds.

I think the problem is here, where there should be a Max rather than a Min:

commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Date: Fri Jun 28 17:20:53 2013 -0400

Send SIGKILL to children if they don't die quickly in immediate shutdown

...

+ /* remaining time, but at least 1 second */
+ timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
+ (time(NULL) - AbortStartTime), 1);

But I don't understand the logic behind this anyway. Why sleep at least 1
second? If time is up, it is up, why not use zero as the minimum?

Cheers,

Jeff


From: Noah Misch <noah(at)leadboat(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, alvherre(at)2ndquadrant(dot)com
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-09-11 23:54:19
Message-ID: 20130911235419.GA243762@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:
> I've been getting some failures after an immediate shutdown or crash,
> during severe IO stress, with the message:
>
> LOG: XX000: select() failed in postmaster: Invalid argument
> LOCATION: ServerLoop, postmaster.c:1560
>
> It is trying to sleep for -1 seconds.
>
> I think the problem is here, where there should be a Max rather than a Min:
>
> commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
> Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> Date: Fri Jun 28 17:20:53 2013 -0400
>
> Send SIGKILL to children if they don't die quickly in immediate shutdown
>
> ...
>
> + /* remaining time, but at least 1 second */
> + timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
> + (time(NULL) - AbortStartTime), 1);

Agreed; good catch.

> But I don't understand the logic behind this anyway. Why sleep at least 1
> second? If time is up, it is up, why not use zero as the minimum?

Offhand, clamping to zero does make more sense to me. It looks like Alvaro
added that bit in his pre-commit edits. Alvaro?

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-09-12 01:04:06
Message-ID: 20130912010406.GE12028@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Noah Misch escribió:
> On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:

> > I think the problem is here, where there should be a Max rather than a Min:
> >
> > commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
> > Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> > Date: Fri Jun 28 17:20:53 2013 -0400
> >
> > Send SIGKILL to children if they don't die quickly in immediate shutdown
> >
> > ...
> >
> > + /* remaining time, but at least 1 second */
> > + timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
> > + (time(NULL) - AbortStartTime), 1);
>
> Agreed; good catch.

Yeah, thanks. Should be a Max(). The current coding presumably makes
it use one second most of the time, instead of whatever the remaining
time is ... until the abort time is past, in which case it causes the
whole thing to break down as reported.

It might very well be that I used Max() there initially and changed to
Min() at the last minute before commit in a moment of brain fade.

> > But I don't understand the logic behind this anyway. Why sleep at least 1
> > second? If time is up, it is up, why not use zero as the minimum?
>
> Offhand, clamping to zero does make more sense to me. It looks like Alvaro
> added that bit in his pre-commit edits. Alvaro?

Sadly, I don't have the developing branch for this feature anymore, so I
have to go from memory. IIRC my thinking here is that if I make select
terminate immediately (timeout 0) then the time arithmetic in
ServerLoop() might lead us to decide not to send SIGKILL at that time,
causing one more iteration of that loop. Thinking about it again, that
argument doesn't seem to hold much water; but the time variables being
in integer seconds led me to add that.

I will fix it to Max( ..., 0).

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 9.4 HEAD: select() failed in postmaster
Date: 2013-09-13 00:13:59
Message-ID: CAMkU=1zqrj-r4u0EMWUzUbrAbnRBwi-SHsVf=xU7VvZzUu4zyg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wednesday, September 11, 2013, Alvaro Herrera wrote:

> Noah Misch escribió:
> > On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:
>
> > > I think the problem is here, where there should be a Max rather than a
> Min:
> > >
> > > commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
> > > Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> > > Date: Fri Jun 28 17:20:53 2013 -0400
> > >
> > > Send SIGKILL to children if they don't die quickly in immediate
> shutdown
> > >
> > > ...
> > >
> > > + /* remaining time, but at least 1 second */
> > > + timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
> > > + (time(NULL) - AbortStartTime), 1);
> >
> > Agreed; good catch.
>
> Yeah, thanks. Should be a Max(). The current coding presumably makes
> it use one second most of the time, instead of whatever the remaining
> time is ... until the abort time is past, in which case it causes the
> whole thing to break down as reported.
>
> It might very well be that I used Max() there initially and changed to
> Min() at the last minute before commit in a moment of brain fade.
>

I've implemented the Min to Max change and did some more testing. Now I
have a different but related problem (which I also saw before, but less
often than the select() one). The 5 second clock doesn't get turned off.
So after all processes end, and a new startup is launched, if that startup
doesn't report back to the postmaster soon enough, it gets SIGKILLED.

postmaster.c near line 1681

if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.

I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so
I think that this is on the right path but testing an enum for inequality
feels wrong.

Alternatively perhaps FatalError can get cleared when startup is launched,
rather than when WAL replay begins. But I assume it was done the way it is
for a reason, even though I don't know that reason.

Cheers,

Jeff

>
>


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
Cc: "Noah Misch" <noah(at)leadboat(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-09-13 08:00:58
Message-ID: 53F0692AB35345348E29FB46F33127E3@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
--------------------------------------------------
I've implemented the Min to Max change and did some more testing. Now I
have a different but related problem (which I also saw before, but less
often than the select() one). The 5 second clock doesn't get turned off.
So after all processes end, and a new startup is launched, if that startup
doesn't report back to the postmaster soon enough, it gets SIGKILLED.

postmaster.c near line 1681

if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.

I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so
I think that this is on the right path but testing an enum for inequality
feels wrong.
--------------------------------------------------

"AbortStartTime > 0" is also necessary to avoid sending SIGKILL repeatedly.
I sent the attached patch during the original discussion. The below
fragment is relevant:

--- 1663,1688 ----
TouchSocketLockFiles();
last_touch_time = now;
}
+
+ /*
+ * When postmaster got an immediate shutdown request
+ * or some child terminated abnormally (FatalError case),
+ * postmaster sends SIGQUIT to all children except
+ * syslogger and dead_end ones, then wait for them to terminate.
+ * If some children didn't terminate within a certain amount of time,
+ * postmaster sends SIGKILL to them and wait again.
+ * This resolves, for example, the hang situation where
+ * a backend gets stuck in the call chain:
+ * free() acquires some lock -> <received SIGQUIT> ->
+ * quickdie() -> ereport() -> gettext() -> malloc() -> <lock
acquisition>
+ */
+ if (AbortStartTime > 0 && /* SIGKILL only once */
+ (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
+ now - AbortStartTime >= 10)
+ {
+ SignalAllChildren(SIGKILL);
+ AbortStartTime = 0;
+ }
}
}

Regards
MauMau

Attachment Content-Type Size
reliable_immediate_shutdown.patch application/octet-stream 8.9 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-10-01 21:36:02
Message-ID: 20131001213602.GM5235@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

MauMau escribió:

> "AbortStartTime > 0" is also necessary to avoid sending SIGKILL
> repeatedly. I sent the attached patch during the original
> discussion. The below fragment is relevant:

Can you please send a fixup patch to what's already committed?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
Cc: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Noah Misch" <noah(at)leadboat(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-10-03 12:34:24
Message-ID: AD52C965C5CC46A892B118593CFB4453@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
> Can you please send a fixup patch to what's already committed?

OK, I'll send a patch against HEAD, which will be a few lines. Am I
understanding the meaning of "fixup patch"?

Regards
MauMau


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-10-03 13:23:26
Message-ID: 20131003132326.GQ5235@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

MauMau escribió:
> From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
> >Can you please send a fixup patch to what's already committed?
>
> OK, I'll send a patch against HEAD, which will be a few lines. Am I
> understanding the meaning of "fixup patch"?

Yep, thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
Cc: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Noah Misch" <noah(at)leadboat(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-10-04 14:31:48
Message-ID: CD1D8FF5ED0B4E35AABE639EF0AC26BB@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>> From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
>> >Can you please send a fixup patch to what's already committed?
>>
>> OK, I'll send a patch against HEAD, which will be a few lines. Am I
>> understanding the meaning of "fixup patch"?
>
> Yep, thanks.

Please use the attached patch. Thanks.

Regards
MauMau

Attachment Content-Type Size
reliable_immediate_shutdown_fixup.patch application/octet-stream 2.0 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-10-06 02:54:38
Message-ID: 20131006025438.GA6135@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

MauMau escribió:
> >>From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
> >>>Can you please send a fixup patch to what's already committed?
> >>
> >>OK, I'll send a patch against HEAD, which will be a few lines. Am I
> >>understanding the meaning of "fixup patch"?
> >
> >Yep, thanks.
>
> Please use the attached patch. Thanks.

Pushed, thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-10-21 17:45:36
Message-ID: CAMkU=1x+HENHnyQThb2GDKjy==Z0G1S8mue_Q3idtWEQb22JzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 5, 2013 at 7:54 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>wrote:

> MauMau escribió:
> > >>From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
> > >>>Can you please send a fixup patch to what's already committed?
> > >>
> > >>OK, I'll send a patch against HEAD, which will be a few lines. Am I
> > >>understanding the meaning of "fixup patch"?
> > >
> > >Yep, thanks.
> >
> > Please use the attached patch. Thanks.
>
> Pushed, thanks.
>
>
Thanks everyone. I'm now unable to break it.

Cheers,

Jeff