Re: 9.4 HEAD: select() failed in postmaster

From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
Cc: "Noah Misch" <noah(at)leadboat(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 HEAD: select() failed in postmaster
Date: 2013-09-13 08:00:58
Message-ID: 53F0692AB35345348E29FB46F33127E3@maumau
Lists: pgsql-hackers

From: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
--------------------------------------------------
I've implemented the Min to Max change and did some more testing. Now I
have a different but related problem (which I also saw before, but less
often than the select() one). The 5 second clock doesn't get turned off.
So after all processes end, and a new startup is launched, if that startup
doesn't report back to the postmaster soon enough, it gets SIGKILLED.

postmaster.c near line 1681

if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but
which states to test I don't really know.

I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so
I think that this is on the right path but testing an enum for inequality
feels wrong.
--------------------------------------------------

A check for "AbortStartTime > 0" is also necessary, so that SIGKILL is not
sent repeatedly. I sent the attached patch during the original discussion.
The fragment below is the relevant part:

--- 1663,1688 ----
TouchSocketLockFiles();
last_touch_time = now;
}
+
+ /*
+ * When postmaster got an immediate shutdown request
+ * or some child terminated abnormally (FatalError case),
+ * postmaster sends SIGQUIT to all children except
+ * syslogger and dead_end ones, then wait for them to terminate.
+ * If some children didn't terminate within a certain amount of time,
+ * postmaster sends SIGKILL to them and wait again.
+ * This resolves, for example, the hang situation where
+ * a backend gets stuck in the call chain:
+ * free() acquires some lock -> <received SIGQUIT> ->
+ * quickdie() -> ereport() -> gettext() -> malloc() -> <lock acquisition>
+ */
+ if (AbortStartTime > 0 && /* SIGKILL only once */
+ (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
+ now - AbortStartTime >= 10)
+ {
+ SignalAllChildren(SIGKILL);
+ AbortStartTime = 0;
+ }
}
}
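
To make the "SIGKILL only once" guard easier to see in isolation, here is a
minimal standalone sketch. It is not part of the patch and does not use the
real postmaster.c code: PmSketch, maybe_escalate(), kills_sent and the
timeout_secs parameter are hypothetical names made up for illustration, and
the counter stands in for SignalAllChildren(SIGKILL). The only thing it is
meant to show is that clearing AbortStartTime after escalating disarms the
timer, so later loop iterations do nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Simplified stand-in for the postmaster's shutdown modes. */
typedef enum
{
    NoShutdown,
    SmartShutdown,
    FastShutdown,
    ImmediateShutdown
} ShutdownMode;

/* Hypothetical bundle of the variables the condition reads. */
typedef struct
{
    ShutdownMode shutdown;
    bool         fatal_error;       /* plays the role of FatalError */
    bool         send_stop;         /* plays the role of SendStop */
    time_t       abort_start_time;  /* 0 means "timer not armed" */
    int          kills_sent;        /* stand-in for SignalAllChildren(SIGKILL) */
} PmSketch;

/*
 * Escalate to SIGKILL at most once per abort.  Returns true if the
 * (simulated) SIGKILL was sent on this call.
 */
static bool
maybe_escalate(PmSketch *pm, time_t now, int timeout_secs)
{
    assert(pm != NULL);

    if (pm->abort_start_time > 0 &&     /* SIGKILL only once */
        (pm->shutdown == ImmediateShutdown ||
         (pm->fatal_error && !pm->send_stop)) &&
        now - pm->abort_start_time >= timeout_secs)
    {
        pm->kills_sent++;               /* assumption: real code signals children here */
        pm->abort_start_time = 0;       /* disarm so we do not repeat */
        return true;
    }
    return false;
}
```

With the timer armed at time 1000 and a timeout of 10, a call at 1005 does
nothing, a call at 1010 escalates, and any later call is a no-op because
abort_start_time is already 0. That same reset is also what should keep a
newly launched startup process from being hit by a stale timer, provided the
timer is only re-armed when a new abort actually begins.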

Regards
MauMau

Attachment Content-Type Size
reliable_immediate_shutdown.patch application/octet-stream 8.9 KB
