Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
Date: 2013-06-20 16:33:25
Message-ID: 20130620163325.GC4724@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

MauMau escribió:
> First, thank you for the review.
>
> From: "Alvaro Herrera" <alvherre(at)2ndquadrant(dot)com>
> >This seems reasonable. Why 10 seconds? We could wait 5 seconds, or 15.
> >Is there a rationale behind the 10? If we said 60, that would fit
> >perfectly well within the already existing 60-second loop in postmaster,
> >but that seems way too long.
>
> There is no good rationale. I arbitrarily chose a short period
> because this is "immediate" shutdown. I felt more than 10 second
> was long. I think 5 second may be better. Although not directly
> related to this fix, these influenced my choice:
>
> 1. According to the man page of init, init sends SIGKILL to all
> remaining processes 5 seconds after it sends SIGTERM to them.
>
> 2. At computer shutdown, Windows proceeds shutdown forcibly after
> waiting for services to terminate 20 seconds.

Well, as for the first of these, I don't think it matters whether
processes get their SIGKILL from postmaster or from init, when a
shutdown or runlevel is changing. Now, one would think that this sets a
precedent. I think 20 seconds might be too long; perhaps it's expected
in an operating system that forcibly has a GUI. I feel safe to ignore
that.

I will go with 5 seconds, then.

> >Note that the previous text said that postmaster will send SIGQUIT, then
> >terminate without checking anything. In the new code, postmaster sends
> >SIGQUIT, then waits, then SIGKILL, then waits again. If there's an
> >unkillable process (say because it's stuck in a noninterruptible sleep)
> >postmaster might never exit. I think it should send SIGQUIT, then wait,
> >then SIGKILL, then exit without checking.
>
> At first I thought the same, but decided not to do that. The
> purpose of this patch is to make the immediate shutdown "reliable".

My point is that there is no difference. For one thing, once we enter
immediate shutdown state, and sigkill has been sent, no further action
is taken. Postmaster will just sit there indefinitely until processes
are gone. If we were to make it repeat SIGKILL until they die, that
would be different. However, repeating SIGKILL is pointless, because it
they didn't die when they first received it, they will still not die
when they receive it second. Also, if they're in uninterruptible sleep
and don't die, then they will die as soon as they get out of that state;
no further queries will get processed, no further memory access will be
done. So there's no harm in they remaining there until underlying
storage returns to life, ISTM.

> Here, "reliable" means that the database server is certainly shut
> down when pg_ctl returns, not telling a lie that "I shut down the
> server processes for you, so you do not have to be worried that some
> postgres process might still remain and write to disk". I suppose
> reliable shutdown is crucial especially in HA cluster. If pg_ctl
> stop -mi gets stuck forever when there is an unkillable process (in
> what situations does this happen? OS bug, or NFS hard mount?), I
> think the DBA has to notice this situation from the unfinished
> pg_ctl, investigate the cause, and take corrective action.

So you're suggesting that keeping postmaster up is a useful sign that
the shutdown is not going well? I'm not really sure about this. What
do others think?

> >I have tweaked the patch a bit and I'm about to commit as soon as we
> >resolve the above two items.
>
> I appreciate your tweaking, especially in the documentation and
> source code comments, as English is not my mother tongue.

IIRC the only other interesting tweak I did was rename the
SignalAllChildren() function to TerminateChildren(). I did this because
it doesn't really signal all children; syslogger and dead_end backends
are kept around. So the original name was a bit misleading. And we
couldn't really name it SignalAlmostAllChildren(), could we ..

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2013-06-20 17:46:38 Re: pgbench --startup option
Previous Message Peter Eisentraut 2013-06-20 16:24:05 Re: Patch to add support of "IF NOT EXISTS" to others "CREATE" statements