Re: [bug fix] "pg_ctl stop" times out when it should respond quickly

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "MauMau" <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [bug fix] "pg_ctl stop" times out when it should respond quickly
Date: 2013-12-03 22:35:29
Message-ID: 30805.1386110129@sss.pgh.pa.us
Lists: pgsql-hackers

"MauMau" <maumau307(at)gmail(dot)com> writes:
> The problem occurs in the sequence below:

> 1. postmaster creates $PGDATA/postmaster.pid.
> 2. postmaster tries to resolve the value of listen_addresses to IP
> addresses. This took about 15 seconds in my failure scenario.
> 3. During 2, pg_ctl sends SIGTERM to postmaster.
> 4. postmaster terminates immediately without deleting
> $PGDATA/postmaster.pid. This is because it hasn't set signal handlers yet.
> 5. "pg_ctl stop" waits in a loop until $PGDATA/postmaster.pid disappears.
> But the file does not disappear and it times out.

Hm. I wonder if we shouldn't block SIGTERM etc. earlier. It hardly seems
improbable that such signals would arrive during a slow startup.

> *** 907,913 ****
>
>       for (cnt = 0; cnt < wait_seconds; cnt++)
>       {
> !         if ((pid = get_pgpid()) != 0)
>           {
>               print_msg(".");
>               pg_usleep(1000000); /* 1 sec */
> --- 907,914 ----
>
>       for (cnt = 0; cnt < wait_seconds; cnt++)
>       {
> !         if ((pid = get_pgpid()) != 0 &&
> !             postmaster_is_alive((pid_t) pid))
>           {
>               print_msg(".");
>               pg_usleep(1000000); /* 1 sec */

If you're going to do a postmaster_is_alive check, why bother with
repeated get_pgpid()?

I think the reason why it was coded like that was that we hadn't written
postmaster_is_alive() yet, or maybe we had but didn't want to trust it.
However, with the coding you have here, we're fully exposed to any failure
modes postmaster_is_alive() may have; so there's not a lot of value in
accepting those and get_pgpid's failure modes too.
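
For illustration, a rough sketch of a wait loop that reads the PID once and then relies only on a liveness probe. Both process_is_alive and wait_for_exit are hypothetical helpers written for this example, with kill(pid, 0) assumed as the probe; this is not pg_ctl's code and makes no claim about how postmaster_is_alive() is implemented:

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical liveness probe: signal 0 performs only the existence and
 * permission checks and delivers nothing. */
bool process_is_alive(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;
    /* EPERM means the process exists but belongs to another user. */
    return errno == EPERM;
}

/* The caller reads the PID once; the loop then polls liveness only and
 * never re-reads the pid file. Returns true if the process went away
 * within wait_seconds, false on timeout. */
bool wait_for_exit(pid_t pid, int wait_seconds)
{
    for (int cnt = 0; cnt < wait_seconds; cnt++)
    {
        if (!process_is_alive(pid))
            return true;
        sleep(1);               /* 1 sec between probes */
    }
    return !process_is_alive(pid);
}
```

With this shape, a stale pid file left behind by a crashed process no longer keeps the loop spinning, at the cost of trusting the probe completely.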

regards, tom lane
