Re: [COMMITTERS] pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup

Lists: pgsql-committerspgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup
Date: 2011-05-27 18:13:46
Message-ID: E1QQ1Xe-00071t-V8@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Improve corner cases in pg_ctl's new wait-for-postmaster-startup code.

With "-w -t 0", we should report "still starting up", not "ok". If we
fall out of the loop without ever being able to call PQping (because we
were never able to construct a connection string), report "no response",
not "ok". This gets rid of corner cases in which we'd claim the server
had started even though it had not.

Also, if the postmaster.pid file is not there at any point after we've
waited 5 seconds, assume the postmaster has failed and report that, rather
than almost-certainly-fruitlessly continuing to wait. The pidfile should
appear almost instantly even when there is extensive startup work to do,
so 5 seconds is already a very conservative figure. This part is per a
gripe from MauMau --- there might be better ways to do it, but nothing
simple enough to get done for 9.1.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/0bae3bc9be4a025df089f0a0c2f547fa538a97bc

Modified Files
--------------
src/bin/pg_ctl/pg_ctl.c | 19 ++++++++++++++++++-
1 files changed, 18 insertions(+), 1 deletions(-)


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup
Date: 2011-06-01 03:19:16
Message-ID: BANLkTi=xnvH+fn2Xnqi74MojDusPSp7K8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Sat, May 28, 2011 at 3:13 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Also, if the postmaster.pid file is not there at any point after we've
> waited 5 seconds, assume the postmaster has failed and report that, rather
> than almost-certainly-fruitlessly continuing to wait.  The pidfile should
> appear almost instantly even when there is extensive startup work to do,
> so 5 seconds is already a very conservative figure.  This part is per a
> gripe from MauMau --- there might be better ways to do it, but nothing
> simple enough to get done for 9.1.

The postmaster.pid can remain even after the server dies, for example,
in the case where it dies with PANIC or SIGKILL. To address this corner
case, we should check whether postmaster is really running by sending
the signal 0 after we read postmater.pid file? Attached patch does that.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
pg_ctl_change_v1.patch text/x-patch 483 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup
Date: 2011-06-01 03:30:50
Message-ID: 7649.1306899050@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Sat, May 28, 2011 at 3:13 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Also, if the postmaster.pid file is not there at any point after we've
>> waited 5 seconds, assume the postmaster has failed and report that, rather
>> than almost-certainly-fruitlessly continuing to wait. The pidfile should
>> appear almost instantly even when there is extensive startup work to do,
>> so 5 seconds is already a very conservative figure. This part is per a
>> gripe from MauMau --- there might be better ways to do it, but nothing
>> simple enough to get done for 9.1.

> The postmaster.pid can remain even after the server dies, for example,
> in the case where it dies with PANIC or SIGKILL.

Yeah, there are other corner cases still not covered.

> To address this corner
> case, we should check whether postmaster is really running by sending
> the signal 0 after we read postmater.pid file? Attached patch does that.

I find myself unimpressed by this approach, because it supposes that the
postmaster got as far as creating postmaster.pid.

My preference for fixing this would be to arrange for the postmaster to
be the direct child of pg_ctl, so that we could watch for SIGCHLD to
detect premature postmaster exit. Right now that doesn't work because
we are invoking an intermediate shell via system(), but I think that
could be avoided with a bit more effort (ie, an explicit fork and exec).
Not sure how that all translates into Windows-ville, though.

regards, tom lane


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup
Date: 2011-06-01 04:24:09
Message-ID: BANLkTikOwTEZYoYUq9jDC5HeA73BoSjMGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Wed, Jun 1, 2011 at 12:30 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> To address this corner
>> case, we should check whether postmaster is really running by sending
>> the signal 0 after we read postmater.pid file? Attached patch does that.
>
> I find myself unimpressed by this approach, because it supposes that the
> postmaster got as far as creating postmaster.pid.

Sorry, I could not understand the reason why you were unimpressed.
Could you explain it in a little more detail?

> My preference for fixing this would be to arrange for the postmaster to
> be the direct child of pg_ctl, so that we could watch for SIGCHLD to
> detect premature postmaster exit.  Right now that doesn't work because
> we are invoking an intermediate shell via system(), but I think that
> could be avoided with a bit more effort (ie, an explicit fork and exec).
> Not sure how that all translates into Windows-ville, though.

Yeah, it sounds difficult to implement that for Windows. pg_ctl might
need to invoke the thread for receiving the SIGCHLD, like the server
does.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Improve corner cases in pg_ctl's new wait-for-postmaster-startup
Date: 2011-06-01 14:51:57
Message-ID: 18441.1306939917@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Wed, Jun 1, 2011 at 12:30 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> To address this corner
>>> case, we should check whether postmaster is really running by sending
>>> the signal 0 after we read postmater.pid file? Attached patch does that.

>> I find myself unimpressed by this approach, because it supposes that the
>> postmaster got as far as creating postmaster.pid.

> Sorry, I could not understand the reason why you were unimpressed.
> Could you explain it in a little more detail?

[ thinks some more... ] Actually, there's more merit to your suggestion
than I saw at first, but it's still got an issue. We can divide
postmaster failures into four cases:
1. postmaster fails before creating postmaster.pid,
and there was no pre-existing postmaster.pid file
2. postmaster fails before creating postmaster.pid,
but there was a pre-existing postmaster.pid file
3. postmaster fails after creating postmaster.pid,
and successfully removes postmaster.pid
4. postmaster fails after creating postmaster.pid,
and fails to remove postmaster.pid
The current HEAD code will detect 1 and 3 (after 5 seconds), and will
detect case 2 by virtue of noticing a stale timestamp in the old
pidfile; but it will wait till timeout in case 4. If we add your
suggestion to what's there now, it will cover case 4. It doesn't cover
case 1, and might not cover case 3 (if the pidfile was there for so
short a time that we never saw it) but that really isn't a problem
because the existing timeout logic handles those cases.

The problem I've got with the proposed change is that it's brittle
against case 2: it might pick up a PID from a stale pidfile and then
conclude that the postmaster died, when actually the postmaster hasn't
yet written a new pidfile. However, the existing code is also brittle
in this case, because when it sees that the pidfile is stale, it
immediately fails.

I think we can make it better by simply ignoring a pidfile with a stale
timestamp (hoping for it to be overwritten), and remembering the PID to
try kill(pid, 0) on from the first time we successfully parse the file.

regards, tom lane