Buildfarm owners: check if your HEAD build is stuck

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Buildfarm owners: check if your HEAD build is stuck
Date: 2006-08-12 15:29:46
Message-ID: 27932.1155396586@sss.pgh.pa.us

A number of the buildfarm machines have been failing HEAD builds
at the "make check" stage since last night, with complaints like
this one from emu:

================== pgsql.21911/src/test/regress/log/postmaster.log ===================
FATAL: lock file "/tmp/.s.PGSQL.55678.lock" already exists
HINT: Is another postmaster (PID 23692) using socket file "/tmp/.s.PGSQL.55678"?

What's happened is that that GUC patch that was in the tree for a few
hours broke postmaster startup on some machines (for as-yet-unidentified
reasons). The postmaster does actually start and establish its
lockfiles, but it never gets to the stage of being able to accept
connections.

After the buildfarm script rm -rf's the build tree, the postmaster
process is still there but "disembodied" (its executable file is
probably gone, for example, or at least in the state of zero remaining
directory links). But it's still got that socket file and lockfile
in /tmp, and this prevents another postmaster from starting with the
same port number.

If you've got this situation, you'll need to do a manual "kill" on the
PID mentioned in the lock file before things will start working again.
(pg_ctl won't work because it looks for the data directory
postmaster.pid file, which is long gone.) More generally you might want
to look through a ps listing for unexpected postgres-owned processes.
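
For illustration, the manual cleanup described above could be scripted roughly as follows. This is a minimal sketch, not anything from the PostgreSQL tree: it assumes the socket lock file stores the owning PID as a decimal number on its first line (which is what the HINT message suggests), and the names read_lockfile_pid and kill_lockfile_owner are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Read the PID stored on the first line of a lock file.
 * Returns the PID, or -1 if the file cannot be opened or does not
 * start with a positive integer.  (Assumed file layout, see above.)
 */
pid_t
read_lockfile_pid(const char *path)
{
    FILE   *fp;
    long    pid = -1;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0)
        pid = -1;
    fclose(fp);
    return (pid_t) pid;
}

/*
 * Send signal "sig" to the process named in the lock file.
 * Returns 0 on success, -1 if no PID was found or kill() failed.
 */
int
kill_lockfile_owner(const char *path, int sig)
{
    pid_t   pid = read_lockfile_pid(path);

    if (pid <= 0)
        return -1;
    return kill(pid, sig);
}
```

A call such as kill_lockfile_owner("/tmp/.s.PGSQL.55678.lock", SIGTERM) would then stand in for the manual "kill". Checking the PID against a ps listing first is still advisable.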

I'm not sure whether there's anything much we can do to prevent such
problems in future. Maybe it'd be reasonable for pg_regress to do a
kill -9 on its postmaster child process if it gives up waiting for the
postmaster to accept connections.
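
A rough sketch of that idea on Unix follows. It is not the pg_regress code: the function names are hypothetical, and the readiness probe (a plain connect() to the Unix socket path) is an assumed stand-in for whatever check the real test driver performs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Return 1 if a connect() to the Unix socket at "path" succeeds, else 0. */
static int
socket_accepts(const char *path)
{
    struct sockaddr_un addr;
    int     fd;
    int     ok;

    if (strlen(path) >= sizeof(addr.sun_path))
        return 0;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return 0;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);
    ok = (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0);
    close(fd);
    return ok;
}

/*
 * Poll until the child accepts connections.
 * Returns 0 if it became ready, 1 if it exited on its own,
 * 2 if we gave up and killed it with SIGKILL, -1 on error.
 */
int
wait_or_kill(pid_t child, const char *sock_path, int attempts, long delay_ms)
{
    struct timespec delay;
    int     i;

    assert(sock_path != NULL);
    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (i = 0; i < attempts; i++)
    {
        pid_t   r;

        if (socket_accepts(sock_path))
            return 0;
        /* Fail immediately if the child has already exited */
        r = waitpid(child, NULL, WNOHANG);
        if (r == child)
            return 1;
        if (r < 0)
            return -1;
        nanosleep(&delay, NULL);
    }

    /* Gave up waiting: make sure the child does not linger */
    if (kill(child, SIGKILL) != 0)
        return -1;
    if (waitpid(child, NULL, 0) != child)
        return -1;
    return 2;
}
```

Because the caller is the parent of the process it kills and reaps it itself, the PID cannot have been recycled for another process in the meantime.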

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Buildfarm owners: check if your HEAD build is stuck
Date: 2006-08-12 21:43:22
Message-ID: 20060812214322.GA23000@svana.org

On Sat, Aug 12, 2006 at 11:29:46AM -0400, Tom Lane wrote:
> What's happened is that that GUC patch that was in the tree for a few
> hours broke postmaster startup on some machines (for as-yet-unidentified
> reasons). The postmaster does actually start and establish its
> lockfiles, but it never gets to the stage of being able to accept
> connections.

I don't know if it's related, but coverity just started picking up a
use-after-free in parse_value() in guc.c.

At the end of the switch (case PGC_STRING) there's a free(newval)
followed by an assignment of newval to retval->stringval a few lines
further down. They mark it as line 3956 of revision 1.335.

It may not be possible though; coverity is not omniscient.

Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm owners: check if your HEAD build is stuck
Date: 2006-08-13 02:26:14
Message-ID: 44DE8DC6.2010903@dunslane.net

Tom Lane wrote:
> A number of the buildfarm machines have been failing HEAD builds
> at the "make check" stage since last night, with complaints like
> this one from emu:
>
> ================== pgsql.21911/src/test/regress/log/postmaster.log ===================
> FATAL: lock file "/tmp/.s.PGSQL.55678.lock" already exists
> HINT: Is another postmaster (PID 23692) using socket file "/tmp/.s.PGSQL.55678"?
>
> What's happened is that that GUC patch that was in the tree for a few
> hours broke postmaster startup on some machines (for as-yet-unidentified
> reasons). The postmaster does actually start and establish its
> lockfiles, but it never gets to the stage of being able to accept
> connections.
>
> After the buildfarm script rm -rf's the build tree, the postmaster
> process is still there but "disembodied" (its executable file is
> probably gone, for example, or at least in the state of zero remaining
> directory links). But it's still got that socket file and lockfile
> in /tmp, and this prevents another postmaster from starting with the
> same port number.
>
> If you've got this situation, you'll need to do a manual "kill" on the
> PID mentioned in the lock file before things will start working again.
> (pg_ctl won't work because it looks for the data directory
> postmaster.pid file, which is long gone.) More generally you might want
> to look through a ps listing for unexpected postgres-owned processes.
>
> I'm not sure whether there's anything much we can do to prevent such
> problems in future. Maybe it'd be reasonable for pg_regress to do a
> kill -9 on its postmaster child process if it gives up waiting for the
> postmaster to accept connections.

That's amazingly ugly, and well diagnosed.

BTW, buildfarm processes would typically not be postgres owned, at least
not on my machines. I run either as myself or as a special buildfarm user.

I'm trying to think how we could harden the buildfarm script to avoid
such situations, although I am so far without any great revelations.

The idea of getting pg_regress to send a signal isn't bad, but what if the
PID gets reused, since we know not all systems allocate PIDs in a
cyclical fashion?

Also, I see the pg_regress code has this comment:

/*
* Fail immediately if postmaster has exited
*
* XXX is there a way to do this on Windows?
*/

As I understand it, the way to do it is to call OpenProcess() - if that
succeeds then it is still there. I guess if needed we could even do that
in src/port/kill.c so that kill(pid,0) would work. But I would want
confirmation from the Windows gurus.
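
A hedged sketch of such a liveness check is below; process_is_alive is a hypothetical name, not a function in src/port. One caveat on the Windows side: as far as I know, OpenProcess() can still succeed for a process that has exited while other handles to it remain open, so the sketch additionally does a zero-timeout wait on the handle. The Unix branch is the usual kill(pid, 0) idiom.

```c
#ifdef _WIN32
#include <windows.h>
#else
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#endif
#include <assert.h>

/* Return 1 if a process with the given PID appears to be running, else 0. */
int
process_is_alive(long pid)
{
    assert(pid > 0);
#ifdef _WIN32
    {
        HANDLE  h = OpenProcess(SYNCHRONIZE, FALSE, (DWORD) pid);
        DWORD   r;

        if (h == NULL)
            return 0;       /* no such process, or not allowed to open it */
        /* A zero-timeout wait times out only while the process still runs */
        r = WaitForSingleObject(h, 0);
        CloseHandle(h);
        return r == WAIT_TIMEOUT;
    }
#else
    /* Signal 0 performs the existence and permission checks only */
    if (kill((pid_t) pid, 0) == 0)
        return 1;
    /* EPERM means the process exists but belongs to someone else */
    return errno == EPERM;
#endif
}
```

Note that on Unix a child that has exited but not yet been reaped still counts as existing for kill(pid, 0), so a parent checking its own child would normally use waitpid() with WNOHANG instead.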

cheers

andrew


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm owners: check if your HEAD build is stuck
Date: 2006-08-13 18:01:37
Message-ID: 200608131801.k7DI1bc07466@momjian.us

Martijn van Oosterhout wrote:
> On Sat, Aug 12, 2006 at 11:29:46AM -0400, Tom Lane wrote:
> > What's happened is that that GUC patch that was in the tree for a few
> > hours broke postmaster startup on some machines (for as-yet-unidentified
> > reasons). The postmaster does actually start and establish its
> > lockfiles, but it never gets to the stage of being able to accept
> > connections.
>
> I don't know if it's related, but coverity just started picking up a
> use-after-free in parse_value() in guc.c.
>
> At the end of the switch (case PGC_STRING) there's a free(newval)
> followed by an assignment of newval to retval->stringval a few lines
> further down. They mark it as line 3956 of revision 1.335.
>
> It may not be possible though; coverity is not omniscient.

Yes, that was the area of the problem. It has been redesigned.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm owners: check if your HEAD build is stuck
Date: 2006-08-13 20:12:31
Message-ID: 3265.1155499951@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> Tom Lane wrote:
>> I'm not sure whether there's anything much we can do to prevent such
>> problems in future. Maybe it'd be reasonable for pg_regress to do a
>> kill -9 on its postmaster child process if it gives up waiting for the
>> postmaster to accept connections.

> I'm trying to think how we could harden the buildfarm script to avoid
> such situations, although I am so far without any great revelations.
> The idea of getting pg_regress to send a signal isn't bad, but what if the
> PID gets reused, since we know not all systems allocate PIDs in a
> cyclical fashion?

I think it'd be OK on Unix --- even if the PID has been reused by the
time pg_regress tries to kill the child, presumably the reuse would be
under a different userid and pg_regress wouldn't have permission to kill
it.

I am not clear on how to do something equivalent under Windows though.
We'd have a HANDLE not a PID coming back from spawn_process, so I
suppose there should not be a confusion-of-identity problem, but I don't
know what the syscall equivalent to "kill(pid, SIGKILL)" would be.
Another problem is that under Unix we will have the exact postmaster PID
to try to kill(), because (a) spawn_process uses execl() not system() to
invoke the sub-shell and (b) we tell the sub-shell to exec not just call
the postmaster. I think under Windows we probably have a HANDLE for an
instance of the command line processor, not the postmaster as such, and
so I'm worried that killing it would not kill the postmaster anyway.
Does Windows have a syscall that would say "kill this process and all
its children too"?
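
One possible shape for this, offered as a sketch under assumptions rather than a tested Windows solution: TerminateProcess() on the HANDLE is the rough analogue of kill(pid, SIGKILL), and a job object can cover a process together with the children it starts after being assigned to the job. The Unix branch uses a process group for comparison. The names proc_group_t, NO_GROUP, group_for_child and kill_group are hypothetical.

```c
#ifdef _WIN32
#include <windows.h>
typedef HANDLE proc_group_t;
#define NO_GROUP NULL
#else
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>
typedef pid_t proc_group_t;
#define NO_GROUP ((pid_t) -1)
#endif
#include <assert.h>

/*
 * Put an already-started child into its own group so that it and the
 * processes it starts afterwards can be killed together.
 * "child" is a process HANDLE on Windows and a PID elsewhere.
 * Returns NO_GROUP on failure.
 */
proc_group_t
group_for_child(proc_group_t child)
{
#ifdef _WIN32
    HANDLE  job = CreateJobObject(NULL, NULL);

    if (job != NULL && !AssignProcessToJobObject(job, child))
    {
        CloseHandle(job);
        job = NULL;
    }
    return job;
#else
    assert(child > 0);
    /* Make the child the leader of a new process group */
    if (setpgid(child, child) != 0)
        return NO_GROUP;
    return child;
#endif
}

/* Kill every process in the group. Returns 0 on success, -1 on failure. */
int
kill_group(proc_group_t group)
{
#ifdef _WIN32
    if (group == NULL)
        return -1;
    return TerminateJobObject(group, 1) ? 0 : -1;
#else
    if (group <= 1)
        return -1;          /* never signal our own group or every process */
    return kill(-group, SIGKILL);
#endif
}
```

On Windows the assignment would have to happen before the command processor launches the postmaster (for instance by creating the child suspended), since processes started before the assignment are not covered by the job.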

It may be worth doing the SIGKILL on Unix even if we don't have a
solution for Windows, but it'd be nice to have a solution for
the Windows port too.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm owners: check if your HEAD build is stuck
Date: 2006-08-13 20:42:02
Message-ID: 5626.1155501722@sss.pgh.pa.us

I wrote:
> It may be worth doing the SIGKILL on Unix even if we don't have a
> solution for Windows, but it'd be nice to have a solution for
> the Windows port too.

I've applied a trivial patch to do the SIGKILL on non-Windows machines.
If any Windows gurus can make it work on Windows too, go for it.
I suspect that spawn_process() and the two kill() calls will all need
to be modified.

regards, tom lane