Re: Buildfarm owners: check if your HEAD build is stuck

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Buildfarm owners: check if your HEAD build is stuck
Date: 2006-08-13 02:26:14
Message-ID: 44DE8DC6.2010903@dunslane.net
Lists: pgsql-hackers

Tom Lane wrote:
> A number of the buildfarm machines have been failing HEAD builds
> at the "make check" stage since last night, with complaints like
> this one from emu:
>
> ================== pgsql.21911/src/test/regress/log/postmaster.log ===================
> FATAL: lock file "/tmp/.s.PGSQL.55678.lock" already exists
> HINT: Is another postmaster (PID 23692) using socket file "/tmp/.s.PGSQL.55678"?
>
> What's happened is that that GUC patch that was in the tree for a few
> hours broke postmaster startup on some machines (for as-yet-unidentified
> reasons). The postmaster does actually start and establish its
> lockfiles, but it never gets to the stage of being able to accept
> connections.
>
> After the buildfarm script rm -rf's the build tree, the postmaster
> process is still there but "disembodied" (its executable file is
> probably gone, for example, or at least in the state of zero remaining
> directory links). But it's still got that socket file and lockfile
> in /tmp, and this prevents another postmaster from starting with the
> same port number.
>
> If you've got this situation, you'll need to do a manual "kill" on the
> PID mentioned in the lock file before things will start working again.
> (pg_ctl won't work because it looks for the data directory
> postmaster.pid file, which is long gone.) More generally you might want
> to look through a ps listing for unexpected postgres-owned processes.
>
> I'm not sure whether there's anything much we can do to prevent such
> problems in future. Maybe it'd be reasonable for pg_regress to do a
> kill -9 on its postmaster child process if it gives up waiting for the
> postmaster to accept connections.

That's amazingly ugly, and well diagnosed.

BTW, buildfarm processes would typically not be postgres owned, at least
not on my machines. I run either as myself or as a special buildfarm user.

I'm trying to think how we could harden the buildfarm script to avoid
such situations, although I am so far without any great revelations.

The idea of getting pg_regress to send a signal isn't bad, but what if the
PID gets reused? We know that not all systems allocate PIDs in a
cyclical fashion.
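
As a rough sketch of the kill-on-timeout idea (not the actual pg_regress code): on POSIX systems the PID of a direct child stays reserved until the parent has waited for it, so signalling a child that has not yet been reaped should not hit an unrelated process. The helper name stop_unreaped_child is hypothetical, and the sketch assumes the postmaster is a direct child of the caller rather than a grandchild started through an intermediate shell.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns 0 if the child had already exited (it is reaped here and no
 * signal is sent), 1 if it was still running and we killed and reaped
 * it, or -1 on error.
 */
static int
stop_unreaped_child(pid_t child)
{
    int     status;
    pid_t   r;

    /* Never pass 0 or a negative PID to kill(): those address groups. */
    assert(child > 0);

    r = waitpid(child, &status, WNOHANG);
    if (r == child)
        return 0;       /* already gone; its PID may now be reused */
    if (r < 0)
        return -1;      /* not our child, or some other failure */

    /*
     * r == 0: the child exists and has not been reaped, so the kernel
     * cannot have handed its PID to another process yet.
     */
    if (kill(child, SIGKILL) != 0)
        return -1;

    /* Reap it so that no zombie is left behind. */
    if (waitpid(child, &status, 0) != child)
        return -1;

    return 1;
}
```

The caller would invoke this once it gives up waiting for connections, before removing the build tree. It does nothing for a postmaster whose parent has already exited, which is the stuck-buildfarm case described above.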

Also, I see the pg_regress code has this comment:

    /*
     * Fail immediately if postmaster has exited
     *
     * XXX is there a way to do this on Windows?
     */

As I understand it, the way to do it is to call OpenProcess() - if that
succeeds then it is still there. I guess if needed we could even do that
in src/port/kill.c so that kill(pid,0) would work. But I would want
confirmation from the Windows gurus.
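
A minimal sketch of what such a check might look like, pending that confirmation. The helper name process_is_alive is hypothetical and this is not the code in src/port/kill.c. One assumption worth flagging: OpenProcess() can succeed for a process that has already exited if another handle to it is still open, so the sketch also calls GetExitCodeProcess() and compares against STILL_ACTIVE (which would misreport a process that really exits with code 259). The non-Windows branch is the usual kill(pid, 0) probe.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>
#endif

/*
 * Hypothetical helper, for illustration only: report whether a process
 * with the given PID currently exists.
 */
static bool
process_is_alive(long pid)
{
#ifdef _WIN32
    HANDLE  h;
    DWORD   code;
    bool    alive;

    h = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, (DWORD) pid);
    if (h == NULL)
    {
        /* Access denied means the process exists but we may not query it. */
        return GetLastError() == ERROR_ACCESS_DENIED;
    }

    /* A handle alone is not proof of life; check the exit status too. */
    alive = GetExitCodeProcess(h, &code) && code == STILL_ACTIVE;
    CloseHandle(h);
    return alive;
#else
    /* Signal 0 performs the existence and permission checks only. */
    if (kill((pid_t) pid, 0) == 0)
        return true;

    /* EPERM: the process exists but belongs to someone else. */
    return errno == EPERM;
#endif
}
```

Either branch only says that some process has that PID right now; it cannot tell whether it is the same process that was started earlier.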

cheers

andrew
