Re: Latch implementation that wakes on postmaster death on both win32 and Unix

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Latch implementation that wakes on postmaster death on both win32 and Unix
Date: 2011-07-04 21:11:05
Message-ID: CAEYLb_VRP5pDCGkwUr48zQEpCvrO2EsB0+s0-SsAij2wzaCfJA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4 July 2011 16:53, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Ok, here's a new patch, addressing the issues Fujii raised, and with a bunch
> of stylistic changes of my own. Also, I committed a patch to remove
> silent_mode, so the fork_process() changes are now gone. I'm going to sleep
> over this and review once again tomorrow, and commit if it still looks good
> to me and no-one else reports new issues.

Looks good.

> I don't like the names POSTMASTER_FD_WATCH and POSTMASTER_FD_OWN. At a quick
> glance, it's not at all clear which is which. I couldn't come up with better
> names, so for now I just added some comments to clarify that. I would find
> WRITE/READ more clear, but to make sense of that you need to how the pipe is
> used. Any suggestions or opinions on that?

We could bikeshed about that until the cows come home, but we're not
going to come up with names that make the purpose of each evident at a
glance - it's too involved. If we could, we would have thought of them
already. Besides, I've probably already written all the client code
those macros will ever have.

On 4 July 2011 17:36, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Jul4, 2011, at 17:53 , Heikki Linnakangas wrote:
>>>       Under Linux, select() may report a socket file descriptor as "ready for
>>>       reading",  while nevertheless a subsequent read blocks.  This could for
>>>       example happen when data has arrived but  upon  examination  has  wrong
>>>       checksum and is discarded.  There may be other circumstances in which a
>>>       file descriptor is spuriously reported as ready.  Thus it may be  safer
>>>       to use O_NONBLOCK on sockets that should not block.
>>
>> So in theory, on Linux you might WaitLatch might sometimes incorrectly return WL_POSTMASTER_DEATH. None of the callers check for WL_POSTMASTER_DEATH return code, they call PostmasterIsAlive() before assuming the postmaster has died, so that won't affect correctness at the moment. I doubt that scenario can even happen in our case, select() on a pipe that is never written to. But maybe we should add add an assertion to WaitLatch to assert that if select() reports that the postmaster pipe has been closed, PostmasterIsAlive() also returns false.
>
> The correct solution would be to read() from the pipe after select()
> returns, and only return WL_POSTMASTER_DEATCH if the read doesn't return
> EAGAIN. To prevent that read() from blocking if the read event was indeed
> spurious, O_NONBLOCK must be set on the pipe but that patch does that already.

Let's have some perspective on this. We're talking about a highly
doubtful chance that latches may wake when they shouldn't. Latches are
typically expected to wake up for a variety of reasons, and if that
occurred in the archiver's case with my patch applied, I think we'd
just go asleep again without anything happening. It seems likely that
latch client code in general will never trip up on this, as long as
its not exclusively relying on the waitLatch() return value to report
pm death.

Maybe we should restore the return value of WaitLatch to its previous
format (so it doesn't return a bitmask)? That way, we don't report
that the Postmaster died, and therefore clients are required to call
PostmasterIsAlive() to be sure. In any case, I'm in favour of the
assertion.

> Btw, with the death-watch / life-sign / whatever infrastructure in place,
> shouldn't PostmasterIsAlive() be using that instead of getppid() / kill(0)?

Hmm, maybe. That seems like a separate issue though, that can be
addressed with another patch. It does have the considerable
disadvantage of making Heikki's proposed assertion failure useless. Is
the implementation of PostmasterIsAlive() really a problem at the
moment?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2011-07-04 21:30:44 Re: %ENV warnings during builds
Previous Message Heikki Linnakangas 2011-07-04 19:45:04 Re: Perl 5.8 requirement for compiling on windows?