Re: Latch implementation that wakes on postmaster death on both win32 and Unix

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Latch implementation that wakes on postmaster death on both win32 and Unix
Date: 2011-06-22 12:11:37
Message-ID: BANLkTinB63u3vDEaqoD+Ls2ag++kh0+d2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Attached patch addresses Fujii's more recent concerns.

On 22 June 2011 04:54, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> +WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
> +WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket
> sock, long timeout)
>
> If 'wakeEvent' is zero, we cannot get out of WaitLatch(). Something like
> Assert(waitEvents != 0) is required? Or, WaitLatch() should always wait
> on latch even when 'waitEvents' is zero?

Well, not waking when the client has not specified an event to wake on
is the correct thing to do in that case. It would also be inherently
undesirable, so I'd be happy to guard against it using an assertion.
Both implementations now use one.

> In unix_latch.c, select() in WaitLatchOrSocket() checks the timeout only when
> WL_TIMEOUT is set in 'wakeEvents'. OTOH, in win32_latch.c,
> WaitForMultipleObjects()
> in WaitLatchOrSocket() always checks the timeout even if WL_TIMEOUT is not
> given. Is this intentional?

No, it's a mistake. Fixed.

> +               else if (rc == WAIT_OBJECT_0 + 2 &&
> +                                ((wakeEvents & WL_SOCKET_READABLE) || (wakeEvents & WL_SOCKET_WRITEABLE)))
>
> Another corner case: when WL_POSTMASTER_DEATH and WL_SOCKET_READABLE
> are given and 'sock' is set to PGINVALID_SOCKET, we can wrongly pass through the
> above check. If this OK?

I see your point - Assert(sock != PGINVALID_SOCKET) could be violated.
We fix the issue now by setting and checking a bool that simply
indicates that we're interested in sockets.

>                rc = WaitForMultipleObjects(numevents, events, FALSE,
>                                                           (timeout >= 0) ? (timeout / 1000) : INFINITE);
> -               if (rc == WAIT_FAILED)
> +               if ( (wakeEvents & WL_POSTMASTER_DEATH) &&
> +                        !PostmasterIsAlive(true))
>
> After WaitForMultipleObjects() detects the death of postmaster,
> WaitForSingleObject()
> is called in PostmasterIsAlive(). In this case, what code does
> WaitForSingleObject() return?
> I wonder if WaitForSingleObject() returns the code other than
> WAIT_TIMEOUT and really
> can detect the death of postmaster.

As noted up-thread, the fact that the archiver does wake and finish on
Postmaster death can be clearly observed on Windows. I'm not sure why
you wonder that, as this is fairly standard use of
PostmasterIsAlive(). I've verified that the waitLatch() call
correctly reports Postmaster death in its return code on Windows, and
indeed that it actually wakes up.

Are you suggesting that there should be a defensive else if{ } for the
case where PostmasterIsAlive() incorrectly reports that the PM is
alive due to some implementation related race-condition, and we've
already considered every other possibility? Well, I suppose that's not
necessary, because we will loop until we find a reason - it's okay to
miss it the first time around, because whatever caused
WaitForMultipleObjects() to wake up will cause it to immediately
return for the next iteration.

In any case, we don't rely on the PostmasterIsAlive() call at all
anymore, so it doesn't matter. We just look at rc's value now, as we
do for every other case, though it's a bit trickier when checking
Postmaster death. Similarly, we don't have a PostmasterIsAlive() call
within the unix latch implementation anymore.

> +       if (fcntl(postmaster_alive_fds[POSTMASTER_FD_WATCH], F_GETFL) < 0)
> +       {
> +               ereport(FATAL,
> +                       (errcode_for_socket_access(),
> +                        errmsg("failed to set the postmaster death watching fd's flags:
> %s", strerror(errno))));
> +       }
>
> Is the above check really required? It's harmless, but looks unnecessary.

Yes, it's not possible for it to detect an error condition now. Removed.

> '%m' should be used instead of '%s' and 'strerror(errno)'.

It is of course better to use the simpler, built-in facility. Fixed.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
new_latch.v6.patch text/x-patch 23.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Radosław Smogura 2011-06-22 12:31:01 Re: Hugetables question
Previous Message Marti Raudsepp 2011-06-22 11:24:17 Re: Hugetables question