"pgstat wait timeout" just got a lot more common on Windows

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-09 14:31:12
Message-ID: 20081.1336573872@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Last night I changed the stats collector process to use
WaitLatchOrSocket instead of a periodic forced wakeup to see whether
the postmaster has died. This morning I observe that several Windows
buildfarm members are showing regression test failures caused by
unexpected "pgstat wait timeout" warnings. Everybody else is fine.

This suggests that there is something broken in the Windows
implementation of WaitLatchOrSocket. I wonder whether it also
tells us something we did not know about the underlying cause of
those messages. Not sure what though. Ideas? Can anyone who
knows Windows take another look at WaitLatchOrSocket?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-10 14:58:26
Message-ID: 426.1336661906@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Last night I changed the stats collector process to use
> WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> the postmaster has died. This morning I observe that several Windows
> buildfarm members are showing regression test failures caused by
> unexpected "pgstat wait timeout" warnings. Everybody else is fine.

> This suggests that there is something broken in the Windows
> implementation of WaitLatchOrSocket. I wonder whether it also
> tells us something we did not know about the underlying cause of
> those messages. Not sure what though. Ideas? Can anyone who
> knows Windows take another look at WaitLatchOrSocket?

Anybody have any clues about that? If not, I think I'll have to revert
the pgstat changes for beta1, which isn't really forward progress.

I spent some time staring at the Windows WaitLatchOrSocket code myself.
The only thing I could find that seemed wrong is that in the event
array, we list the latch's event before pgwin32_signal_event. The
Microsoft documentation I looked at says that if more than one event
is ready, WaitforMultipleObjects reports the first such array member.
This means that if the latch is already set when control gets here,
signal handlers will not be serviced. That doesn't match what would
happen on a Unix machine, so it seems like at least a violation of the
POLA. Hence I think we oughta swap the order of those two array
elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
pgwin32_select.) I do not however see a way that that would explain the
pgstat failures, because the stats collector's latch really shouldn't
ever get set during normal regression test runs.

regards, tom lane


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-10 15:27:19
Message-ID: CABUevEwoivuFOkyWPBP=rWKywd1OxC7aROG1xfyULv5hqQnXkg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On May 10, 2012 4:59 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
> > Last night I changed the stats collector process to use
> > WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> > the postmaster has died. This morning I observe that several Windows
> > buildfarm members are showing regression test failures caused by
> > unexpected "pgstat wait timeout" warnings. Everybody else is fine.
>
> > This suggests that there is something broken in the Windows
> > implementation of WaitLatchOrSocket. I wonder whether it also
> > tells us something we did not know about the underlying cause of
> > those messages. Not sure what though. Ideas? Can anyone who
> > knows Windows take another look at WaitLatchOrSocket?
>
> Anybody have any clues about that? If not, I think I'll have to revert
> the pgstat changes for beta1, which isn't really forward progress.

Haven't had time to look at the code itself, and won't before wrap time.
Sorry.

> I spent some time staring at the Windows WaitLatchOrSocket code myself.
> The only thing I could find that seemed wrong is that in the event
> array, we list the latch's event before pgwin32_signal_event. The
> Microsoft documentation I looked at says that if more than one event
> is ready, WaitforMultipleObjects reports the first such array member.
> This means that if the latch is already set when control gets here,
> signal handlers will not be serviced.

Yeah, that does seem wrong.

> That doesn't match what would
> happen on a Unix machine, so it seems like at least a violation of the
> POLA. Hence I think we oughta swap the order of those two array
> elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.) I do not however

Maybe we need a loop that checks for all events?

> see a way that that would explain the
> pgstat failures, because the stats collector's latch really shouldn't
> ever get set during normal regression test runs.

So could there be something wrong in the other end, meaning the latch
*does* get set?

/Magnus


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-10 15:37:36
Message-ID: 1735.1336664256@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> On May 10, 2012 4:59 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I spent some time staring at the Windows WaitLatchOrSocket code myself.
>> The only thing I could find that seemed wrong is that in the event
>> array, we list the latch's event before pgwin32_signal_event. The
>> Microsoft documentation I looked at says that if more than one event
>> is ready, WaitforMultipleObjects reports the first such array member.
>> This means that if the latch is already set when control gets here,
>> signal handlers will not be serviced.

> Yeah, that does seem wrong.

>> That doesn't match what would
>> happen on a Unix machine, so it seems like at least a violation of the
>> POLA. Hence I think we oughta swap the order of those two array
>> elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
>> pgwin32_select.) I do not however

> Maybe we need a loop that checks for all events?

I don't think so. It's already the case that WaitLatch doesn't
guarantee that all possible flags are set in its result. In connection
with Peter G's observation that we could simplify the API by rechecking
PostmasterIsAlive for WL_POSTMASTER_DEATH, I was planning to clarify
the API spec as "result bits that are set are guaranteed to reflect
reality, but it's not guaranteed that we set every bit that could
possibly be set". This should not break any caller since the same
result could occur given a slight change in timing anyway; the caller
has to be prepared to come back and check for more conditions after it
services whatever WaitLatch does report. However, signal service is
not a condition the caller is supposed to deal with, so I think we
want a guarantee that that happens inside WaitLatch.

>> see a way that that would explain the
>> pgstat failures, because the stats collector's latch really shouldn't
>> ever get set during normal regression test runs.

> So could there be something wrong in the other end, meaning the latch
> *does* get set?

Even if it did, it'd get cleared at the top of the loop, so that the
next call ought to handle things. Tis a puzzlement. AFAICS the only
condition WaitforMultipleObjects is going to see in these tests is
read-ready on the socket; surely it wouldn't fail to notice that?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-10 16:52:23
Message-ID: 12478.1336668743@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Hence I think we oughta swap the order of those two array
> elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.)

Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
nasty bug: it does not reset ImmediateInterruptOK before returning.
How is it that Windows machines aren't falling over constantly?

regards, tom lane


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-11 09:48:28
Message-ID: CABUevEwzMw9zpCpreomUJSdWALbwNjFikdTJu4NbTe5EMrhvPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Hence I think we oughta swap the order of those two array
>> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
>> pgwin32_select.)
>
> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
> nasty bug: it does not reset ImmediateInterruptOK before returning.
> How is it that Windows machines aren't falling over constantly?

Hmm. the commit you made to fix it says it changes how
ImmediateInterruptOK is handled, but there was not a single line of
code that actually changed that? Or am I misreading this completely?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-11 13:35:14
Message-ID: 2506.1336743314@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
>> nasty bug: it does not reset ImmediateInterruptOK before returning.
>> How is it that Windows machines aren't falling over constantly?

> Hmm. the commit you made to fix it says it changes how
> ImmediateInterruptOK is handled, but there was not a single line of
> code that actually changed that? Or am I misreading this completely?

Exit is now out the bottom of the loop, not by a raw "return;".

regards, tom lane


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: "pgstat wait timeout" just got a lot more common on Windows
Date: 2012-05-12 12:18:05
Message-ID: CABUevEzeta1MAYh8uzbtY2eH9ssd8L8G7a78izmxjw1_d1mbXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, May 11, 2012 at 3:35 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>> On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
>>> nasty bug: it does not reset ImmediateInterruptOK before returning.
>>> How is it that Windows machines aren't falling over constantly?
>
>> Hmm. the commit you made to fix it says it changes how
>> ImmediateInterruptOK is handled, but there was not a single line of
>> code that actually changed that? Or am I misreading this completely?
>
> Exit is now out the bottom of the loop, not by a raw "return;".

oh, d'uh. Sorry, missed that one completely.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/