Re: pg_listener entries deleted under heavy NOTIFY load only on Windows

Lists: pgsql-bugs
From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-15 22:33:11
Message-ID: 496FB9A7.4080209@wsi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Under a heavy load of NOTIFY events, entries in the pg_listener table
for some events are deleted, effectively acting as though UNLISTEN were
called.

I have only been able to make this occur on a PostgreSQL server running
on Windows. I cannot reproduce under Linux.
PostgreSQL version: 8.3.4
Operating System: Windows XP

I have provided some shell script that can reproduce the problem. It
helps to use three psql sessions.

First start a session that listens for two events ("first" and
"second"), then periodically polls for notifications:
check_for_events | psql -h windows_server -U postgres postgres

Second, start an interactive session and execute the command "SELECT *
from pg_listener". You should see the events "first" and "second" listed.

Third, start a session to do repeated notification. The script produces
"NOTIFY first" and "NOTIFY second" in separate transactions repeatedly.
generate_events | psql -h windows_server -U postgres postgres

Continue to check the pg_listener table in the interactive session.
Very quickly one of the two events will disappear from the table.

Strangely, when I execute the same scripts against a database server on
a CentOS Linux system, this problem does not occur.

Attachment Content-Type Size
check_for_events text/plain 235 bytes
generate_events text/plain 573 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-16 03:24:36
Message-ID: 27010.1232076276@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Marshall, Steve" <smarshall(at)wsi(dot)com> writes:
> Under a heavy load of NOTIFY events, entries in the pg_listener table
> for some events are deleted, effectively acting as though UNLISTEN were
> called.

> I have only been able to make this occur on a PostgreSQL server running
> on Windows.

AFAICS the most likely explanation for this is that Send_Notify() gets
an error from kill() and concludes that the listening process has died
without removing its pg_listener entry; whereupon it removes it itself.
Looking at pgkill(), that theory implies that CallNamedPipe() failed
when under sufficient stress. I'm not sure what the "timeout" parameter
we use with CallNamedPipe actually limits, but maybe it's too small?
(Microsoft's doc suggests that the timeout only matters if the pipe
doesn't already exist, so I'm not sure I believe this theory; though
certainly the doc is vague enough that that reading could be wrong.)

Theory B is that you've got some broken antivirus code on there that
is arbitrarily interfering with the pipe access. The lack of any
similar previous reports suggests that there's some local issue
contributing ,,,

regards, tom lane


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-16 13:27:07
Message-ID: 8536F69C1FCC294B859D07B179F069441176B988@EXCHANGE.ad.wsicorp.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Marshall, Steve" <smarshall(at)wsi(dot)com> writes:
> Under a heavy load of NOTIFY events, entries in the pg_listener table
> for some events are deleted, effectively acting as though UNLISTEN
> were called.

> I have only been able to make this occur on a PostgreSQL server
> running on Windows.
"tom lane writes":
AFAICS the most likely explanation for this is that Send_Notify() gets
an error from kill() and concludes that the listening process has died
without removing its pg_listener entry; whereupon it removes it itself.
Looking at pgkill(), that theory implies that CallNamedPipe() failed
when under sufficient stress. I'm not sure what the "timeout" parameter
we use with CallNamedPipe actually limits, but maybe it's too small?
(Microsoft's doc suggests that the timeout only matters if the pipe
doesn't already exist, so I'm not sure I believe this theory; though
certainly the doc is vague enough that that reading could be wrong.)

Theory B is that you've got some broken antivirus code on there that is
arbitrarily interfering with the pipe access. The lack of any similar
previous reports suggests that there's some local issue contributing ,,,

To explore Theory B, I'll turn off all non-essential services on the
Windows server and rerun the test. I'll report back with what I find.

Any thoughts on how to confirm or deny Theory A?


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-16 13:39:24
Message-ID: 8536F69C1FCC294B859D07B179F069441176B9A2@EXCHANGE.ad.wsicorp.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tome Lane wrote:
Theory B is that you've got some broken antivirus code on there that is
arbitrarily interfering with the pipe access. The lack of any similar
previous reports suggests that there's some local issue contributing ,,,

Steve Marshall wrote:
To explore Theory B, I'll turned off all non-essential services on the
Windows server and rerun the test. This included turning of Symantec AV
and several other services we don't use anyway.

The test turned out exactly the same: one of the events was dropped from
pg_listener within a few seconds of starting the test.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-16 14:22:01
Message-ID: 20090116142201.GD9963@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Marshall, Steve wrote:

> Any thoughts on how to confirm or deny Theory A?

Try changing the 1000 to NMPWAIT_WAIT_FOREVER

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: "Marshall, Steve" <smarshall(at)wsi(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-16 15:20:27
Message-ID: 6346.1232119227@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Marshall, Steve wrote:
>> Any thoughts on how to confirm or deny Theory A?

> Try changing the 1000 to NMPWAIT_WAIT_FOREVER

As long as you're changing the source code, it'd be a good idea to
verify the supposition that kill() is failing, eg
in src/backend/commands/async.c

if (kill(listenerPID, SIGUSR2) < 0)
{
+ elog(LOG, "kill(%d) failed: %m", listenerPID);
/*
* Get rid of pg_listener entry if it refers to a PID that no
* longer exists. Presumably, that backend crashed without
* deleting its pg_listener entries. This code used to only

If that's right, sprinkling a few debug printouts into src/port/kill.c
would be the next step.

regards, tom lane


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-20 14:02:15
Message-ID: 4975D967.1070805@wsi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<blockquote cite="mid20090116142201(dot)GD9963(at)alvh(dot)no-ip(dot)org" type="cite">
<pre wrap="">Marshall, Steve wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Any thoughts on how to confirm or deny Theory A?
</pre>
</blockquote>
<pre wrap=""><!---->Alvaro Herrera wrote:
Try changing the 1000 to NMPWAIT_WAIT_FOREVER
</pre>
</blockquote>
I'd tried Alvaro suggestion and it fixed the problem in my test case.&nbsp;
With the original code, the test would consistantly drop one of the
events from pg_listener within a few seconds.&nbsp; With the changed
timeout, I've run it for about 10 minutes, or 100,000 notifications of
each event, and it worked without error.<br>
<br>
Are there any side effects from effectively disabling this timeout that
should be tested?<br>
<br>
<br>
</body>
</html>

Attachment Content-Type Size
unknown_filename text/html 1.0 KB

From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-27 22:46:33
Message-ID: 8536F69C1FCC294B859D07B179F0694411A4CEF8@EXCHANGE.ad.wsicorp.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

As I posted before, changing the timeout from 1000 to
NMPWAIT_WAIT_FOREVER fixed the problem, or at least moved it such it
does not occur easily anymore.

To better understand the problem, I added debugging as Tom suggested. I
restored timeout on CalledNamedPipe 1000 ms, and reran my tests.
Indeed, kill is encountering an error:
LOG: kill(2168) failed: No such process

I instrumented pgkill to output the value of GetLastError() if
CalledNamedPipe fails. It returned error code 2, which Windows
identifies as ERROR_FILE_NOT_FOUND. The logic in pgkill translates this
Windows error into an errno value of ESRCH.

The Windows error is a bit surprising, at least to me -- I expected
something indicating the pipe was full. Does anyone have a richer
interpretation of this error?

Thanks,
Steve

-----Original Message-----
From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Marshall, Steve wrote:
>> Any thoughts on how to confirm or deny Theory A?

> Try changing the 1000 to NMPWAIT_WAIT_FOREVER

As long as you're changing the source code, it'd be a good idea to
verify the supposition that kill() is failing, eg in
src/backend/commands/async.c

if (kill(listenerPID, SIGUSR2) < 0)
{
+ elog(LOG, "kill(%d) failed: %m",
listenerPID);
/*
* Get rid of pg_listener entry if it
refers to a PID that no
* longer exists. Presumably, that
backend crashed without
* deleting its pg_listener entries.
This code used to only

If that's right, sprinkling a few debug printouts into src/port/kill.c
would be the next step.

regards, tom lane


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-28 17:34:49
Message-ID: 49809739.6060009@wsi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

*** async.c.8.3.5 2009-01-28 12:25:11.966356419 -0500
--- async.c 2009-01-28 12:26:51.955740151 -0500
***************
*** 101,107 ****
#include "utils/fmgroids.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
!

/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
--- 101,107 ----
#include "utils/fmgroids.h"
#include "utils/memutils.h"
#include "utils/ps_status.h"
! #include "storage/procarray.h"

/*
* State for pending LISTEN/UNLISTEN actions consists of an ordered list of
***************
*** 628,633 ****
--- 628,634 ----
}
else
{
+ int listenerExists = 1;
if (Trace_notify)
elog(DEBUG1, "AtCommit_Notify: notifying pid %d",
listenerPID);
***************
*** 649,657 ****
* we should just do it for any failure (certainly at least
* for EPERM too...)
*/
! simple_heap_delete(lRel, &lTuple->t_self);
}
! else if (listener->notification == 0)
{
/* Rewrite the tuple with my PID in notification column */
rTuple = heap_modifytuple(lTuple, tdesc, value, nulls, repl);
--- 650,670 ----
* we should just do it for any failure (certainly at least
* for EPERM too...)
*/
! if (IsBackendPid(listenerPID))
! {
! elog(LOG, "kill(%d) SIGUSR2 failed, but process is running: %m",
! listenerPID);
! }
! else
! {
! elog(LOG, "kill(%d) SIGUSR2 failed, process is NOT running: %m",
! listenerPID);
! listenerExists = 0;
! simple_heap_delete(lRel, &lTuple->t_self);
! }
}
!
! if (listenerExists && listener->notification == 0)
{
/* Rewrite the tuple with my PID in notification column */
rTuple = heap_modifytuple(lTuple, tdesc, value, nulls, repl);

Attachment Content-Type Size
unknown_filename text/html 4.9 KB
async.c.notify_patch text/plain 1.8 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-28 17:44:20
Message-ID: 19100.1233164660@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Marshall, Steve" <smarshall(at)wsi(dot)com> writes:
> Essentially, the new error handling in async.c allows postgres to fail
> in its efforts to signal a process about the presence of a NOTIFY event
> without invalidating the subscription (LISTEN) for that event.

But why would that be a good idea? It seems to me to be a bandaid that
guarantees misbehavior in the infrequent-notification case.

The real question here, which this doesn't advance us towards solving,
is why is the CallNamedPipe call sometimes failing (or at least taking
longer than it seems it ought to).

regards, tom lane


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-28 18:11:26
Message-ID: 8536F69C1FCC294B859D07B179F0694411A4D4C8@EXCHANGE.ad.wsicorp.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

The reason preserving the subscription is good is because the current
behavior silently drops the subscription without informing the
subscribing process in any way. This means applications waiting for
changes to a database tables have the mistaken impression no changes
have occurred.

The reason I think updating the pg_listener tuple even if kill fails may
be based on a mistaken understanding of asynchronous notification. We
could easily avoid it, if it's not the right thing to do. My thought
was that a single byte was written to the named pipe to inform the
listener of the presence of new events, but that the knowledge of what
event(s) are pending is kept in the pg_listener table. Since this
problem only occurs under heavy notification load (i.e. when the pipe
already has data in it), the listener is going to get signalled about
the presence of new events even if one particular kill() call failed.
Let me know if I misunderstand the use of the named pipe and the
pg_listener table in event notification.

Given my understanding, it seems like a good idea to ensure the state of
pg_listener is updated with all events that were received. If we don't
do that, we risk the case that the one kill that fails if for an event
that was only issued that one time. In that case, we would fail to
alert the application that this event occurred.

I don't think a check for process existance is a bad idea, or even a
bandaid. The comment in the code block in async.c says it is removing
the entry in pg_listener because the backend process does not exist.
The code assumes the process is dead, but does not check it. I don't
think it is unreasonable to check this assumption, and behave
accordingly.

If you have suggestions on how to get more information on the cause of
the error, I'd be be willing to look into them. The microsoft
documentation only describes a couple of error conditions for
CallNamedPipe (error 31 is not one of them), and the error code itself
is not very descriptive ("A device attached to the system is not
functioning"). I'm open to ideas on how to dig further.

Steve

-----Original Message-----
From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
Sent: Wednesday, January 28, 2009 12:44 PM
To: Marshall, Steve
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] pg_listener entries deleted under heavy NOTIFY load
only on Windows

"Marshall, Steve" <smarshall(at)wsi(dot)com> writes:
> Essentially, the new error handling in async.c allows postgres to fail

> in its efforts to signal a process about the presence of a NOTIFY
> event without invalidating the subscription (LISTEN) for that event.

But why would that be a good idea? It seems to me to be a bandaid that
guarantees misbehavior in the infrequent-notification case.

The real question here, which this doesn't advance us towards solving,
is why is the CallNamedPipe call sometimes failing (or at least taking
longer than it seems it ought to).

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-28 18:41:13
Message-ID: 20137.1233168073@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Marshall, Steve" <smarshall(at)wsi(dot)com> writes:
> I don't think a check for process existance is a bad idea, or even a
> bandaid. The comment in the code block in async.c says it is removing
> the entry in pg_listener because the backend process does not exist.

In general, the way to see if a process exists is to try to kill() it;
the fact that the kill failed is sufficient proof, at least in
Unix-land. If it's possible for kill() to fail for transient reasons
in our Windows implementation, that's a bug in the Windows emulation
of kill.

Another reason behind the async.c coding is that even if the process
does still exist, there's no point in maintaining a pg_listener entry
for it if we can't signal it.

Thirdly, this is hardly the only place where we expect kill() to work
reliably. You've managed to create a reproducible case illustrating
that it's not being reliable, but the same bug might account for other
failures much harder to reproduce and investigate.

So my opinion is that the real issue here is why is the kill()
implementation failing when it should not. We need to fix that,
not put band-aids in async.c.

As to how to fix it, I'll defer to other people more
Windows-knowledgeable. Maybe taking out the timeout is really
the best answer.

regards, tom lane


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-30 14:01:09
Message-ID: 49830825.1000408@wsi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Tom Lane wrote:

>So my opinion is that the real issue here is why is the kill()
>implementation failing when it should not. We need to fix that,
>not put band-aids in async.c.
>
>
I think Tom makes a valid point. To that end, I think the Window's kill
implementation needs to be changed to have these properties:
1. kill should generally return quickly and should never hang
2. kill should never fail for a transient reason
3. if kill fails, it should be a good indication that the process is
dead or we are permanently unable to communicate with it.

The current implementation has property #1, but not #2 and #3. However,
I think I've figured out a simple way to modify the Window's
implementation of pgkill to achieve all three properties. I will do
some long term testing of the changes over the weekend, just to check my
solution works properly over longer time scales.

Here is what I've found so far:
* Contrary to my previous reports, the notification error is always the
result of pgkill failing with error code 2 (ERROR_FILE_NOT_FOUND). I
had previously thought it had issued error 31, but this was just an
error in my debug message (the signal was 31, i.e. SIGUSR2).
* Also contrary to what I wrote previously, long or infinite timeouts do
not fix the problem. With an infinite timeout, I've avoided the problem
for as long as ten minutes, but it eventually happens. In some cases,
the problem even occurred quickly with an infinite timeout.

The solution that seems to work is to call CallNamedPipe repeatedly in a
loop if it fails. Currently, I call the function up to a maximum of 5
times, although in all test cases so far, the code has never needed more
than 1 retry to succeed. Based on my testing, I may reduce the maximum
number of retries. The code sleeps for 20 ms between retries. I
reduced the timeout for CallNamedPipe from 1000 ms to 250 ms after the
first call, to reduce the total time for the signal if we hit a case
that needs to timeout.

I also notice that signals that should fail, do fail. For example,
signal 30 seems to be regularly sent to pid 1, and this fails in both
the orignal code and my modified version.

Theoretically, I'm not entirely sure why CallNamedPipe fails
occasionally, but will succeed when called with the same arguments a
very short while later. It's hard to know without being able to see the
source code. However, from the Windows documentation, it seems like a
single named pipe in Windows can have several "instances", which seem to
be access interfaces to the pipe. I suspect there is some race
condition where the code erroneously decides it needs to create an
"instance" of the pipe, rather than waiting for an instance to become
available. When the instance creation fails, it generates the
FILE_NOT_FOUND error.

I'll post back on Monday with more complete test results, and, if all
goes well, a patch. If anyone has ideas on what else should be tested,
please let me know.

Steve


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-30 15:18:28
Message-ID: 8699.1233328708@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"Marshall, Steve" <smarshall(at)wsi(dot)com> writes:
> The solution that seems to work is to call CallNamedPipe repeatedly in a
> loop if it fails.

Seems a bit nervous-making, since there's no principled way to know how
many retries to try.

> I also notice that signals that should fail, do fail. For example,
> signal 30 seems to be regularly sent to pid 1, and this fails in both
> the orignal code and my modified version.

That's probably this bug:
http://archives.postgresql.org/pgsql-bugs/2008-11/msg00077.php
http://archives.postgresql.org/pgsql-committers/2008-11/msg00132.php

It shouldn't be happening, but it does offer a convenient test case that
your patch still fails when it should fail ;-)

regards, tom lane


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-02-03 19:43:25
Message-ID: 49889E5D.8020306@wsi.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

I've completed my test for changes to the Windows implementation of
pgkill. I ran two clients constantly doing a pair of notifications,
with one client subscribing to those events. I ran the test for 90
hours. I found 760 cases where the CallNamedPipe function initially
failed, but succeeded after recalling. In all but one of those cases,
only a single retry was needed; the other case required two retries.

As Tom has pointed out, there is no principled way to determine the
correct number of times to retry. Therefore, I have set the maximum
number of retries to 2, which is the worst case found in a fairly long
stress test. This strikes me as a reasonable balance between failing
quickly in truly erroneous cases, while still avoiding transient
pseudo-failures.

A patch with this change is attached. It should be applied to the
8.3.5 version of src/port/kill.c

Steve

Attachment Content-Type Size
kill.c.notify_dropping_patch text/plain 1.4 KB

From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: <pgsql-bugs(at)postgresql(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-02-09 14:58:03
Message-ID: 8536F69C1FCC294B859D07B179F0694411CA0277@EXCHANGE.ad.wsicorp.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Any thoughts on whether or not this patch should be included in a future
PostgreSQL release? If so, any thoughts on additional testing, etc,
that needs to be done? Should the patch be added to a minor release of
8.3, or only included forward in 8.4? In my opinion, this is a pretty
serious error causing some major problems and workarounds in systems I
am working on, should I would vote for inclusion in minor releases.

The post below seems to have become disassociated from the rest of the
thread on this bug, perhaps because it was issued in February, and the
rest of the thread was posted in January. Is there any way to make the
mailing list archives link these messages back to the rest of the
thread?

-----Original Message-----
I've completed my test for changes to the Windows implementation of
pgkill. I ran two clients constantly doing a pair of notifications,
with one client subscribing to those events. I ran the test for 90
hours. I found 760 cases where the CallNamedPipe function initially
failed, but succeeded after recalling. In all but one of those cases,
only a single retry was needed; the other case required two retries.

As Tom has pointed out, there is no principled way to determine the
correct number of times to retry. Therefore, I have set the maximum
number of retries to 2, which is the worst case found in a fairly long
stress test. This strikes me as a reasonable balance between failing
quickly in truly erroneous cases, while still avoiding transient
pseudo-failures.

A patch with this change is attached. It should be applied to the
8.3.5 version of src/port/kill.c

Steve


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-02-09 15:17:17
Message-ID: 200902091517.n19FHHY19664@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Marshall, Steve wrote:
> Any thoughts on whether or not this patch should be included in a future
> PostgreSQL release? If so, any thoughts on additional testing, etc,
> that needs to be done? Should the patch be added to a minor release of
> 8.3, or only included forward in 8.4? In my opinion, this is a pretty
> serious error causing some major problems and workarounds in systems I
> am working on, should I would vote for inclusion in minor releases.
>
> The post below seems to have become disassociated from the rest of the
> thread on this bug, perhaps because it was issued in February, and the
> rest of the thread was posted in January. Is there any way to make the
> mailing list archives link these messages back to the rest of the
> thread?

I am tracking this item and was waiting for someone to reply to it.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-02-10 14:29:23
Message-ID: 49918F43.5000307@hagander.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Marshall, Steve wrote:
> Any thoughts on whether or not this patch should be included in a future
> PostgreSQL release? If so, any thoughts on additional testing, etc,
> that needs to be done? Should the patch be added to a minor release of
> 8.3, or only included forward in 8.4? In my opinion, this is a pretty
> serious error causing some major problems and workarounds in systems I
> am working on, should I would vote for inclusion in minor releases.
>
> The post below seems to have become disassociated from the rest of the
> thread on this bug, perhaps because it was issued in February, and the
> rest of the thread was posted in January. Is there any way to make the
> mailing list archives link these messages back to the rest of the
> thread?

Hi!

Sorry, I've been unable to track this thread properly as it happened tue
to travelling, but I've tried to read back through it now.

What's the timing of the errors? Is there a chance that we are sending
the kill signal before the signal handling thread has actually started
*and created the named pipe*? We set up the signal handling stuff pretty
early, but we do seem to let the postmaster continue it's work before
it's up...

Going back to your tests, did it ever require more than one retry?

//Magnus


From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: "Magnus Hagander" <magnus(at)hagander(dot)net>
Cc: <pgsql-bugs(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-02-10 14:42:54
Message-ID: 8536F69C1FCC294B859D07B179F0694411D4BA6A@EXCHANGE.ad.wsicorp.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs


What's the timing of the errors? Is there a chance that we are sending
the kill signal before the signal handling thread has actually started
*and created the named pipe*?

We set up the signal handling stuff pretty early, but we do seem to let
the postmaster continue it's work before it's up...

Under heavy load, a signal will typically be dropped within the first
few minutes. However, it can sometimes take a little while before the
problem happens. Thousands of the same signal to the same process may
be properly handled before one is mishandled. This is not consistant
with a problem with initial creation of the pipe.

Going back to your tests, did it ever require more than one retry?

Yes, but rarely. In a 90 hour stress test with code that allowed up to 5
calls to CallNamedPipe, I found 760 signals that required a retry. Only
one required two retries. That is why I set the number of retries to 2.
The behavior might be different if the sleep interval between retries
was changed. I used a 20 ms sleep interval between retries in all my
tests, and in the patch I sent.


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: "Marshall, Steve" <smarshall(at)wsi(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-02-15 13:59:05
Message-ID: 49981FA9.8050407@hagander.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Marshall, Steve wrote:
> What's the timing of the errors? Is there a chance that we are sending
> the kill signal before the signal handling thread has actually started
> *and created the named pipe*?
>
> We set up the signal handling stuff pretty early, but we do seem to let
> the postmaster continue it's work before it's up...
>
> Under heavy load, a signal will typically be dropped within the first
> few minutes. However, it can sometimes take a little while before the
> problem happens. Thousands of the same signal to the same process may
> be properly handled before one is mishandled. This is not consistant
> with a problem with initial creation of the pipe.
>
> Going back to your tests, did it ever require more than one retry?
>
> Yes, but rarely. In a 90 hour stress test with code that allowed up to 5
> calls to CallNamedPipe, I found 760 signals that required a retry. Only
> one required two retries. That is why I set the number of retries to 2.
> The behavior might be different if the sleep interval between retries
> was changed. I used a 20 ms sleep interval between retries in all my
> tests, and in the patch I sent.

I have applied a modified version (no functional changes, just
stylistic) of this patch to head and backpatched to 8.2 which is as far
back as we support win32.

//Magnus