Re: SIGUSR1 pingpong between master na autovacum launcher causes crash

Lists: pgsql-hackers
From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 13:22:34
Message-ID: 1250860954.1239.114.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I found following core file of PG 8.4.0 on my system (Solaris Nevada
b119):

fe8ae42d _dowrite (85bf6e8, 3a, 8035e3c, 80350e8) + 8d
fe8ae743 _ndoprnt (85bf6e8, 8035ec8, 8035e3c, 0) + 2ba
fe8b322d vsnprintf (85bfaf0, 3ff, 85bf6e8, 8035ec8, 0, 0) + 65
082194ea appendStringInfoVA (8035e9c, 85bf6e8, 8035ec8) + 4a
083ca5d3 errmsg (849c340, 0) + 103
0829272d StartAutoVacWorker (fe97f000, 32, 85b82b0, 8035ef4, 82a1496, c) + 3d
082a1901 StartAutovacuumWorker (c, 8035f08, fe8ed28f, 10, 0, 8035fbc) + 71
082a1496 sigusr1_handler (10, 0, 8035fbc) + 186
fe8ed28f __sighndlr (10, 0, 8035fbc, 82a1310) + f
fe8e031f call_user_handler (10) + 2af
fe8e054f sigacthandler (10, 0, 8035fbc) + df
--- called from signal handler with signal 16 (SIGUSR1) ---
fe8f37f6 __systemcall (3, fec32b88, 0, fe8e0b46) + 6
fe8e0c71 thr_sigsetmask (3, 85abd50, 0, fe8e0d18) + 139
fe8e0d3f sigprocmask (3, 85abd50, 0) + 31
082a14a4 sigusr1_handler (10, 0, 8036340) + 194
fe8ed28f __sighndlr (10, 0, 8036340, 82a1310) + f
fe8e031f call_user_handler (10) + 2af
fe8e054f sigacthandler (10, 0, 8036340) + df

... 80x same sighandler stack

--- called from signal handler with signal 16 (SIGUSR1) ---
fe8f37f6 __systemcall (3, fec32b88, 0, fe8e0b46) + 6
fe8e0c71 thr_sigsetmask (3, 85abd50, 0, fe8e0d18) + 139
fe8e0d3f sigprocmask (3, 85abd50, 0) + 31
082a14a4 sigusr1_handler (10, 0, 80478fc) + 194
fe8ed28f __sighndlr (10, 0, 80478fc, 82a1310) + f
fe8e031f call_user_handler (10) + 2af
fe8e054f sigacthandler (10, 0, 80478fc) + df
--- called from signal handler with signal 16 (SIGUSR1) ---
fe8f1867 __pollsys (8047b50, 2, 8047c04, 0) + 7
fe89ce61 pselect (6, 8047c44, 0, 0, 8047c04, 0) + 199
fe89d236 select (6, 8047c44, 0, 0, 8047c38, 0) + 78
0829dc20 ServerLoop (feffb804, bd26003b, 41b21fcb, 85c1de0, 1, 0) + c0
0829d5d0 PostmasterMain (3, 85b72c8) + dd0
08227abf main (3, 85b72c8, 8047df0, 8047d9c) + 22f
080b893d _start (3, 8047e80, 8047ea5, 8047ea8, 0, 8047ec2) + 7d

The problem what I see here is that StartAutovacuumWorker() fails and
send SIGUSR1 to the postmaster, but it send it too quickly and signal
handler is still active. When signal mask is unblocked in
sigusr1_handler() than signal handler is run again...

The reason why StartAutovacuumWorker() is interesting. Log says:

LOG: could not fork autovacuum worker process: Not enough space

It is strange and I don't understand it. May be too many nested signal
handlers call could cause it.

Strange also is that 100ms is not enough to protect this situation, but
I think that sleep could interrupted by signal.

My suggestion is to set for example gotUSR1=true in sigusr1_handler()
and in the server loop check if we got a USR1 signal. It avoids any
problems with signal handler which is not currently POSIX compliant
anyway.

any other ideas?

Zdenek


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 19:40:14
Message-ID: 20090821194014.GK5487@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala wrote:

> The problem what I see here is that StartAutovacuumWorker() fails and
> send SIGUSR1 to the postmaster, but it send it too quickly and signal
> handler is still active. When signal mask is unblocked in
> sigusr1_handler() than signal handler is run again...
>
> The reason why StartAutovacuumWorker() is interesting. Log says:
>
> LOG: could not fork autovacuum worker process: Not enough space

Does this mean that the machine is out of swap space?

> It is strange and I don't understand it. May be too many nested signal
> handlers call could cause it.
>
> Strange also is that 100ms is not enough to protect this situation, but
> I think that sleep could interrupted by signal.
>
> My suggestion is to set for example gotUSR1=true in sigusr1_handler()
> and in the server loop check if we got a USR1 signal. It avoids any
> problems with signal handler which is not currently POSIX compliant
> anyway.

What 100ms? The pg_usleep call you see in ServerLoop is only there
during shutdown; normally it would be the select() call that would be
blocking the process.

If sigusr1_handler needs rewriting, don't all the other sighandler as
well? Note that the process is supposed to be running with signals
blocked all the time except during those sleep/select calls, which is
what (according to comments) let the sighandlers do nontrivial tasks.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 20:01:23
Message-ID: 1250884883.1320.66.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Alvaro Herrera píše v pá 21. 08. 2009 v 15:40 -0400:
> Zdenek Kotala wrote:
>
> > The problem what I see here is that StartAutovacuumWorker() fails and
> > send SIGUSR1 to the postmaster, but it send it too quickly and signal
> > handler is still active. When signal mask is unblocked in
> > sigusr1_handler() than signal handler is run again...
> >
> > The reason why StartAutovacuumWorker() is interesting. Log says:
> >
> > LOG: could not fork autovacuum worker process: Not enough space
>
> Does this mean that the machine is out of swap space?

It is ENOMEM error. But it is strange. Machine has 4GB RAM and it was
freshly installed PG84 without any data and with default configuration.
It was not under load. I did not find any clue what happend with memory
on this system. The question is if out of memory was a source or result
of the pinpong.

>
> > It is strange and I don't understand it. May be too many nested signal
> > handlers call could cause it.
> >
> > Strange also is that 100ms is not enough to protect this situation, but
> > I think that sleep could interrupted by signal.
> >
> > My suggestion is to set for example gotUSR1=true in sigusr1_handler()
> > and in the server loop check if we got a USR1 signal. It avoids any
> > problems with signal handler which is not currently POSIX compliant
> > anyway.
>
> What 100ms? The pg_usleep call you see in ServerLoop is only there
> during shutdown; normally it would be the select() call that would be
> blocking the process.

I mean AutoVacLauncherMain()
http://doxygen.postgresql.org/autovacuum_8c.html#19ef1013e6110a4536ed92a454aba8c9
line 656

> If sigusr1_handler needs rewriting, don't all the other sighandler as
> well? Note that the process is supposed to be running with signals
> blocked all the time except during those sleep/select calls, which is
> what (according to comments) let the sighandlers do nontrivial tasks.

Comments says that it is OK. POSIX says that is not OK and my instinct
say to trust the POSIX standard. Especially I do not see any reason why
we need do this in signal handler. avl_sigterm_handler and so on are
good example how it should be implemented in postmaster as well.

The core shows that it is not good idea to have complicated signal
handler.

Zdenek


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 20:18:12
Message-ID: 9594.1250885892@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> If sigusr1_handler needs rewriting, don't all the other sighandler as
> well?

It does not, and neither do they. I'm not sure what happened here but
it wasn't the fault of the postmaster's organization of signal handlers.

It does seem that we ought to change things so that there's a bit more
delay before trying to re-launch a failed autovac worker, though.
Whatever caused this was effectively turning the autovac logic into
a fork-bomb engine. I'm not thinking of just postponing the relaunch
into the main loop, but ensuring at least a few hundred msec delay before
we try again.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 21:33:15
Message-ID: 20090821213315.GO5487@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > If sigusr1_handler needs rewriting, don't all the other sighandler as
> > well?
>
> It does not, and neither do they. I'm not sure what happened here but
> it wasn't the fault of the postmaster's organization of signal handlers.
>
> It does seem that we ought to change things so that there's a bit more
> delay before trying to re-launch a failed autovac worker, though.
> Whatever caused this was effectively turning the autovac logic into
> a fork-bomb engine. I'm not thinking of just postponing the relaunch
> into the main loop, but ensuring at least a few hundred msec delay before
> we try again.

Would it be enough to move the kill() syscall into ServerLoop in
postmaster.c instead of letting it be called in the signal handler, per
the attached patch? This way the signal is not delayed, but we exit the
signal handler before doing it.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
autovac-signal.patch text/x-diff 2.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 21:38:32
Message-ID: 10974.1250890712@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> It does seem that we ought to change things so that there's a bit more
>> delay before trying to re-launch a failed autovac worker, though.
>> Whatever caused this was effectively turning the autovac logic into
>> a fork-bomb engine. I'm not thinking of just postponing the relaunch
>> into the main loop, but ensuring at least a few hundred msec delay before
>> we try again.

> Would it be enough to move the kill() syscall into ServerLoop in
> postmaster.c instead of letting it be called in the signal handler, per
> the attached patch? This way the signal is not delayed, but we exit the
> signal handler before doing it.

I'd still like to have some fork-rate-limiting behavior in there
somewhere. However, it might make sense for the avlauncher to do that
rather than the postmaster. Does that idea seem more implementable?

(If the launcher implements a minimum delay between requests then it
really doesn't matter whether we apply this patch or not, and I'd be
inclined to leave the postmaster alone rather than add yet more state.)

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 21:48:11
Message-ID: 20090821214811.GP5487@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> I'd still like to have some fork-rate-limiting behavior in there
> somewhere. However, it might make sense for the avlauncher to do that
> rather than the postmaster. Does that idea seem more implementable?

Well, there's already rate limiting in the launcher:

if (AutoVacuumShmem->av_signal[AutoVacForkFailed])
{
/*
* If the postmaster failed to start a new worker, we sleep
* for a little while and resend the signal. The new worker's
* state is still in memory, so this is sufficient. After
* that, we restart the main loop.
*
* XXX should we put a limit to the number of times we retry?
* I don't think it makes much sense, because a future start
* of a worker will continue to fail in the same way.
*/
AutoVacuumShmem->av_signal[AutoVacForkFailed] = false;
pg_usleep(100000L); /* 100ms */
SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER);
continue;
}

Does it just need a longer delay?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-21 22:06:25
Message-ID: 11406.1250892385@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> I'd still like to have some fork-rate-limiting behavior in there
>> somewhere. However, it might make sense for the avlauncher to do that
>> rather than the postmaster. Does that idea seem more implementable?

> Well, there's already rate limiting in the launcher:

[ scratches head... ] You know, as I was writing that email the concept
seemed a bit familiar. But if that's in there, how the heck did the
launcher manage to bounce back to the postmaster before the latter got
out of its signal handler? Have you tested this actually works as
intended? Could Zdenek have tested a version that lacks it?

> Does it just need a longer delay?

Maybe, but I think we need to understand exactly what happened first.

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-22 08:30:30
Message-ID: 1250929830.1307.3.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Alvaro Herrera píše v pá 21. 08. 2009 v 17:48 -0400:
> Tom Lane wrote:
>
> > I'd still like to have some fork-rate-limiting behavior in there
> > somewhere. However, it might make sense for the avlauncher to do that
> > rather than the postmaster. Does that idea seem more implementable?
>
<snip>
>
> Does it just need a longer delay?

I see two problem with delay. First is how big it should be? It depends
on machine speed and load. And second is that it can be interrupted by
signal.

Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-22 09:30:45
Message-ID: 1250933445.1307.62.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane píše v pá 21. 08. 2009 v 18:06 -0400:

> Maybe, but I think we need to understand exactly what happened first.

I try to mine more data from the system to reconstruct what happen.
Unfortunately, default postgresql log configuration does not have
timestamp. The postgresql had no load, system was under load which could
cause some slow system response. ... and ... bingo I'm able to reproduce
it.

I will look on it deeply with dtrace on Monday (I'm leaving now for a
weekend).

There are most important records from yesterdays issues.
Messages:
---------
Aug 20 11:14:54 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 507 (postgres)

svc log:
--------
[ Aug 20 10:53:27 Stopping because service disabled. ]
[ Aug 20 10:53:27 Executing stop method ("/lib/svc/method/postgres_84 stop"). ]
waiting for server to shut down..... done
server stopped
[ Aug 20 10:53:31 Method "stop" exited with status 0. ]
[ Aug 20 10:54:17 Executing start method ("/lib/svc/method/postgres_84 start"). ]
server starting
[ Aug 20 10:54:19 Method "start" exited with status 0. ]
[ Aug 20 11:14:58 Stopping because process dumped core. ]
[ Aug 20 11:16:05 Executing stop method ("/lib/svc/method/postgres_84 stop"). ]
pg_ctl: could not send stop signal (PID: 507): No such process
[ Aug 20 11:16:12 Method "stop" exited with status 0. ]
[ Aug 20 11:17:17 Executing start method ("/lib/svc/method/postgres_84 start"). ]
ld.so.1: sh: fatal: mmap anon failed: Resource temporarily unavailable
[ Aug 20 11:17:28 Method "start" failed due to signal KILL. ]
[ Aug 20 11:18:39 Executing start method ("/lib/svc/method/postgres_84 start"). ]
ld.so.1: sh: fatal: mmap anon failed: Resource temporarily unavailable
[ Aug 20 11:18:50 Method "start" failed due to signal KILL. ]
[ Aug 20 11:19:58 Executing start method ("/lib/svc/method/postgres_84 start"). ]
ld.so.1: sh: fatal: mmap anon failed: Resource temporarily unavailable

postgresql.log:
---------------
LOG: received smart shutdown request
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
LOG: database system was shut down at 2009-08-20 10:53:30 CEST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: could not fork autovacuum worker process: Not enough space
...
TopMemoryContext: 31192 total in 4 blocks; 6056 free (4 chunks); 25136 used
TopTransactionContext: 8192 total in 1 blocks; 7792 free (1 chunks); 400 used
TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
PortalMemory: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 8192 total in 1 blocks; 4936 free (0 chunks); 3256 used
CacheMemoryContext: 274480 total in 18 blocks; 17072 free (2 chunks); 257408 used
<snip rest of memory context dump>
ERROR: out of memory
DETAIL: Failed on request of size 8.
LOG: could not fork autovacuum worker process: Not enough space
...
LOG: could not fork autovacuum worker process: Not enough space
WARNING: worker took too long to start; cancelled
LOG: could not fork autovacuum worker process: Not enough space
...
LOG: could not fork autovacuum worker process: Not enough space
WARNING: worker took too long to start; cancelled
2009-08-21 13:30:48.701 CEST 416 LOG: database system was interrupted; last known up at 2009-08-20 10:54:21 CEST
2009-08-21 13:30:48.963 CEST 416 LOG: database system was not properly shut down; automatic recovery in progress


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-22 13:56:22
Message-ID: 8502.1250949382@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> There are most important records from yesterdays issues.
> Messages:
> ---------
> Aug 20 11:14:54 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 507 (postgres)

Hmm, that seems to confirm the idea that something had run the machine
out of memory/swap space, which would explain the repeated ENOMEM fork
failures. But we're still no closer to understanding how come the
delay in the avlauncher didn't do what it was supposed to.

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 11:47:27
Message-ID: 1251114447.3252.16.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane píše v so 22. 08. 2009 v 09:56 -0400:
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> > There are most important records from yesterdays issues.
> > Messages:
> > ---------
> > Aug 20 11:14:54 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 507 (postgres)
>
> Hmm, that seems to confirm the idea that something had run the machine
> out of memory/swap space, which would explain the repeated ENOMEM fork
> failures. But we're still no closer to understanding how come the
> delay in the avlauncher didn't do what it was supposed to.

I found hungry process which eats up all memory and fortunately it is
not postgres :-).

I run also following dtrace script:

dtrace -n 'syscall::kill:entry / execname=="postgres"/ { printf("%i %
s, %i->%i : %i", timestamp, execname, pid, arg0, arg1); }'

and it show following (little bit modified) output:

<snip>
CPU Timestamp[ns] diff[ms] caller callee sig
0 2750745000052090 899,96 28604 -> 28608 16
3 2750745100280460 100,23 28608 -> 28604 16
1 2750746000144690 899,86 28604 -> 28608 16
3 2750746100380940 100,24 28608 -> 28604 16
2 2750747000135380 899,75 28604 -> 28608 16
3 2750747100171650 100,04 28608 -> 28604 16
0 2750748000101050 899,93 28604 -> 28608 16
3 2750748100331900 100,23 28608 -> 28604 16
1 2750749000148550 899,82 28604 -> 28608 16
3 2750749100386640 100,24 28608 -> 28604 16
2 2750750000095040 899,71 28604 -> 28608 16
3 2750750100127780 100,03 28608 -> 28604 16

We can see there that AVlauncher really wait 100ms, but it is not enough
when system is under stress.

I tested Alvaro's patch and it works, because it does not lead to stack
consumption, but it shows another bug in StartAutovacuumWorker() code.
When fork fails bn structure is freed but
ReleasePostmasterChildSlot() should be called as well. See error:

2009-08-24 11:50:20.360 CEST 3468 FATAL: no free slots in PMChildFlags array

and comment in source code:

/* Out of slots ... should never happen, else postmaster.c messed up */

I think that Alvaro's patch is good and it fix a crash problem. I also
think that AVlauncher could wait little bit more then 100ms. When system
cannot fork, I don't see any reason why hurry to repeat a fork
operation. Maybe 1s is good compromise.

Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 12:36:27
Message-ID: 1251117387.3252.24.camel@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Zdenek Kotala píše v po 24. 08. 2009 v 13:47 +0200:

> I tested Alvaro's patch and it works, because it does not lead to stack
> consumption, but it shows another bug in StartAutovacuumWorker() code.
> When fork fails bn structure is freed but
> ReleasePostmasterChildSlot() should be called as well. See error:
>
> 2009-08-24 11:50:20.360 CEST 3468 FATAL: no free slots in PMChildFlags array
>
> and comment in source code:
>
> /* Out of slots ... should never happen, else postmaster.c messed up */
>

Just a confirmation that Alvaro's patch+ReleasePostmasterChildSlot() fix
solves the problem and PostgreSQL survives well during a memory
shortages.

Zdenek


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 15:51:54
Message-ID: 12548.1251129114@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> ... We can see there that AVlauncher really wait 100ms, but it is not enough
> when system is under stress.

OK, thanks for checking that.

> I think that Alvaro's patch is good and it fix a crash problem. I also
> think that AVlauncher could wait little bit more then 100ms. When system
> cannot fork, I don't see any reason why hurry to repeat a fork
> operation. Maybe 1s is good compromise.

Seems reasonable to me. Alvaro, do you want to take care of applying
these three fixes?

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 15:54:03
Message-ID: 20090824155403.GC5962@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala wrote:

> Just a confirmation that Alvaro's patch+ReleasePostmasterChildSlot() fix
> solves the problem and PostgreSQL survives well during a memory
> shortages.

So this patch would do it.

I think this stuff about postmaster child slots is later than
launcher/worker split. I don't recall seeing it before.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment Content-Type Size
autovac-signal-2.patch text/x-diff 2.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 16:28:21
Message-ID: 14167.1251131301@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Zdenek Kotala wrote:
>> Just a confirmation that Alvaro's patch+ReleasePostmasterChildSlot() fix
>> solves the problem and PostgreSQL survives well during a memory
>> shortages.

> So this patch would do it.

Looks good to me, but I think you should also increase the avlauncher
delay from 100ms -> 1s as Zdenek suggested.

> I think this stuff about postmaster child slots is later than
> launcher/worker split. I don't recall seeing it before.

Yeah, I added that recently to try to detect postmaster children
that exit without cleaning up properly. I seem to have missed this
error case :-(. Actually it looks like fork failure for regular
backends gets it wrong too :-( :-( --- would you please also add one
in the pid<0 path in BackendStartup? I think this logic is in 8.4
too, in which case it'll need to be back-patched.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 17:26:10
Message-ID: 20090824172610.GE5962@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Zdenek Kotala wrote:
> >> Just a confirmation that Alvaro's patch+ReleasePostmasterChildSlot() fix
> >> solves the problem and PostgreSQL survives well during a memory
> >> shortages.
>
> > So this patch would do it.
>
> Looks good to me, but I think you should also increase the avlauncher
> delay from 100ms -> 1s as Zdenek suggested.

Committed with this change.

> > I think this stuff about postmaster child slots is later than
> > launcher/worker split. I don't recall seeing it before.
>
> Yeah, I added that recently to try to detect postmaster children
> that exit without cleaning up properly. I seem to have missed this
> error case :-(. Actually it looks like fork failure for regular
> backends gets it wrong too :-( :-( --- would you please also add one
> in the pid<0 path in BackendStartup? I think this logic is in 8.4
> too, in which case it'll need to be back-patched.

And this one too.

Thanks!

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 18:10:51
Message-ID: 26794.1251137451@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> Yeah, I added that recently to try to detect postmaster children
>> that exit without cleaning up properly. I seem to have missed this
>> error case :-(. Actually it looks like fork failure for regular
>> backends gets it wrong too :-( :-( --- would you please also add one
>> in the pid<0 path in BackendStartup? I think this logic is in 8.4
>> too, in which case it'll need to be back-patched.

> And this one too.

Almost right --- we only want to release for non-dead_end children.
I fixed that part.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 18:47:42
Message-ID: 20090824184741.GF5962@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Tom Lane wrote:
> >> Yeah, I added that recently to try to detect postmaster children
> >> that exit without cleaning up properly. I seem to have missed this
> >> error case :-(. Actually it looks like fork failure for regular
> >> backends gets it wrong too :-( :-( --- would you please also add one
> >> in the pid<0 path in BackendStartup? I think this logic is in 8.4
> >> too, in which case it'll need to be back-patched.
>
> > And this one too.
>
> Almost right --- we only want to release for non-dead_end children.
> I fixed that part.

Oops, sorry about that -- I even noticed it :-(

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support