Re: Chasing "signal 11" issues

Lists: pgsql-general
From: "Tass Chapman" <tasseh(dot)postgres(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Chasing "signal 11" issues
Date: 2006-03-30 13:02:08
Message-ID: ec9575e80603300502l573df7afqb33a223cdd1886ee@mail.gmail.com

Since Monday I have been seeing "terminated by signal 11" messages in my
7.4.6 + Slon 1.0.5 system, but only on the master.

I've done a dumpall, initdb and restore, which reduced the frequency, but I
still get them 6-8 times a day.

After turning up logging it seemed to die when calling a very small table (2
rows, 4 columns, 8 char text strings), but manually selecting caused no
issues, so I then took a hit, shut down the system and swapped out the RAM
(from earlier list suggestions).

This seemed to work until 7 hours later, when the problem reappeared, at
a higher frequency too.

It is ONLY occurring on the master, not on any of the leaf (replicated)
nodes, and seems to be triggered by a few different systems connecting (so
no common code base).

Suggestions/help?


From: Douglas McNaught <doug(at)mcnaught(dot)org>
To: "Tass Chapman" <tasseh(dot)postgres(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Chasing "signal 11" issues
Date: 2006-03-30 13:16:39
Message-ID: 871wwkkr7s.fsf@suzuka.mcnaught.org

"Tass Chapman" <tasseh(dot)postgres(at)gmail(dot)com> writes:

> Since Monday I have been seeing "terminated by signal 11" messages
> in my 7.4.6 + Slon 1.0.5 system, but only on the master.

This kind of thing is almost always a hardware problem. 'memtest86'
is probably a good first step, and check whether any of your cooling
fans have stopped working.

-Doug


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Douglas McNaught <doug(at)mcnaught(dot)org>
Cc: "Tass Chapman" <tasseh(dot)postgres(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Chasing "signal 11" issues
Date: 2006-03-30 14:34:46
Message-ID: 14445.1143729286@sss.pgh.pa.us

Douglas McNaught <doug(at)mcnaught(dot)org> writes:
> "Tass Chapman" <tasseh(dot)postgres(at)gmail(dot)com> writes:
>> Since Monday I have been seeing "terminated by signal 11" messages
>> in my 7.4.6 + Slon 1.0.5 system, but only on the master.

> This kind of thing is almost always a hardware problem. 'memtest86'
> is probably a good first step, and check whether any of your cooling
> fans have stopped working.

If nothing about the software or the workload has changed recently,
I'd agree with Doug about what to look at. Otherwise ... 7.4.6 is
pretty old and we have fixed a number of problems since then. Even
if you don't have the energy to migrate to 8.* now, there's very little
excuse for not dropping in the latest 7.4 subrelease (7.4.12 I think).

regards, tom lane


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Tass Chapman <tasseh(dot)postgres(at)gmail(dot)com>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Chasing "signal 11" issues
Date: 2006-03-30 15:35:27
Message-ID: 1143732927.26940.11.camel@state.g2switchworks.com

On Thu, 2006-03-30 at 07:02, Tass Chapman wrote:
> Since Monday I have been seeing "terminated by signal 11" messages in
> my 7.4.6 + Slon 1.0.5 system, but only on the master.
>
> I've done a dumpall, initdb and restore, which reduced the frequency,
> but I still get them 6-8 times a day.
>
> After turning up logging it seemed to die when calling a very small
> table (2 rows, 4 columns, 8 char text strings), but manually selecting
> caused no issues, so I then took a hit, shut down the system and
> swapped out the RAM (from earlier list suggestions).
>
> This seemed to work until 7 hours later, when the problem
> reappeared, at a higher frequency too.
>
> It is ONLY occurring on the master, not on any of the leaf (replicated)
> nodes, and seems to be triggered by a few different systems connecting
> (so no common code base).

As mentioned earlier, this tends to be caused by hardware. Note that it
can be caused by buggy software or corrupted binaries as well.

It is possible that the binaries you're running have become corrupted
in some small way. You might want to run md5sum across all the binaries
(postgresql, slony, etc...) on both the bad and the good machine and
compare the results.
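
For anyone who would rather script the comparison than eyeball md5sum
output, here is a rough sketch in Python. It is only an illustration, not
something from the PostgreSQL or Slony tools: the function names
(md5_of_file, checksum_tree, diff_manifests) are made up for this example,
and it assumes the binaries sit in the same relative layout on both
machines.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map every regular file under root (by relative path) to its MD5."""
    root = Path(root)
    return {
        str(p.relative_to(root)): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(good, bad):
    """Compare two {relative path: digest} maps from the two machines.

    Returns three sorted lists: files whose digests differ, files found
    only on the good machine, and files found only on the bad machine.
    """
    changed = sorted(
        name for name in good.keys() & bad.keys() if good[name] != bad[name]
    )
    only_good = sorted(good.keys() - bad.keys())
    only_bad = sorted(bad.keys() - good.keys())
    return changed, only_good, only_bad
```

The idea would be to run checksum_tree on each machine against wherever
the binaries are installed, save the resulting dictionary (as JSON, say),
copy one file across and feed both to diff_manifests. Anything in the
first list is a binary that differs between the good and bad box. Keep in
mind that matching checksums only rule out corruption on disk as read at
that moment; flaky RAM or I/O can still return different bytes on the next
read.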

If the problem is in the hardware, and I think it is, it could be
anywhere: bad drive, raid controller, raid cache, scsi interface, CPU,
memory, and so on. So memtest86 might find the problem if it's mainboard
/ CPU / memory, but if it's an I/O problem, it won't.

The most common failures are mechanical in nature. I've had machines
that were crashing, and all I had to do was reseat the CPU or memory or
heat sink and suddenly it was running fine.

However, you need to switch over to your failover machine immediately.
Running your main database on what is most likely faulty hardware is a
recipe for corruption of your database.