Re: PostgreSQL crash on Freebsd 7

Lists: pgsql-bugs
From: Michael <michael(at)gameservice(dot)ru>
To: pgsql-bugs(at)postgresql(dot)org
Subject: PostgreSQL crash on Freebsd 7
Date: 2007-10-25 20:09:10
Message-ID: 952387397.20071025230910@gameservice.ru

Hello.
I have problems with Postgres dumping core on FreeBSD 7 (RELENG_7).

Here is the backtrace from gdb postgres postgres.core:

(gdb) bt
#0 0x485dc277 in kill () from /lib/libc.so.7
#1 0x485dc1d6 in raise () from /lib/libc.so.7
#2 0x485dadda in abort () from /lib/libc.so.7
#3 0x0824c075 in errfinish ()
#4 0x0824c8b1 in elog_finish ()
#5 0x081c9184 in s_lock ()
#6 0x081c8d48 in LWLockAcquire ()
#7 0x081c61ec in LockAcquire ()
#8 0x081c4289 in LockRelationOid ()
#9 0x080938fc in relation_open ()
#10 0x08096d5a in index_open ()
#11 0x08096139 in systable_beginscan ()
#12 0x08134f10 in RelationBuildTriggers ()
#13 0x08245d4d in RelationCacheInitializePhase2 ()
#14 0x08256af0 in InitPostgres ()
#15 0x081cfd13 in PostgresMain ()
#16 0x081a90ec in ClosePostmasterPorts ()
#17 0x081a9ea7 in PostmasterMain ()
#18 0x0816912f in main ()

Extract from dmesg:
pid 30622 (postgres), uid 70: exited on signal 6 (core dumped)

Nothing interesting in other logs.

I run FreeBSD 7.0-BETA1 on a Dual-Core AMD Opteron(tm) Processor 2216
(2394.01-MHz 686-class CPU) with the ULE scheduler and
PostgreSQL 8.2.5.

I can't find what triggers this behavior (it started dumping core
after upgrading from FreeBSD 6.2).

Does anyone have a solution for this problem?

Michael


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Michael <michael(at)gameservice(dot)ru>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-10-25 23:40:34
Message-ID: 20071025234034.GN23566@alvh.no-ip.org

Michael wrote:

> Extract from dmesg:
> pid 30622 (postgres), uid 70: exited on signal 6 (core dumped)
>
> Nothing interesting in other logs.
>
> I run FreeBSD 7.0-BETA1 on Dual-Core AMD Opteron(tm) Processor 2216
> (2394.01-MHz 686-class CPU) with ULE scheduler
> PostgreSQL 8.2.5
>
> I can't find what triggers this behavior (it started core dumping
> after upgrading from FreeBSD 6.2)

This probably means that the spinlock support is not up to speed for
your platform. It is strange though -- I think I've seen other people
using FreeBSD 7. I don't see any on the buildfarm:
http://buildfarm.postgresql.org/cgi-bin/show_status.pl

It probably means you'll need to do some hacking to make it work again.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael <michael(at)gameservice(dot)ru>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-10-26 00:15:54
Message-ID: 29394.1193357754@sss.pgh.pa.us

Michael <michael(at)gameservice(dot)ru> writes:
> Here is backtrace from gdb postgres postgres.core:

> (gdb) bt
> #0 0x485dc277 in kill () from /lib/libc.so.7
> #1 0x485dc1d6 in raise () from /lib/libc.so.7
> #2 0x485dadda in abort () from /lib/libc.so.7
> #3 0x0824c075 in errfinish ()
> #4 0x0824c8b1 in elog_finish ()
> #5 0x081c9184 in s_lock ()
> #6 0x081c8d48 in LWLockAcquire ()
> #7 0x081c61ec in LockAcquire ()

Apparently s_lock_stuck ... though you might want to look at
postmaster's stderr output to confirm that.

> I run FreeBSD 7.0-BETA1 on Dual-Core AMD Opteron(tm) Processor 2216
> (2394.01-MHz 686-class CPU) with ULE scheduler
> PostgreSQL 8.2.5

> I can't find what triggers this behavior (it started core dumping
> after upgrading from FreeBSD 6.2)

Did you recompile Postgres? Maybe you need to. I dunno what the
differences are between 6.2 and 7 ...

regards, tom lane


From: Michael <michael(at)gameservice(dot)ru>
To: Tom Lane <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-10-26 00:45:05
Message-ID: 1172522311.20071026034505@gameservice.ru

TL> Apparently s_lock_stuck ... though you might want to look at
TL> postmaster's stderr output to confirm that.
Yes, you are right:
2007-10-25 23:37:12 MSD (u=picred,db=picred)PANIC: stuck spinlock (0x4880c3b0) detected at lwlock.c:379
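
To confirm the same thing across a whole log file, a minimal Python sketch along these lines could pull out every such PANIC. The function name and the regular expression are assumptions based only on the line format shown above; a different PostgreSQL version or log_line_prefix setting may need a different pattern.

```python
import re

# Pattern for the PANIC line format shown in this thread (PostgreSQL 8.2.5).
# This is an assumption about the format, not an official log specification.
STUCK_SPINLOCK = re.compile(
    r"PANIC:\s+stuck spinlock \((0x[0-9a-fA-F]+)\) detected at ([\w.]+):(\d+)"
)


def find_stuck_spinlocks(log_lines):
    """Return (address, source_file, line_number) for each stuck-spinlock PANIC."""
    hits = []
    for line in log_lines:
        match = STUCK_SPINLOCK.search(line)
        if match:
            address, source_file, line_number = match.groups()
            hits.append((address, source_file, int(line_number)))
    return hits


# The log line quoted in this message, split only to keep the source narrow.
sample = [
    "2007-10-25 23:37:12 MSD (u=picred,db=picred)PANIC: stuck spinlock "
    "(0x4880c3b0) detected at lwlock.c:379",
]
print(find_stuck_spinlocks(sample))
```

Grouping the results by source file and line number would show whether the failures always come from the same call site.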

TL> Did you recompile Postgres? Maybe you need to. I dunno what the
TL> differences are between 6.2 and 7 ...
Yes.

Michael


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael <michael(at)gameservice(dot)ru>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-10-26 01:11:34
Message-ID: 29985.1193361094@sss.pgh.pa.us

Michael <michael(at)gameservice(dot)ru> writes:
> 2007-10-25 23:37:12 MSD (u=picred,db=picred)PANIC: stuck spinlock (0x4880c3b0) detected at lwlock.c:379

You said this was an Opteron? Why is it printing only 32-bit addresses?

> TL> Did you recompile Postgres? Maybe you need to. I dunno what the
> TL> differences are between 6.2 and 7 ...
> Yes.

I'm thinking the rebuild broke somehow ... on the strength of
the above, maybe it's partially 32 and partially 64 bits. This could
have been pilot error on your part, or maybe FBSD7 wants some
new/different compile or link switches that our configuration code
doesn't know about.

Did you rebuild in a pre-existing PG build tree? If so, that might
have resulted in a partial rebuild that could create such a problem.
I'd suggest "make distclean", reconfigure, rebuild before you waste
any further human effort on the problem...

A slightly different thought is that you're likely using a beta gcc
release that has maybe got bugs. If decreasing the -O level helps,
I'd suspect that.

regards, tom lane


From: Michael <michael(at)gameservice(dot)ru>
To: Tom Lane <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-10-26 01:29:21
Message-ID: 1822533920.20071026042921@gameservice.ru

TL> You said this was an Opteron? Why is it printing only 32-bit addresses?
Yes, I'm using it in 32-bit mode.

TL> Did you rebuild in a pre-existing PG build tree? If so, that might
TL> have resulted in a partial rebuild that could create such a problem.
TL> I'd suggest "make distclean", reconfigure, rebuild before you waste
TL> any further human effort on the problem...
I ran portupgrade -fa, which rebuilds all ports. I'll try to
recompile manually.

TL> A slightly different thought is that you're likely using a beta gcc
TL> release that has maybe got bugs. If decreasing the -O level helps,
TL> I'd suspect that.
gcc (GCC) 4.2.1 20070719 [FreeBSD]

I will try to compile without optimization.

Michael


From: Michael <michael(at)gameservice(dot)ru>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-11-01 18:50:01
Message-ID: 43877923.20071101205001@gameservice.ru

I tried a clean rebuild as Tom Lane suggested, but this didn't help.
Does anyone offer commercial support for solving this problem?

Michael


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael <michael(at)gameservice(dot)ru>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-11-01 21:08:30
Message-ID: 15294.1193951310@sss.pgh.pa.us

Michael <michael(at)gameservice(dot)ru> writes:
> M> (gdb) bt
> M> #0 0x485dc277 in kill () from /lib/libc.so.7
> M> #1 0x485dc1d6 in raise () from /lib/libc.so.7
> M> #2 0x485dadda in abort () from /lib/libc.so.7
> M> #3 0x0824c075 in errfinish ()
> M> #4 0x0824c8b1 in elog_finish ()
> M> #5 0x081c9184 in s_lock ()
> M> #6 0x081c8d48 in LWLockAcquire ()
> M> #7 0x081c61ec in LockAcquire ()
> M> #8 0x081c4289 in LockRelationOid ()
> M> #9 0x080938fc in relation_open ()
> M> #10 0x08096d5a in index_open ()
> M> #11 0x08096139 in systable_beginscan ()
> M> #12 0x08134f10 in RelationBuildTriggers ()
> M> #13 0x08245d4d in RelationCacheInitializePhase2 ()
> M> #14 0x08256af0 in InitPostgres ()
> M> #15 0x081cfd13 in PostgresMain ()
> M> #16 0x081a90ec in ClosePostmasterPorts ()
> M> #17 0x081a9ea7 in PostmasterMain ()
> M> #18 0x0816912f in main ()

On closer look ... there is something awfully strange about this
backtrace. If it's gotten as far as RelationBuildTriggers, then this is
not the first spinlock acquisition in the life of this backend, nor the
first LWLockAcquire, nor even the first time to re-acquire a previously
released LWLock. Not to mention that the startup process must've
successfully done such things too. That seems to eliminate all of the
simple theories about how spinlocks might be broken.

How repeatable is this --- does it happen on every connection attempt,
or only sometimes? Can you start and stop the postmaster without
any problems being logged?

regards, tom lane


From: Michael <michael(at)gameservice(dot)ru>
To: Tom Lane <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-11-01 21:23:54
Message-ID: 1983190064.20071101232354@gameservice.ru

TL> How repeatable is this --- does it happen on every connection attempt,
TL> or only sometimes? Can you start and stop the postmaster without
TL> any problems being logged?
Only sometimes, 1-4 times per day under high load. The postmaster starts
and stops without problems.

The backtraces differ a bit from time to time; here is the latest:

(gdb) bt
#0 0x485d8277 in kill () from /lib/libc.so.7
#1 0x485d81d6 in raise () from /lib/libc.so.7
#2 0x485d6dda in abort () from /lib/libc.so.7
#3 0x0824694e in errfinish ()
#4 0x08247a43 in elog_finish ()
#5 0x081c565e in s_lock ()
#6 0x081c522e in LWLockAcquire ()
#7 0x081c15ff in LockRelease ()
#8 0x081c03d3 in UnlockRelationId ()
#9 0x08096824 in index_close ()
#10 0x08095afe in systable_endscan ()
#11 0x08131a88 in RelationBuildTriggers ()
#12 0x08241bc8 in RelationCacheInitializePhase2 ()
#13 0x08251afd in InitPostgres ()
#14 0x081cc789 in PostgresMain ()
#15 0x081a5270 in ClosePostmasterPorts ()
#16 0x081a6741 in PostmasterMain ()
#17 0x081650f2 in main ()

Michael


From: "Larry Rosenman" <ler(at)lerctr(dot)org>
To: "'Michael'" <michael(at)gameservice(dot)ru>, "'Tom Lane'" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-11-01 22:03:25
Message-ID: 02d101c81cd3$0c911110$25b33330$@org

Do you have a repeatable test case?

I have a FreeBSD 7/amd64 box on which I can do the following:

1) run make test
2) make it available to a developer.

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler(at)lerctr(dot)org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael <michael(at)gameservice(dot)ru>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PostgreSQL crash on Freebsd 7
Date: 2007-11-01 22:32:57
Message-ID: 16553.1193956377@sss.pgh.pa.us

Michael <michael(at)gameservice(dot)ru> writes:
> TL> How repeatable is this --- does it happen on every connection attempt,
> TL> or only sometimes? Can you start and stop the postmaster without
> TL> any problems being logged?

> Only sometimes, 1-4 times per day under high load. Postmaster starts
> and stops without problems.

You should have been clear about that to start with, because it
changes the likely nature of the problem entirely.

> Backtraces are a bit different from time to time, here is last:

Hmm, are they always within InitPostgres? That would be a bit odd,
because I can't see any reason why a recently-started process would
be more prone to a transient spinlock problem than any other process.
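
One rough way to answer that question over a pile of core files is to extract the function names from each gdb backtrace and check whether InitPostgres is on the stack. The Python sketch below assumes the frame format gdb printed in this thread; the helper names are made up for illustration, and the sample is an abbreviated copy of the latest backtrace.

```python
import re

# Matches frames as gdb printed them in this thread, for example
# "#5 0x081c565e in s_lock ()". Frames with arguments or source
# locations would need a looser pattern.
FRAME = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\w+) \(\)")


def frame_functions(backtrace_text):
    """Return the function names of a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def crashed_during_startup(backtrace_text):
    """True if s_lock gave up while InitPostgres was still on the stack."""
    names = frame_functions(backtrace_text)
    return "s_lock" in names and "InitPostgres" in names


# Abbreviated frames from the latest backtrace posted in this thread.
latest = """\
#2 0x485d6dda in abort () from /lib/libc.so.7
#5 0x081c565e in s_lock ()
#6 0x081c522e in LWLockAcquire ()
#7 0x081c15ff in LockRelease ()
#13 0x08251afd in InitPostgres ()
"""
print(frame_functions(latest))
print(crashed_during_startup(latest))
```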

What seems like a reasonable bet at this point is that the FBSD7
kernel's scheduler has been changed in a way that makes it possible for
it to sometimes not schedule a process for a very long time (order of a
couple minutes). If that happened while the process was holding a
spinlock then other processes waiting to get the spinlock would fail
like this. Since we don't hold spinlocks long --- the maximum hold time
is supposed to be no more than a couple dozen instructions --- the
probability of this would be low. But under sufficient load maybe you'd
see it a few times a day. (What is "high load" to you, anyway?)
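
The failure mode in that theory can be illustrated with a toy model in Python. This is only a sketch of the reasoning, not PostgreSQL's s_lock code: the function name, the tick units, and the spin limit are all invented for illustration.

```python
def waiter_outcome(hold_ticks, stall_ticks, max_spins):
    """Model one waiter spinning on a lock that another process holds.

    hold_ticks:  ticks the holder needs while it is actually running
    stall_ticks: ticks the holder spends descheduled while holding the lock
    max_spins:   polls the waiter makes before declaring the lock stuck

    All three values are illustrative units, not PostgreSQL's real limits.
    """
    release_tick = hold_ticks + stall_ticks
    for tick in range(1, max_spins + 1):
        if tick >= release_tick:
            return "acquired"
    return "stuck spinlock"


# Normal case: the holder runs a few instructions and releases the lock.
print(waiter_outcome(hold_ticks=3, stall_ticks=0, max_spins=1000))

# Holder descheduled for longer than the waiter is willing to spin.
print(waiter_outcome(hold_ticks=3, stall_ticks=5000, max_spins=1000))
```

The waiter itself does nothing wrong in the second case; it gives up only because the holder never got to run again within the limit.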

A different theory, given that you said you're using a dual-core
machine, is that we're seeing the effects of the two CPUs' caches
somehow getting out of sync. I could believe that a kernel problem
could cause that; wrong settings in the hardware page tables, for
instance. Dunno if you can afford the performance hit, but it would be
interesting to run for a while with only one CPU active and see if the
problem still occurs.

Anyway, I think you probably need to get some FBSD kernel hackers
involved, because this sounds to me like it's their bug in one way
or another. Particularly since I now notice you mentioned that FBSD7
is only at beta1 stage ...

regards, tom lane