Postgresql 8.4.1 segfault, backtrace

Lists: pgsql-bugs
From: Richard Neill <rn214(at)cam(dot)ac(dot)uk>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Postgresql 8.4.1 segfault, backtrace
Date: 2009-09-24 06:13:55
Message-ID: 4ABB0E23.1010704@cam.ac.uk

Dear All,

I've just upgraded from 8.4.0 to 8.4.1 because of a segfault in 8.4, and
we've found that this is still happening repeatedly in 8.4.1. We're in a
bit of a bind, as this is a production system, and we get segfaults
every few hours.

[It's a testament to how good the postgres crash recovery is that, with
a reasonably small value of checkpoint_segments = 4, recovery happens in
30 seconds, and the warehouse systems seem to continue OK].

I'm running 8.4.1, built by me from the source package provided for
Ubuntu Karmic, on a 64-bit server running Ubuntu Jaunty.

I'm not sufficiently expert to debug it very far, but I wonder whether
the following info from GDB would help one of the hackers here (I've
trimmed out the uninteresting bits):

------------
$ gdb /usr/lib/postgresql/8.4/bin/postgres core.200909030901
GNU gdb 6.8-debian

This GDB was configured as "x86_64-linux-gnu"...

Core was generated by `postgres: fensys fswcs [local] startup'.
Program terminated with signal 11, Segmentation fault.
[New process 14965]
#0 RelationCacheInitializePhase2 () at relcache.c:2654
2654 if (relation->rd_rel->relhasrules && relation->rd_rules == NULL)
(gdb) bt
#0 RelationCacheInitializePhase2 () at relcache.c:2654
#1 0x00007f61355a1021 in InitPostgres (in_dbname=0x7f613788c610 "fswcs", dboid=0, username=0x7f6137889450 "fensys", out_dbname=0x0) at postinit.c:576
#2 0x00007f61354dbcc5 in PostgresMain (argc=4, argv=0x7f6137889480, username=0x7f6137889450 "fensys") at postgres.c:3334
#3 0x00007f61354aefdd in ServerLoop () at postmaster.c:3447
#4 0x00007f61354afecc in PostmasterMain (argc=5, argv=0x7f6137885140) at postmaster.c:1040
#5 0x00007f61354568ce in main (argc=5, argv=0x7f6137885140) at main.c:188
(gdb) quit
-------------

A few more bits of info:

The backtrace points to line 2654 in relcache.c, in
RelationCacheInitializePhase2()

There is a NULL dereference of "relation"

=> needNewCacheFile = false
criticalRelcachesBuilt = true

=> nothing is happening before it enters the failure code block.

I can give you a core dump if anyone would like to see it, but it's 405
MB after bzipping.

One last observation: a dump and restore of the DB seems to prevent it
crashing for about a day.

Thank you for your help,

Richard


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Neill <rn214(at)cam(dot)ac(dot)uk>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Postgresql 8.4.1 segfault, backtrace
Date: 2009-09-24 15:16:06
Message-ID: 23820.1253805366@sss.pgh.pa.us
Lists: pgsql-bugs

Richard Neill <rn214(at)cam(dot)ac(dot)uk> writes:
> I've just upgraded from 8.4.0 to 8.4.1 because of a segfault in 8.4, and
> we've found that this is still happening repeatedly in 8.4.1.

Oh dear. I just got an off-list report that seems to point to the same
kind of thing.

> The backtrace points to line 2654 in relcache.c, in
> RelationCacheInitializePhase2()

> There is a NULL dereference of "relation"

> => needNewCacheFile = false
> criticalRelcachesBuilt = true

> => nothing is happening before it enters the failure code block.

<spock>Fascinating.</spock>

I think this must mean that corrupt data is being read from the relcache
init file. The reason a restart fixes it is probably that restart
forcibly removes the old init file, which is good for recovery but not
so good for finding out what's wrong. Could you modify
RelationCacheInitFileRemove (at the bottom of relcache.c) to rename the
file someplace else instead of deleting it? And then send me a copy
of the bad file once you have one?
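
The rename-instead-of-delete idea could be sketched roughly as below. This
is a standalone C helper, not PostgreSQL code: the function name
preserve_instead_of_remove and the ".saved.<unix-time>" suffix are
assumptions made for illustration, and the actual body of
RelationCacheInitFileRemove should be checked in the 8.4.1 source before
adapting anything like this to it.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (illustration only, not PostgreSQL code):
 * instead of deleting a file, move it aside as "<path>.saved.<unix-time>"
 * so that a possibly corrupt copy can be examined later.
 *
 * Returns 0 if the file was renamed or did not exist, -1 on other errors.
 * When a file was actually renamed, its new name is left in saved_path;
 * otherwise saved_path is set to the empty string.
 *
 * Caveat: two calls within the same second produce the same target name,
 * and rename() will then replace the earlier saved copy.
 */
static int
preserve_instead_of_remove(const char *path, char *saved_path, size_t saved_len)
{
    int n;

    assert(path != NULL && saved_path != NULL && saved_len > 0);

    n = snprintf(saved_path, saved_len, "%s.saved.%ld",
                 path, (long) time(NULL));
    if (n < 0 || (size_t) n >= saved_len)
    {
        saved_path[0] = '\0';
        return -1;              /* target name would not fit */
    }

    if (rename(path, saved_path) != 0)
    {
        int save_errno = errno;

        saved_path[0] = '\0';
        /* A missing file is treated as success: there is nothing to keep. */
        return (save_errno == ENOENT) ? 0 : -1;
    }

    return 0;
}
```

The saved copy stays in the same directory as the original, so it would
need to be moved or deleted by hand once it has been collected.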

> I can give you a core dump if anyone would like to see it, but it's 405
> MB after bzipping.

Not going to help anyone else anyway, since it's uninterpretable without
a duplicate system. (If you have a spare machine with the same OS and
the same postgres executables, maybe you could put the core file on that
and let me ssh in to have a look?)

> One last observation: a dump and restore of the DB seems to prevent it
> crashing for about a day.

Do you have any maintenance operations that touch the system catalogs
(like maybe a forced REINDEX)? Can you correlate the crashes with any
activity of that sort?

BTW, the other reporter claimed that the problem went away after
building with asserts+debug. I'm not sure I believe that, especially
seeing that you evidently have debug on. But if you don't have asserts
enabled, please rebuild with them and see if that changes anything.

regards, tom lane