My investigations of the postmaster Bus error

From: Martin Pitt <martin(at)piware(dot)de>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: My investigations of the postmaster Bus error
Date: 2005-10-11 19:13:15
Message-ID: 20051011191315.GB11868@piware.de
Lists: pgsql-bugs pgsql-patches

Hi PostgreSQL developers!

There have already been some reports about the mysterious Bus error
that postmaster dies with on some architectures. Since that bites
pretty hard, I did some investigations and tests on various
architectures with various configurations.

As background, Debian currently builds with gcc 4.0.2 by default, and
I use the latest 7.4.9 and 8.0.4 PostgreSQL versions. The default is
to build with -O2.

Here are the results:

* On i386, PowerPC, AMD 64, S/390, arm, and Alpha, all versions work
fine with all tested compiler versions (gcc 3.3.3 and 4.0.2).

* On IA 64, HP PARISC, and sparc postmaster 7.4 and 8.0 fail with a
bus error when run from initdb. It works fine as soon as I

- build with gcc 3.3 or
- build with -O0 or
- run postmaster through initdb under gdb (grumpf) or
- run postmaster through initdb under strace or
- run postmaster directly (not through initdb).

Yay Heisenbugs. :-/

Also, at least 8.1 on sparc works well with gcc 4.0 and -O2.

* And then there is MIPS, which really sucks. It constantly crashes
in all configurations I tried it with:

8.0 with gcc-4.0 -O2
8.0 with gcc-4.0 -O0
8.0 with gcc-3.3 -O2
8.0 with gcc-3.3 -O2 and --disable-spinlocks
7.4 with gcc-4.0 -O2 original without any patches
7.4 with gcc-3.3 -O2 with recent MIPS spinlock patch

This also produces a usable backtrace:

Starting program: /home/mpitt/8.0/postgresql-8.0-8.0.3/debian/tmp/usr/lib/postgresql/8.0/bin/postmaster

Program received signal SIGBUS, Bus error.
0x006e4f80 in InitializeGUCOptions () at guc.c:2360
2360 *conf->variable = conf->reset_val;
(gdb) bt
#0 0x006e4f80 in InitializeGUCOptions () at guc.c:2360
#1 0x005c7f68 in PostmasterMain (argc=1, argv=0x100539e0) at postmaster.c:439
#2 0x0056f874 in main (argc=1, argv=0x100539e0) at main.c:268

Some weeks ago I tracked down the particular variable it fails on
(some float variable; unfortunately I forgot the name, but if it is
important, I can redo the research), but I did not find any
datatype mismatch or similar obvious things.

Does anybody have an idea about these bus errors? Also, if somebody
wants to track down the MIPS bug: I can offer temporary ssh access to
a Debian sid with all required build dependencies, gdb, and the like
for debugging.

Thanks and have a nice day!

Martin

--
Martin Pitt http://www.piware.de
Ubuntu Developer http://www.ubuntu.com
Debian Developer http://www.debian.org

In a world without walls and fences, who needs Windows and Gates?


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: My investigations of the postmaster Bus error
Date: 2005-10-11 23:10:29
Message-ID: 20051011231029.GI23883@pervasive.com
Lists: pgsql-bugs pgsql-patches

gerbil started failing with bus errors some time ago. We were finally
able to 'fix it' by clearing out the CVS checkout, but the first
failure could have been legitimate. See
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=gerbil&dt=2005-08-26%2009:18:41

Hope this helps...

--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Martin Pitt <martin(at)piware(dot)de>
Cc: Bugs <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] My investigations of the postmaster Bus error
Date: 2005-12-22 13:03:49
Message-ID: 20051222130349.GA20830@surnet.cl
Lists: pgsql-bugs pgsql-patches

Hey Martin,

I've been playing with the MIPS machine a little and still haven't found
any _obvious_ cause for the problem. However I suspect that it may be
related to unaligned memory access, which _I think_ results in a SIGBUS
on MIPS. I haven't found any documentation on MIPS that would confirm
this however. I'm not sure exactly how this would be worked around; it
occurs to me that we'd have to change config_real to look like

struct config_real
{
    union {
        struct config_generic gen;
        double dummy;    /* forces double alignment of the header */
    } field1;

    /* these fields must be set correctly in initial value: */
    /* (all but reset_val are constants) */
    double *variable;
    ...
};

though I'm not sure and I haven't tested it. (Of course a working patch
needs to change a few more places.) I'll do some more experiments and
I'll let you know.
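
To illustrate the idea, here is a minimal C11 sketch of how placing a
struct in a union with a double raises its alignment. The names
generic_stub, real_plain, real_padded and header_alignment are
hypothetical stand-ins, not the real guc.c definitions, and the sketch
only concerns the alignment of the config struct itself, not of the
double that "variable" points to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct config_generic (not the real one). */
struct generic_stub {
    const char *name;
    int         flags;
};

/* Plain layout: alignment is whatever the members happen to require. */
struct real_plain {
    struct generic_stub gen;
    double             *variable;
};

/* Union layout: the double member forces at least double alignment
 * on the header, and therefore on the whole struct. */
struct real_padded {
    union {
        struct generic_stub gen;
        double              dummy;
    } field1;
    double *variable;
};

/* Checked at compile time: the padded struct is aligned at least as
 * strictly as a double on the target being compiled for. */
static_assert(_Alignof(struct real_padded) >= _Alignof(double),
              "union with double should force double alignment");

/* Report the alignment of the padded struct, for comparison at run time. */
size_t header_alignment(void)
{
    return _Alignof(struct real_padded);
}
```

Comparing header_alignment() with _Alignof(struct real_plain) on the
affected machine would show whether the union changes anything there.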

mpitt(at)reset:/tmp/pgsql8.0.4$ gdb bin/postgres
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "mips-linux"...Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) set args -boot
(gdb) run
Starting program: /tmp/pgsql8.0.4/bin/postgres -boot

Program received signal SIGBUS, Bus error.
0x00818c38 in InitializeGUCOptions () at guc.c:2360
2360 *conf->variable = conf->reset_val;
(gdb) bt
#0 0x00818c38 in InitializeGUCOptions () at guc.c:2360
#1 0x004a8fc0 in BootstrapMain (argc=2, argv=0x10053998) at bootstrap.c:244
#2 0x005f4dc4 in main (argc=2, argv=0x10053998) at main.c:296
(gdb) print *conf
$1 = {gen = {name = 0x8c4484 "geqo_selection_bias", context = PGC_USERSET, group = QUERY_TUNING_GEQO,
short_desc = 0x8c4498 "GEQO: selective pressure within the population.", long_desc = 0x0, flags = 0, vartype = PGC_REAL,
status = 0, reset_source = PGC_S_DEFAULT, tentative_source = PGC_S_DEFAULT, source = PGC_S_DEFAULT, stack = 0x0},
variable = 0x100136d2, reset_val = 2, min = 1.5, max = 2, assign_hook = 0, show_hook = 0, tentative_val = 0}
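
The "variable" pointer printed above is 0x100136d2. One quick way to
check such an address against an alignment requirement is to mask off
its low bits, as in the sketch below. The helper name misalignment is
hypothetical, and the assumption that a double needs 4- or 8-byte
alignment on this MIPS target is not confirmed anywhere in this thread.

```c
#include <stdint.h>

/* Return how far addr is from the previous multiple of align
 * (0 means aligned). align must be a power of two. */
unsigned misalignment(uintptr_t addr, unsigned align)
{
    return (unsigned)(addr & (uintptr_t)(align - 1));
}
```

For 0x100136d2 this gives a remainder of 2 for both 4 and 8, which
would be consistent with an unaligned store, though it does not explain
where such an address would come from.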

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La grandeza es una experiencia transitoria. Nunca es consistente.
Depende en gran parte de la imaginación humana creadora de mitos"
(Irulan)


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Martin Pitt <martin(at)piware(dot)de>, Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: My investigations of the postmaster Bus error
Date: 2005-12-22 13:56:43
Message-ID: 20051222135643.GA21448@surnet.cl
Lists: pgsql-bugs pgsql-patches

[Sorry for copying -patches in my last email, I actually meant to send
it to pgsql-bugs]

Alvaro Herrera wrote:

> I've been playing with the MIPS machine a little and still haven't found
> any _obvious_ cause for the problem. However I suspect that it may be
> related to unaligned memory access, which _I think_ results in a SIGBUS
> on MIPS.

However, this may turn out to be a red herring, because the variables
are allocated in the data segment and not by malloc, so I think it's
pretty hard to believe there's any unaligned access. A small program
that simulates what Postgres is doing here is attached, and it doesn't
fail with SIGBUS, which is rather what I'd expect. There may be
something different in the way Postgres does things, but I haven't been
able to find what. Suggestions welcome.
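
The attached sigbus.c is not reproduced in this archive. Purely as an
illustration of the kind of test described, and not the actual
attachment, a sketch might look like the following: a statically
allocated table whose entries point at static doubles, initialized in a
loop the way line 2360 of guc.c does. All names ending in _stub, plus
real_table, init_real_options and get_selection_bias, are hypothetical.

```c
#include <stddef.h>

/* Hypothetical, cut-down versions of the GUC structs. */
struct generic_stub {
    const char *name;
    int         flags;
};

struct config_real_stub {
    struct generic_stub gen;
    double             *variable;   /* where the setting lives */
    double              reset_val;  /* default to copy into it */
};

/* Target variables live in the data segment, not on the heap. */
static double geqo_selection_bias_stub;
static double another_real_stub;

/* Table terminated by an entry with a NULL name. */
static struct config_real_stub real_table[] = {
    { { "geqo_selection_bias", 0 }, &geqo_selection_bias_stub, 2.0 },
    { { "another_real",        0 }, &another_real_stub,        0.5 },
    { { NULL,                  0 }, NULL,                      0.0 }
};

/* Copy each reset_val through its variable pointer, mirroring
 * "*conf->variable = conf->reset_val;". Returns the entry count. */
int init_real_options(void)
{
    int n = 0;
    for (struct config_real_stub *conf = real_table;
         conf->gen.name != NULL; conf++) {
        *conf->variable = conf->reset_val;
        n++;
    }
    return n;
}

/* Accessor so a caller can confirm the store took effect. */
double get_selection_bias(void)
{
    return geqo_selection_bias_stub;
}
```

Calling init_real_options() once from a small main() and running it
under gdb would be enough to see whether the store faults.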

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Attachment Content-Type Size
sigbus.c text/x-csrc 920 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Martin Pitt <martin(at)piware(dot)de>, Bugs <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] My investigations of the postmaster Bus error
Date: 2005-12-22 17:07:08
Message-ID: 23734.1135271228@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-patches

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> I've been playing with the MIPS machine a little and still haven't found
> any _obvious_ cause for the problem. However I suspect that it may be
> related to unaligned memory access, which _I think_ results in a SIGBUS
> on MIPS. I haven't found any documentation on MIPS that would confirm
> this however. I'm not sure exactly how would this by worked around; it
> occurs to me that we'd have to change config_real to look like

I don't think so --- to believe that the GUC data structures aren't
adequately aligned, you'd have to explain why PG doesn't crash on other
architectures that require 8-byte alignment of doubles, eg HPPA.

regards, tom lane