Re: 9.1.1 hot standby startup gets sigbus

Lists: pgsql-bugs
From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: 9.1.1 hot standby startup gets sigbus
Date: 2011-12-01 02:10:46
Message-ID: CAK3UJRG_WncD1jxh+tA9=UXA2oB7Zsspa6bTt6JzDDM7bFs+1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Hi all,

I have a new 9.1.1 hot standby machine (VM). It consistently starts
up, goes through recovery through these same WAL segments, and then
gets a "Bus error":

2011-11-30 20:57:37 EST LOG: database system was interrupted while in
recovery at log time 2011-11-30 09:37:11 EST
2011-11-30 20:57:37 EST HINT: If this has occurred more than once
some data might be corrupted and you might need to choose an earlier
recovery target.
2011-11-30 20:57:37 EST LOG: entering standby mode, pausing.
2011-11-30 20:57:57 EST LOG: restored log file
"000000010000005D00000045" from archive
2011-11-30 20:57:58 EST LOG: redo starts at 5D/4560B61C
2011-11-30 20:57:59 EST LOG: restored log file
"000000010000005D00000046" from archive
2011-11-30 20:58:00 EST LOG: restored log file
"000000010000005D00000047" from archive
2011-11-30 20:58:03 EST LOG: restored log file
"000000010000005D00000048" from archive
2011-11-30 20:58:04 EST LOG: restored log file
"000000010000005D00000049" from archive
2011-11-30 20:58:05 EST LOG: restored log file
"000000010000005D0000004A" from archive
2011-11-30 20:58:06 EST LOG: restored log file
"000000010000005D0000004B" from archive
2011-11-30 20:58:07 EST LOG: restored log file
"000000010000005D0000004C" from archive
2011-11-30 20:58:07 EST LOG: startup process (PID 23167) was
terminated by signal 7: Bus error
2011-11-30 20:58:07 EST LOG: terminating any other active server processes

After I saw the same problem a few times in a row, I stuck a
sleep(20); call right after the "entering standby mode" message (hence
the modified message "entering standby mode, pausing"), so I could
have a few seconds to attach to the startup process with gdb and get a
backtrace; I'm not sure why, but a core dump wasn't produced
automatically, even though I think I've set ulimit appropriately.
Anyway, here is the backtrace:

(gdb) continue
Continuing.

Program received signal SIGBUS, Bus error.
0x0825b329 in ReadBuffer_common (smgr=0xa0be758, relpersistence=112 'p',
forkNum=MAIN_FORKNUM, blockNum=98560, mode=RBM_ZERO, strategy=0x0,
hit=0xbfa0403f "") at /usr/include/bits/string3.h:85
85 return __builtin___memset_chk (__dest, __ch, __len, __bos0 (__dest));
(gdb) bt
#0 0x0825b329 in ReadBuffer_common (smgr=0xa0be758, relpersistence=112 'p',
forkNum=MAIN_FORKNUM, blockNum=98560, mode=RBM_ZERO, strategy=0x0,
hit=0xbfa0403f "") at /usr/include/bits/string3.h:85
#1 0x0825bcdf in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM,
blockNum=98560, mode=RBM_ZERO, strategy=0x0) at bufmgr.c:272
#2 0x080d4998 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM,
blkno=98560, mode=RBM_ZERO) at xlogutils.c:296
#3 0x080c6b85 in RestoreBkpBlocks (lsn=..., record=0xa0b53c8, cleanup=0 '\000')
at xlog.c:3589
#4 0x080dd294 in gin_redo (lsn=..., record=0xa0b53c8) at ginxlog.c:676
#5 0x080d1759 in StartupXLOG () at xlog.c:6599
#6 0x080d42a5 in StartupProcessMain () at xlog.c:10028
#7 0x080eb637 in AuxiliaryProcessMain (argc=2, argv=<value optimized out>)
at bootstrap.c:417
#8 0x0823e5ac in StartChildProcess (type=StartupProcess) at postmaster.c:4498
#9 0x08240afe in PostmasterMain (argc=3, argv=0xa083838) at postmaster.c:1111
#10 0x081e1a88 in main (argc=3, argv=0xa083838) at main.c:199
(gdb)

Some more details about my setup:
The hot standby is an Ubuntu 11.04 32-bit OpenVZ VM. The master is a
Debian Squeeze 32-bit VM. Both master and hot standby are running
9.1.1 compiled from source with:
./configure --prefix=/home/postgres/runtime/ --enable-debug --with-python

The master is archiving WAL files to the hot standby using rsync. I
initialized $PGDATA on the standby using a simple:
rsync -avz postgres(at)master:/home/postgres/pgdata /home/postgres/

I'll attach my non-default pg_settings and recovery.conf file as well
for completeness. I notice gin_redo() in the backtrace; I am using a
full-text index like this:
"xxx_txtsearch" gin (to_tsvector('english'::regconfig, title || body))

on a large partitioned table.

Since the standby VM is new (I've used the same hosting provider
before, but this is the first time I tried an Ubuntu VM), it's
entirely possible I messed something up in the set-up. Or it could be
some flakiness with the VM itself.

But before I start blowing things away and trying to start from
scratch, I figured I'd leave the machine alone for a bit in case this
report is interesting and any more sleuthing would help.

Josh

Attachment Content-Type Size
pg_settings.txt text/plain 1.9 KB
recovery.conf application/octet-stream 219 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.1.1 hot standby startup gets sigbus
Date: 2011-12-01 02:17:18
Message-ID: 5628.1322705838@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
> I have a new 9.1.1 hot standby machine (VM). It consistently starts
> up, goes through recovery through these same WAL segments, and then
> gets a "Bus error":

Can you try that on 9.1 branch tip to see if it's already fixed?

regards, tom lane


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.1.1 hot standby startup gets sigbus
Date: 2011-12-01 03:36:51
Message-ID: CAK3UJREzQsnJWZyCQMfxiDgYA+5wHKT7xhKRBO2P4eNcO7ES0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Wed, Nov 30, 2011 at 9:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Can you try that on 9.1 branch tip to see if it's already fixed?

Hrm, don't think that helped - I get the same error in the logs using
a checkout of branch REL9_1_STABLE.

Josh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.1.1 hot standby startup gets sigbus
Date: 2011-12-01 03:50:36
Message-ID: 7793.1322711436@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
> On Wed, Nov 30, 2011 at 9:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Can you try that on 9.1 branch tip to see if it's already fixed?

> Hrm, don't think that helped - I get the same error in the logs using
> a checkout of branch REL9_1_STABLE.

Well, was worth a try, although the stack trace didn't look very close
to any of the GIN or hot-standby fixes that I can remember offhand.

Not sure what to make of it. If the stack trace can be believed,
it's crashing in one of the memsets in ReadBuffer_common() ... which
ought to be impossible, because they're fixed-length and always pointed
at a buffer.

It might be worth recompiling at -O0, first to see if that changes the
behavior and second to see if it changes the reported stack trace.

regards, tom lane


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.1.1 hot standby startup gets sigbus
Date: 2011-12-01 14:06:57
Message-ID: CAK3UJRGPXAzGO-R5tS40T9d9-FzYNWzdX0MGa4kM4L2D8Q8qNQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Wed, Nov 30, 2011 at 10:50 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
>> On Wed, Nov 30, 2011 at 9:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> It might be worth recompiling at -O0, first to see if that changes the
> behavior and second to see if it changes the reported stack trace.

Here's a backtrace from 9.1 tip with CFLAGS="-O0", server logs are as before:

Program received signal SIGBUS, Bus error.
0x00c9716b in ?? () from /lib/i386-linux-gnu/libc.so.6
(gdb) bt
#0 0x00c9716b in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0x082b3b2f in ReadBuffer_common (smgr=0x9ef68d0, relpersistence=112 'p',
forkNum=MAIN_FORKNUM, blockNum=86781, mode=RBM_ZERO, strategy=0x0,
hit=0xbfc5eeff "") at bufmgr.c:435
#2 0x082b3722 in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM,
blockNum=86781, mode=RBM_ZERO, strategy=0x0) at bufmgr.c:272
#3 0x080df3a0 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM,
blkno=86781, mode=RBM_ZERO) at xlogutils.c:296
#4 0x080d0bab in RestoreBkpBlocks (lsn=..., record=0x9eed480,
cleanup=0 '\000') at xlog.c:3589
#5 0x080eb1b8 in gin_redo (lsn=..., record=0x9eed480) at ginxlog.c:678
#6 0x080d7d91 in StartupXLOG () at xlog.c:6601
#7 0x080de215 in StartupProcessMain () at xlog.c:10043
#8 0x080fa737 in AuxiliaryProcessMain (argc=2, argv=0xbfc5f188)
at bootstrap.c:417
#9 0x082941a2 in StartChildProcess (type=StartupProcess) at postmaster.c:4499
#10 0x0828fcf5 in PostmasterMain (argc=3, argv=0x9ebb8e0) at postmaster.c:1112
#11 0x08218b64 in main (argc=3, argv=0x9ebb8e0) at main.c:199

Josh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.1.1 hot standby startup gets sigbus
Date: 2011-12-01 17:53:56
Message-ID: 8242.1322762036@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
> Here's a backtrace from 9.1 tip with CFLAGS="-O0", server logs are as before:

> Program received signal SIGBUS, Bus error.
> 0x00c9716b in ?? () from /lib/i386-linux-gnu/libc.so.6
> (gdb) bt
> #0 0x00c9716b in ?? () from /lib/i386-linux-gnu/libc.so.6
> #1 0x082b3b2f in ReadBuffer_common (smgr=0x9ef68d0, relpersistence=112 'p',
> forkNum=MAIN_FORKNUM, blockNum=86781, mode=RBM_ZERO, strategy=0x0,
> hit=0xbfc5eeff "") at bufmgr.c:435
> #2 0x082b3722 in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM,
> blockNum=86781, mode=RBM_ZERO, strategy=0x0) at bufmgr.c:272
> #3 0x080df3a0 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM,
> blkno=86781, mode=RBM_ZERO) at xlogutils.c:296
> #4 0x080d0bab in RestoreBkpBlocks (lsn=..., record=0x9eed480,
> cleanup=0 '\000') at xlog.c:3589
> #5 0x080eb1b8 in gin_redo (lsn=..., record=0x9eed480) at ginxlog.c:678

[ scratches head... ] Well, that proves the previous backtrace wasn't
lying to us, but it's hard to see how this happens. AFAICT this could
only mean that the buffer pointer returned by BufHdrGetBlock is bad,
and how could that happen? I'm suspicious that there must be something
broken about your VM environment. Can you reproduce it outside the VM,
or even better create a reproducible test case?

regards, tom lane


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: 9.1.1 hot standby startup gets sigbus
Date: 2011-12-02 14:31:55
Message-ID: CAK3UJREBcyVBtr8D7vMfU=uDdkjXkrPnGcuy8erYB0tMfKe1LA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

[Moving back on-list. Tom generously offered to look at the server in
question, since it seemed likely that a testcase would be difficult or
impossible to reproduce in this case]

On Fri, Dec 2, 2011 at 12:07 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Well, poking around in the process at the moment of SIGBUS, I find that
> it's computed a buffer address of
>
>        (gdb) p bufBlock
>        $12 = (Block) 0xb6c49060
>
> which appears to be perfectly reasonable: it's working with buffer
> header number 15514 and that's equal to BufferBlocks + (15514 * 8192).
> However, trying to dump that 8K area out stops before it gets to the
> end:
>
>        0xb6c4afc0:     0       0       0       0
>        0xb6c4afd0:     0       0       0       0
>        0xb6c4afe0:     0       0       0       0
>        0xb6c4aff0:     0       0       0       0
>        0xb6c4b000:     Cannot access memory at address 0xb6c4b000
>
>        (gdb) p 0xb6c4b000 - 0xb6c49060
>        $13 = 8096
>
> that is, there isn't any memory mapped beyond 0xb6c4b000, which is why
> we're getting SIGBUS.  So a significant fraction of the intended shared
> buffer array (16384 buffers) simply Ain't There.  The reason it crashes
> at this exact spot seems to be simply that this is where we've used up
> enough of the buffers to need one that's not (all) there.
>
> So the question is, has Postgres gone nuts about how big it needs its
> shared memory to be?  I don't think so.  Looking into the memory map
> for the process, we've got
>
> postgres(at)S04001011820ASA:~$ cat /proc/11187/maps
> 00110000-00112000 r-xp 00000000 fd:04 412754092                          /lib/i386-linux-gnu/libdl-2.13.so
> 00112000-00113000 r--p 00001000 fd:04 412754092                          /lib/i386-linux-gnu/libdl-2.13.so
> 00113000-00114000 rw-p 00002000 fd:04 412754092                          /lib/i386-linux-gnu/libdl-2.13.so
> 00247000-0026b000 r-xp 00000000 fd:04 412754075                          /lib/i386-linux-gnu/libm-2.13.so
> 0026b000-0026c000 r--p 00023000 fd:04 412754075                          /lib/i386-linux-gnu/libm-2.13.so
> 0026c000-0026d000 rw-p 00024000 fd:04 412754075                          /lib/i386-linux-gnu/libm-2.13.so
> 0026d000-003c7000 r-xp 00000000 fd:04 412754120                          /lib/i386-linux-gnu/libc-2.13.so
> 003c7000-003c8000 ---p 0015a000 fd:04 412754120                          /lib/i386-linux-gnu/libc-2.13.so
> 003c8000-003ca000 r--p 0015a000 fd:04 412754120                          /lib/i386-linux-gnu/libc-2.13.so
> 003ca000-003cb000 rw-p 0015c000 fd:04 412754120                          /lib/i386-linux-gnu/libc-2.13.so
> 003cb000-003ce000 rw-p 00000000 00:00 0
> 00611000-0062d000 r-xp 00000000 fd:04 412754069                          /lib/i386-linux-gnu/ld-2.13.so
> 0062d000-0062e000 r--p 0001b000 fd:04 412754069                          /lib/i386-linux-gnu/ld-2.13.so
> 0062e000-0062f000 rw-p 0001c000 fd:04 412754069                          /lib/i386-linux-gnu/ld-2.13.so
> 00995000-0099f000 r-xp 00000000 fd:04 412754174                          /lib/i386-linux-gnu/libnss_files-2.13.so
> 0099f000-009a0000 r--p 00009000 fd:04 412754174                          /lib/i386-linux-gnu/libnss_files-2.13.so
> 009a0000-009a1000 rw-p 0000a000 fd:04 412754174                          /lib/i386-linux-gnu/libnss_files-2.13.so
> 00e57000-00e58000 r-xp 00000000 00:00 0                                  [vdso]
> 08048000-08548000 r-xp 00000000 fd:04 419563409                          /home/postgres/runtime/bin/postgres
> 08548000-08549000 r--p 004ff000 fd:04 419563409                          /home/postgres/runtime/bin/postgres
> 08549000-08551000 rw-p 00500000 fd:04 419563409                          /home/postgres/runtime/bin/postgres
> 08551000-08591000 rw-p 00000000 00:00 0
> 08c03000-08c5d000 rw-p 00000000 00:00 0                                  [heap]
> aed2f000-b7503000 rw-s 00000000 91:8f 1212416
> b7503000-b7703000 r--p 00000000 fd:04 412486363                          /usr/lib/locale/locale-archive
> b7703000-b7705000 rw-p 00000000 00:00 0
> b7709000-b770a000 r--p 002dd000 fd:04 412486363                          /usr/lib/locale/locale-archive
> b770a000-b770c000 rw-p 00000000 00:00 0
> bff75000-bff97000 rw-p 00000000 00:00 0                                  [stack]
> postgres(at)S04001011820ASA:~$
>
> ie, there's alleged to be shared memory from 0xaed2f000 up to
> 0xb7503000, and that squares exactly with where PG thinks its
> shared memory is:
>
> (gdb) p UsedShmemSegAddr
> $34 = (void *) 0xaed2f000
> (gdb) p *(PGShmemHeader *) UsedShmemSegAddr
> $35 = {magic = 679834894, creatorPID = 11185, totalsize = 142426112, freeoffset = 142125504, index = 0xaedafb48,
>  device = 37259, inode = 414974343}
> (gdb) p 0xb7503000 - 0xaed2f000
> $36 = 142426112
> (gdb)
>
> So basically, you've got a broken kernel here: it claimed to give PG
> circa 135MB of memory, but what's actually there is only about 128MB.
> I don't see any connection between those numbers and the shmmax/shmall
> settings, either --- so I think this must be some busted implementation
> of a VM-level limitation.
>
> I see no Postgres bug here.  You need to take this up with your hosting
> provider, who have given you a faulty kernel.
>
>                        regards, tom lane
>

On Fri, Dec 2, 2011 at 1:03 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> ... btw, if I'm right in guessing that this is just an
> arbitrarily-imposed limit on the amount of shared memory available to
> Postgres, then knocking down shared_buffers by a quarter or so ought
> to do as a workaround, till you figure out what's going on.
>
>                        regards, tom lane
>

Wow, that's interesting, though I can't say I'm completely surprised.
You were spot on about turning down shared_buffers - I'm trying it at
96MB, down from 128MB, and the recovery process is chugging along.
I'll probably just ditch this VM and hosting provider (chvps aka
privatelayer, in case anyone wants to stay away).

Thanks for the investigation!

Josh