Another PANIC corrupt index/crash ...any thoughts?

From: Jeff Amiel <becauseimjeff(at)yahoo(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Another PANIC corrupt index/crash ...any thoughts?
Date: 2010-02-01 14:45:36
Message-ID: 433140.68997.qm@web65505.mail.ac4.yahoo.com
Lists: pgsql-general

About a month ago I posted about a database crash possibly caused by a corrupt index:

Dec 30 17:41:57 db-1 postgres[28957]: [ID 748848 local0.crit] [34004622-1] 2009-12-30 17:41:57.825 CST 28957PANIC: right sibling 2019 of block 2018 is not next child of 1937 in index "sl_log_2_idx1"

It has since happened again with a DIFFERENT index (interestingly, also a Slony-related index):

Jan 29 15:17:42 db-1 postgres[29025]: [ID 748848 local0.crit] [4135622-1] 2010-01-29 15:17:42.915 CST 29025PANIC: right sibling 183 of block 182 is not next child of 158 in index "sl_seqlog_idx"

I re-indexed the table and restarted the database, and all appears well. (I shut down autovacuum and Slony for a while first to get my feet underneath me, then restarted them after a few hours with no apparent ill effects.)
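
For the record, the rebuild itself was nothing exotic; roughly the following, where the database name and the Slony cluster schema "_mycluster" are placeholders for our actual names:

# slon daemons stopped and autovacuum disabled first, then:
psql -d mydb -c 'REINDEX INDEX _mycluster.sl_seqlog_idx;'
# or rebuild every index on the table in one go:
psql -d mydb -c 'REINDEX TABLE _mycluster.sl_seqlog;'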

Coincidentally (or not), we started getting disk errors about a minute AFTER the above error (the DB storage is on a fibre-attached SAN):

/var/log/archive/log-2010-01-29.log:Jan 29 15:18:50 db-1 scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
/var/log/archive/log-2010-01-29.log:Jan 29 15:18:50 db-1 scsi: [ID 243001 kern.warning] WARNING: /pci(at)0,0/pci10de,5d(at)d/pci1077,142(at)0/fp(at)0,0 (fcp1):
/var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk(at)g000b08001c001958 (sd9):
/var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice] Requested Block: 206265378 Error Block: 206265378
/var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice] Vendor: Pillar Serial Number:
/var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
/var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Stack trace from recent crash is below:

Program terminated with signal 6, Aborted.
#0 0xfed00c57 in _lwp_kill () from /lib/libc.so.1
(gdb) bt
#0 0xfed00c57 in _lwp_kill () from /lib/libc.so.1
#1 0xfecfe40e in thr_kill () from /lib/libc.so.1
#2 0xfecad083 in raise () from /lib/libc.so.1
#3 0xfec90b19 in abort () from /lib/libc.so.1
#4 0x0821b6ea in errfinish (dummy=0) at elog.c:471
#5 0x0821c58f in elog_finish (elevel=22, fmt=0x82b7200 "right sibling %u of block %u is not next child of %u in index \"%s\"") at elog.c:964
#6 0x0809e0a8 in _bt_pagedel (rel=0x8602f78, buf=377580, stack=0x881d660, vacuum_full=0 '\0') at nbtpage.c:1141
#7 0x0809f73d in btvacuumscan (info=0x8043f60, stats=0x8578410, callback=0, callback_state=0x0, cycleid=20894) at nbtree.c:936
#8 0x0809fb6d in btbulkdelete (fcinfo=0x0) at nbtree.c:547
#9 0x0821f268 in FunctionCall4 (flinfo=0x0, arg1=0, arg2=0, arg3=0, arg4=0) at fmgr.c:1215
#10 0x0809a7a7 in index_bulk_delete (info=0x8043f60, stats=0x0, callback=0x812fea0 <lazy_tid_reaped>, callback_state=0x85765e8) at indexam.c:573
#11 0x0812fe2c in lazy_vacuum_index (indrel=0x8602f78, stats=0x85769c8, vacrelstats=0x85765e8) at vacuumlazy.c:660
#12 0x08130432 in lazy_vacuum_rel (onerel=0x8602140, vacstmt=0x85d9f48) at vacuumlazy.c:487
#13 0x0812e7e8 in vacuum_rel (relid=140353352, vacstmt=0x85d9f48, expected_relkind=114 'r') at vacuum.c:1107
#14 0x0812f832 in vacuum (vacstmt=0x85d9f48, relids=0x85d9f38) at vacuum.c:400
#15 0x08186cee in AutoVacMain (argc=0, argv=0x0) at autovacuum.c:914
#16 0x08187150 in autovac_start () at autovacuum.c:178
#17 0x0818bec5 in ServerLoop () at postmaster.c:1252
#18 0x0818d045 in PostmasterMain (argc=3, argv=0x83399a8) at postmaster.c:966
#19 0x08152ba6 in main (argc=3, argv=0x83399a8) at main.c:188

Any thoughts on how I should proceed?
We are planning an upgrade to 8.4 in the short term, but I can see no evidence of fixes since 8.2 that would relate to index corruption. I have no real evidence of bad disks; iostat -E reports:

# iostat -E
sd2 Soft Errors: 1 Hard Errors: 4 Transport Errors: 0
Vendor: Pillar Product: Axiom 300 Revision: 0000 Serial No:
Size: 2.20GB <2200567296 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 4 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd3 Soft Errors: 1 Hard Errors: 32 Transport Errors: 0
Vendor: Pillar Product: Axiom 300 Revision: 0000 Serial No:
Size: 53.95GB <53948448256 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 32 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd7 Soft Errors: 1 Hard Errors: 40 Transport Errors: 8
Vendor: Pillar Product: Axiom 300 Revision: 0000 Serial No:
Size: 53.95GB <53948448256 bytes>
Media Error: 0 Device Not Ready: 1 No Device: 33 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd8 Soft Errors: 1 Hard Errors: 34 Transport Errors: 0
Vendor: Pillar Product: Axiom 300 Revision: 0000 Serial No:
Size: 107.62GB <107622432256 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 34 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd9 Soft Errors: 1 Hard Errors: 32 Transport Errors: 2
Vendor: Pillar Product: Axiom 300 Revision: 0000 Serial No:
Size: 215.80GB <215796153856 bytes>
Media Error: 0 Device Not Ready: 1 No Device: 29 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0

Any insight would be appreciated.
PostgreSQL 8.2.12 on i386-pc-solaris2.10, compiled by GCC gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath)


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Jeff Amiel <becauseimjeff(at)yahoo(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Another PANIC corrupt index/crash ...any thoughts?
Date: 2010-02-01 15:22:13
Message-ID: dcc563d11002010722j1f76067fl211a2cc56d45c56d@mail.gmail.com
Lists: pgsql-general

On Mon, Feb 1, 2010 at 7:45 AM, Jeff Amiel <becauseimjeff(at)yahoo(dot)com> wrote:
> About a month ago I posted about a database crash possibly caused by a corrupt index.
> Coincidentally (or not), we started getting disk errors about a minute AFTER the above error (the DB storage is on a fibre-attached SAN):

Not likely a coincidence.

> /var/log/archive/log-2010-01-29.log:Jan 29 15:18:50 db-1 scsi_vhci: [ID 734749 kern.warning] WARNING: vhci_scsi_reset 0x1
> /var/log/archive/log-2010-01-29.log:Jan 29 15:18:50 db-1 scsi: [ID 243001 kern.warning] WARNING: /pci(at)0,0/pci10de,5d(at)d/pci1077,142(at)0/fp(at)0,0 (fcp1):
> /var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk(at)g000b08001c001958 (sd9):
> /var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice]  Requested Block: 206265378                 Error Block: 206265378
> /var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice]  Vendor: Pillar                             Serial Number:
> /var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice]  Sense Key: Unit Attention
> /var/log/archive/log-2010-01-29.log:Jan 29 15:18:52 db-1 scsi: [ID 107833 kern.notice]  ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
>
> Any thoughts on how I should proceed?

Figure out what's broken in your hardware? It looks like a driver issue to me.

> We are planning an upgrade to 8.4 in the short term, but I can see no evidence of fixes since 8.2 that would relate to index corruption.

This is not a PostgreSQL issue; it is a bad hardware / driver issue.
PostgreSQL cannot cause a SCSI reset and the like on its own; something
has to be broken in the OS / hardware for that to happen.
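
Since you're on Solaris 10, the fault manager and the kernel error counters are a reasonable place to start on the host side (generic commands, nothing Pillar-specific):

fmdump -e        # kernel error reports (ereports); add -V for full detail
fmadm faulty     # anything fmd has actually diagnosed as faulty
iostat -En       # per-device error counters with vendor / serial detail

If the ereports show transport errors or device resets lining up with the PANIC, that pretty much settles it.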

> I have no real evidence of bad disks; iostat -E reports:

No, but this iostat output is evidence of a bad SAN driver / SAN or
something in that layer.

>
> # iostat -E
> sd2       Soft Errors: 1 Hard Errors: 4 Transport Errors: 0
> Vendor: Pillar   Product: Axiom 300        Revision: 0000 Serial No:
> Size: 2.20GB <2200567296 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 4 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0
> sd3       Soft Errors: 1 Hard Errors: 32 Transport Errors: 0
> Vendor: Pillar   Product: Axiom 300        Revision: 0000 Serial No:
> Size: 53.95GB <53948448256 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 32 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0
> sd7       Soft Errors: 1 Hard Errors: 40 Transport Errors: 8
> Vendor: Pillar   Product: Axiom 300        Revision: 0000 Serial No:
> Size: 53.95GB <53948448256 bytes>
> Media Error: 0 Device Not Ready: 1 No Device: 33 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0
> sd8       Soft Errors: 1 Hard Errors: 34 Transport Errors: 0
> Vendor: Pillar   Product: Axiom 300        Revision: 0000 Serial No:
> Size: 107.62GB <107622432256 bytes>
> Media Error: 0 Device Not Ready: 0 No Device: 34 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0
> sd9       Soft Errors: 1 Hard Errors: 32 Transport Errors: 2
> Vendor: Pillar   Product: Axiom 300        Revision: 0000 Serial No:
> Size: 215.80GB <215796153856 bytes>
> Media Error: 0 Device Not Ready: 1 No Device: 29 Recoverable: 0
> Illegal Request: 1 Predictive Failure Analysis: 0


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Jeff Amiel <becauseimjeff(at)yahoo(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Another PANIC corrupt index/crash ...any thoughts?
Date: 2010-02-01 15:26:08
Message-ID: dcc563d11002010726k48b691c2i5c40c5d94b11308@mail.gmail.com
Lists: pgsql-general

On Mon, Feb 1, 2010 at 7:45 AM, Jeff Amiel <becauseimjeff(at)yahoo(dot)com> wrote:
>  I have no real evidence of bad disks; iostat -E reports:

Note that on a SAN you're not likely to see anything in iostat that
says "bad disk block", since the SAN hides all that from you, presents
the disks in it as one big disk, and handles things like disk block
errors itself. Most likely you need to use your SAN management tools to
troubleshoot this, assuming it's not a driver issue.
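
From the host side about all you can sanity-check is path state; since scsi_vhci (MPxIO) shows up in your logs, something along these lines (standard Solaris 10 FC tooling, device names below are examples only):

fcinfo hba-port                    # HBA link state and negotiated speed
fcinfo remote-port -p <hba WWN>    # state of the array's target ports, WWN taken from the previous command
mpathadm list lu                   # multipath LUs and how many of their paths are up
luxadm display /dev/rdsk/c2t0d0s2  # per-LUN state as seen by the HBA driver (substitute a real device)

Anything deeper (bad blocks, failed drives, controller failover inside the Axiom) will only show up in Pillar's own management tools.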