postmaster segfault when using SELECT on a table

Lists: pgsql-bugs
From: Karsten Desler <kd(at)link11(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: postmaster segfault when using SELECT on a table
Date: 2008-04-26 10:39:06
Message-ID: 20080426103906.GY21811@soohrt.org

Hello,

I have a smallish postgres database that segfaults every time I try to
access a certain row in a certain column.

xxx=# select file_id from dbfiles offset 632531 limit 1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

file_id is a varchar(40).

The database is a self-compiled 8.2.6 on an x86_64 Linux 2.6 machine
that was running well for about 8 months before it started exhibiting the
problem a couple of days ago.
I tried upgrading to 8.2.7 but the problem is still occurring.

I have recompiled the postgres server without -O2 and with -g and have
captured a coredump. Here's a bt and the first section of a bt full.
If you need more information, please don't hesitate to contact me.

Core was generated by `postgres: postgres xxx [local] SELECT '.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000067c01b in pglz_decompress (source=0x2b3ab8060910, dest=0xa57744 "d") at pg_lzcompress.c:678
678 *bp = bp[-off];
(gdb) bt
#0 0x000000000067c01b in pglz_decompress (source=0x2b3ab8060910, dest=0xa57744 "d") at pg_lzcompress.c:678
#1 0x00000000004613b6 in heap_tuple_untoast_attr (attr=0x2b3ab8060910) at tuptoaster.c:128
#2 0x00000000006a3a19 in pg_detoast_datum (datum=0x2b3ab8060910) at fmgr.c:1973
#3 0x00000000004423d0 in printtup (slot=0xa3ff78, self=0xa3de00) at printtup.c:317
#4 0x000000000053d184 in ExecSelect (slot=0xa3ff78, dest=0xa3de00, estate=0xa3fe00) at execMain.c:1310
#5 0x000000000053cfea in ExecutePlan (estate=0xa3fe00, planstate=0xa40120, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0xa3de00) at execMain.c:1236
#6 0x000000000053b9aa in ExecutorRun (queryDesc=0xa335d0, direction=ForwardScanDirection, count=0) at execMain.c:241
#7 0x00000000005f9f4c in PortalRunSelect (portal=0xa4b6c0, forward=1 '\001', count=0, dest=0xa3de00) at pquery.c:831
#8 0x00000000005f9bc3 in PortalRun (portal=0xa4b6c0, count=9223372036854775807, dest=0xa3de00, altdest=0xa3de00, completionTag=0x7fff01602ac0 "") at pquery.c:656
#9 0x00000000005f4737 in exec_simple_query (query_string=0xa05100 "select file_id from dbfiles offset 632531 limit 1;") at postgres.c:939
#10 0x00000000005f8376 in PostgresMain (argc=4, argv=0x9672c8, username=0x967290 "postgres") at postgres.c:3424
#11 0x00000000005c4318 in BackendRun (port=0x9634d0) at postmaster.c:2934
#12 0x00000000005c38c5 in BackendStartup (port=0x9634d0) at postmaster.c:2561
#13 0x00000000005c154a in ServerLoop () at postmaster.c:1214
#14 0x00000000005c0f70 in PostmasterMain (argc=3, argv=0x946230) at postmaster.c:966
#15 0x0000000000568d76 in main (argc=3, argv=0x946230) at main.c:188
(gdb) bt full
#0 0x000000000067c01b in pglz_decompress (source=0x2b3ab8060910, dest=0xa57744 "d") at pg_lzcompress.c:678
dp = (const unsigned char *) 0x2b3ab80707ea "`"
dend = (const unsigned char *) 0x2b3abff0c03d "6.21.163"
bp = (unsigned char *) 0xa7b000 <Address 0xa7b000 out of bounds>
ctrl = 3 '\003'
ctrlc = 5
len = 5
off = 1633
destsize = 0
__func__ = "pglz_decompress"
--snip

Thanks in advance,
Karsten Desler


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Karsten Desler <kd(at)link11(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: postmaster segfault when using SELECT on a table
Date: 2008-04-26 17:24:56
Message-ID: 13142.1209230696@sss.pgh.pa.us

Karsten Desler <kd(at)link11(dot)de> writes:
> I have a smallish postgres database that segfaults every time I try to
> access a certain row in a certain column.

Looks like a corrupted-data issue to me. It might be interesting to
dump the page with pg_filedump and see if there's any apparent pattern
to the damage.

regards, tom lane


From: Karsten Desler <kd(at)link11(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: postmaster segfault when using SELECT on a table
Date: 2008-04-26 18:20:03
Message-ID: 20080426182003.GA24990@soohrt.org

* Tom Lane wrote:
> Karsten Desler <kd(at)link11(dot)de> writes:
> > I have a smallish postgres database that segfaults every time I try to
> > access a certain row in a certain column.
>
> Looks like a corrupted-data issue to me. It might be interesting to
> dump the page with pg_filedump and see if there's any apparent pattern
> to the damage.

Thanks, I'll try to play with pg_filedump later tonight.
I've never had problems with corruption of on-disk data structures on this
(or on many other) postgres servers, and I'm perfectly fine with chalking it
up to hardware problems.

I don't know much about the postgres architecture and I don't know if bounds
checking on-disk values on a read makes a lot of sense since usually one
should be able to assume that there are no randomly flipped bits; but it
would've been nice to have a sensible log entry as to what really
happened.

Anyway, for future reference: Assuming that this is the only corruption,
can I just UPDATE (or DELETE and reINSERT) the offending entry (maybe with a
following REINDEX/VACUUM?) or do I need to restore a backup?
If possible, I'd prefer the UPDATE solution, of course, since it can be done
without any downtime.

Keep up the good work.

Best regards,
Karsten Desler


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Karsten Desler <kd(at)link11(dot)de>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: postmaster segfault when using SELECT on a table
Date: 2008-04-26 19:29:32
Message-ID: 19964.1209238172@sss.pgh.pa.us

Karsten Desler <kd(at)link11(dot)de> writes:
> I don't know much about the postgres architecture and I don't know if bounds
> checking on-disk values on a read makes a lot of sense since usually one
> should be able to assume that there are no randomly flipped bits; but it
> would've been nice to have a sensible log entry as to what really
> happened.

FWIW, there is code in CVS HEAD that detects simple cases of corrupt
compressed data, though it's anyone's guess if it would've caught your
example here.

> Anyway, for future reference: Assuming that this is the only corruption,
> can I just UPDATE (or DELETE and reINSERT) the offending entry (maybe with a
> following REINDEX/VACUUM?) or do I need to restore a backup?

If only the one row is clobbered, you should be able to just delete and
re-insert it, assuming you can identify it in a way that doesn't crash
in itself (ctid is probably about the safest). Not sure if an UPDATE
would be safe.

My suspicion though is that you'll find that a large portion of that
page is damaged; that's usually what we've seen in such cases in the
past.

regards, tom lane


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Karsten Desler <kd(at)link11(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: postmaster segfault when using SELECT on a table
Date: 2008-04-28 11:24:58
Message-ID: 4815B40A.8090800@sun.com

Karsten Desler wrote:
>
> I don't know much about the postgres architecture and I don't know if bounds
> checking on-disk values on a read makes a lot of sense since usually one
> should be able to assume that there are no randomly flipped bits; but it
> would've been nice to have a sensible log entry as to what really
> happened.

I have attached a patch backported from HEAD to 8.2. You can try it. It has a
small performance penalty, but it does not crash on corrupted data.

Zdenek

Attachment           Content-Type  Size
pg_lzcompress.patch  text/x-patch  4.1 KB
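
For readers following the thread: the backtrace above faults at
`*bp = bp[-off]`, the step where the decompressor copies bytes from earlier
in its own output. The following is a minimal sketch of how such a copy can
be bounds-checked so that a corrupt tag produces an error instead of a
segfault. It is not the attached patch and not PostgreSQL source; the
function name `copy_backref_checked` and its parameters are hypothetical,
chosen only for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not from pg_lzcompress.c): copy `len` bytes that lie
 * `off` bytes behind the current write position, as an LZ back-reference
 * does. Returns 0 on success, -1 if the tag would read before the start of
 * the output buffer or write past its end.
 */
static int
copy_backref_checked(unsigned char *dest_start, unsigned char *dest_end,
                     unsigned char **bp_ptr, size_t off, size_t len)
{
    unsigned char *bp = *bp_ptr;

    /* The offset must be non-zero and must stay inside the output so far. */
    if (off == 0 || off > (size_t) (bp - dest_start))
        return -1;

    /* The copy must not run past the end of the output buffer. */
    if (len > (size_t) (dest_end - bp))
        return -1;

    /* Copy byte by byte: source and target overlap when off < len. */
    while (len-- > 0)
    {
        *bp = *(bp - off);
        bp++;
    }

    *bp_ptr = bp;
    return 0;
}
```

A caller would treat a return value of -1 as "compressed data is corrupt"
and report an error rather than continue decompressing. The two extra
comparisons per tag are the kind of cost that the small performance penalty
mentioned above would come from, assuming the real patch works along
similar lines.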

From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karsten Desler <kd(at)link11(dot)de>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: postmaster segfault when using SELECT on a table
Date: 2008-04-28 11:31:20
Message-ID: 4815B588.5090502@sun.com

Tom Lane wrote:
> My suspicion though is that you'll find that a large portion of that
> page is damaged; that's usually what we've seen in such cases in the
> past.

I think it can happen only if the corruption is smaller than the TOAST chunk
size. Otherwise, the page header or the tuple header plus chunk id would be
corrupted, and that should be reported in another place.

Zdenek


From: Karsten Desler <kd(at)link11(dot)de>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: postmaster segfault when using SELECT on a table
Date: 2008-04-29 11:51:40
Message-ID: 20080429115140.GI24990@soohrt.org

* Zdenek Kotala wrote:
> Karsten Desler wrote:
> >
> > I don't know much about the postgres architecture and I don't know if bounds
> > checking on-disk values on a read makes a lot of sense since usually one
> > should be able to assume that there are no randomly flipped bits; but it
> > would've been nice to have a sensible log entry as to what really
> > happened.
>
> I have attached a patch backported from HEAD to 8.2. You can try it. It has a
> small performance penalty, but it does not crash on corrupted data.

Thank you very much! I have restored a backup of the corrupt postgres
data files on a second server and I can confirm that postmaster no
longer crashes with the patch applied.

Thanks,
Karsten