Re: pg_dumpall failing from possible corrupted shared memory

From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-26 03:07:49
Message-ID: 114069.9815.qm@web31804.mail.mud.yahoo.com
Lists: pgsql-general

I am trying to perform a complete dump of my database cluster. However, I am getting an error when
pg_dumpall reaches one particular database. I have reformatted the text below so that it doesn't get
chopped to pieces when it is sent.

Here is the output from pg_dumpall:

postgres@db_server01~ $ pg_dumpall > test.dump
pg_dump: SQL command failed
pg_dump: Error message from server:
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_dump: The command was:
SELECT t.tableoid,
t.oid,
t.relname as indexname,
pg_catalog.pg_get_indexdef(i.indexrelid) as indexdef,
t.relnatts as indnkeys,
i.indkey,
i.indisclustered,
c.contype,
c.conname,
c.tableoid as contableoid,
c.oid as conoid,
(SELECT spcname
FROM pg_catalog.pg_tablespace s
WHERE s.oid = t.reltablespace) as tablespace
FROM pg_catalog.pg_index i
JOIN pg_catalog.pg_class t
ON (t.oid = i.indexrelid)
LEFT JOIN pg_catalog.pg_depend d
ON (d.classid = t.tableoid
AND d.objid = t.oid
AND d.deptype = 'i')
LEFT JOIN pg_catalog.pg_constraint c
ON (d.refclassid = c.tableoid
AND d.refobjid = c.oid)
WHERE i.indrelid = '16737'::pg_catalog.oid
ORDER BY indexname
pg_dumpall: pg_dump failed on database "mydb", exiting

Here is the output from the logs when this error occurs, minus a repeat of the query above:

10-25 18:51:43 PDT% LOG: server process (PID 12899) was
terminated by signal 11
10-25 18:51:43 PDT% LOG: terminating any other active
server processes
10-25 18:51:43 PDT%postgres WARNING: terminating connection
because of crash of another server process
10-25 18:51:43 PDT%postgres DETAIL:
The postmaster has commanded this server process to
roll back the current transaction and exit, because
another server process exited abnormally and possibly
corrupted shared memory.
10-25 18:51:43 PDT%postgres HINT:
In a moment you should be able to reconnect to the
database and repeat your command.
10-25 18:51:43 PDT% LOG: all server processes terminated;
reinitializing
10-25 18:51:43 PDT% LOG: database system was interrupted
at 2006-10-25 18:50:42 PDT
10-25 18:51:43 PDT% LOG: checkpoint record is at 0/8E2DBA08
10-25 18:51:43 PDT% LOG: redo record is at 0/8E2DBA08;
undo record is at 0/0; shutdown TRUE
10-25 18:51:43 PDT% LOG: next transaction ID: 168758;
next OID: 621178
10-25 18:51:43 PDT% LOG: next MultiXactId: 45;
next MultiXactOffset: 95
10-25 18:51:43 PDT% LOG: database system was not properly shut down;
automatic recovery in progress
10-25 18:51:43 PDT% LOG: record with zero length at 0/8E2DBA4C
10-25 18:51:43 PDT% LOG: redo is not required
10-25 18:51:43 PDT% LOG: database system is ready
10-25 18:51:43 PDT% LOG: transaction ID wrap limit is 1073790580,
limited by database "postgres"

Any help is appreciated.

Regards,

Richard Broersma Jr.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-26 04:35:21
Message-ID: 8230.1161837321@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
> 10-25 18:51:43 PDT% LOG: server process (PID 12899) was
> terminated by signal 11

Stack trace from coredump, please? And exactly which PG version is
this?

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-26 05:05:06
Message-ID: 575649.49718.qm@web31813.mail.mud.yahoo.com
Lists: pgsql-general

> > 10-25 18:51:43 PDT% LOG: server process (PID 12899) was
> > terminated by signal 11
>
> Stack trace from coredump, please? And exactly which PG version is
> this?

The version is 8.1.4 on:
Linux version 2.6.17-gentoo-r7 (root(at)db_server01) (gcc version 4.1.1 (Gentoo 4.1.1)) #8 Sun Oct 8
20:28:34 PDT 2006

However, I did not find any core dump files. Will I need to rebuild postgres to produce core dumps?

Regards,

Richard Broersma Jr.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-26 13:57:17
Message-ID: 13003.1161871037@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
> I did not find any core dump files. Will I need to rebuild postgres to produce core dumps?

No, you just need to make sure the postmaster is started under
"ulimit -c unlimited", not "ulimit -c 0" which is the default
under many Linuxen. Adding this to the start script is the
easiest way usually.
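
For example, a fragment along these lines near the top of the start script would do it
(the script location and data directory path here are only illustrative):

    # e.g. in /etc/init.d/postgresql, before the server is launched
    ulimit -c unlimited
    su postgres -c "pg_ctl start -D /var/lib/pgsql/data"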

If you can't get anything but numbers from gdb's stack trace
then you have a "stripped" executable and you will need to
rebuild to get a useful stack trace.
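
A quick way to check in advance whether the installed binary is stripped (the path lookup
is illustrative; "not stripped" in the output is what you want):

    $ file $(which postgres)
    # if it reports "stripped", a rebuild (e.g. with configure --enable-debug)
    # is needed to get readable symbol names in the trace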

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-26 23:58:14
Message-ID: 950644.56008.qm@web31812.mail.mud.yahoo.com
Lists: pgsql-general

> No, you just need to make sure the postmaster is started under
> "ulimit -c unlimited", not "ulimit -c 0" which is the default
> under many Linuxen. Adding this to the start script is the
> easiest way usually.
>
> If you can't get anything but numbers from gdb's stack trace
> then you have a "stripped" executable and you will need to
> rebuild to get a useful stack trace.

You'll have to forgive me. I've never used gdb before, and I am not sure I am executing gdb with the
correct syntax. However, I was able to produce a core file. Here is the syntax I used that
produced the most output, although it doesn't look like what you would want.

postgres@db_server01 /home/data $ gdb postgres ./core
GNU gdb 6.4
Copyright 2005 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".

warning: core file may not match specified executable file.
(no debugging symbols found)
Core was generated by `postgres: postgres mydb [local] SELECT '.
Program terminated with signal 11, Segmentation fault.

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /usr/lib/libssl.so.0.9.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libssl.so.0.9.7
Reading symbols from /usr/lib/libcrypto.so.0.9.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libcrypto.so.0.9.7
Reading symbols from /lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libz.so.1
Reading symbols from /lib/libreadline.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libreadline.so.5
Reading symbols from /lib/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/libncurses.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/libncurses.so.5
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_files.so.2
#0 0xb7cc871b in strtouq () from /lib/libc.so.6
gdb> stack
#0 0xb7cc871b in strtouq () from /lib/libc.so.6
#1 0xb7cc847f in __strtol_internal () from /lib/libc.so.6
#2 0x08178404 in nodeToString ()
#3 0x081794fa in parseNodeString ()
#4 0x0817b2ff in nodeRead ()
#5 0x0817b16f in nodeRead ()
#6 0x081780ed in nodeToString ()
#7 0x081795fa in parseNodeString ()
#8 0x0817b2ff in nodeRead ()
#9 0x0817b16f in nodeRead ()
#10 0x0817b839 in stringToNode ()
#11 0x082264fb in pg_get_constraintdef ()
#12 0x0822667e in pg_get_indexdef ()
#13 0x08144ff5 in ExecProject ()
#14 0x081449ef in ExecProject ()
#15 0x08153aa5 in ExecMergeJoin ()
#16 0x081441bf in ExecProcNode ()
#17 0x081552f5 in ExecSort ()
#18 0x081441ed in ExecProcNode ()
#19 0x08143574 in ExecutorRun ()
#20 0x081dcbd9 in PostgresMain ()
#21 0x081dd1d8 in PortalRun ()
#22 0x081d8fab in pg_parse_and_rewrite ()
#23 0x081db364 in PostgresMain ()
#24 0x081a7db4 in ClosePostmasterPorts ()
#25 0x081a8ecc in PostmasterMain ()
#26 0x08165629 in main ()
gdb>

Please let me know what I can do to produce output that is more in line with what you expect.

Thanks for the help.

Regards,

Richard Broersma Jr.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 00:28:23
Message-ID: 27547.1161908903@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
> You'll have to forgive me. I've never used gdb before, and I am not sure I
> am executing gdb with the correct syntax. However, I was able to
> produce a core file. Here is the syntax I used that produced
> the most output, although it doesn't look like what you would want.

No, this is good --- it'd be better with debug symbols, but there's
enough here to give us an idea where the problem is:

> gdb> stack
> #0 0xb7cc871b in strtouq () from /lib/libc.so.6
> #1 0xb7cc847f in __strtol_internal () from /lib/libc.so.6
> #2 0x08178404 in nodeToString ()
> #3 0x081794fa in parseNodeString ()
> #4 0x0817b2ff in nodeRead ()
> #5 0x0817b16f in nodeRead ()
> #6 0x081780ed in nodeToString ()
> #7 0x081795fa in parseNodeString ()
> #8 0x0817b2ff in nodeRead ()
> #9 0x0817b16f in nodeRead ()
> #10 0x0817b839 in stringToNode ()
> #11 0x082264fb in pg_get_constraintdef ()
> #12 0x0822667e in pg_get_indexdef ()
> #13 0x08144ff5 in ExecProject ()
> #14 0x081449ef in ExecProject ()
> #15 0x08153aa5 in ExecMergeJoin ()

What this looks like is that there's a mangled expression string in
the catalog entry for either a check constraint or an expression index.
The backend is dying while trying to decompile the expression for
pg_dump.

What I suggest doing is turning on log_statement = all and running
pg_dump, so that you can see the query it's dying on. That would give
you enough info to look into the catalog entry for the constraint or
index at hand. Worst-case, you can probably fix things by dropping and
recreating the constraint or index ... but before you do that, I'd urge
you to try to get as much info as you can about the nature of the
catalog corruption. If there's a bug here, as opposed to random
cosmic-ray damage, we can't fix it without more info.
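
For instance, something like this (data directory path as used elsewhere in this thread;
adjust for your installation):

    # in /var/lib/pgsql/data/postgresql.conf
    log_statement = 'all'

    # then have the postmaster re-read its configuration
    $ pg_ctl reload -D /var/lib/pgsql/data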

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 02:09:12
Message-ID: 820639.80330.qm@web31804.mail.mud.yahoo.com
Lists: pgsql-general

> What this looks like is that there's a mangled expression string in
> the catalog entry for either a check constraint or an expression index.
> The backend is dying while trying to decompile the expression for
> pg_dump.

You were right; it appears that my problem is with one of four functional date_trunc indexes that I
created for a timestamp column.

Here is a greatly simplified query that still succeeds:

mydb=# select indexrelid from pg_index where indrelid = '16737';
indexrelid
------------
604243
604251
604252
604253
604254
(5 rows)

However, when I add the following function call, it breaks:

mydb=# select pg_get_indexdef(indexrelid) from pg_index where indrelid = '16737';
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

> What I suggest doing is turning on log_statement = all and running
> pg_dump, so that you can see the query it's dying on. That would give
> you enough info to look into the catalog entry for the constraint or
> index at hand. Worst-case, you can probably fix things by dropping and
> recreating the constraint or index ... but before you do that, I'd urge
> you to try to get as much info as you can about the nature of the
> catalog corruption. If there's a bug here, as opposed to random
> cosmic-ray damage, we can't fix it without more info.

The logs aren't really providing any more information, since I was already using "all". However, I
turned log_min_messages from "warning" to "info".

%mydb LOG: statement:
select pg_get_indexdef(indexrelid)
from pg_index
where indrelid = '16737';
% LOG: server process (PID 16099) was terminated by signal 11
% LOG: terminating any other active server processes
% LOG: all server processes terminated;
reinitializing
% LOG: database system was interrupted at 2006-10-26 18:09:09 PDT
% LOG: checkpoint record is at 0/8E9161A8
% LOG: redo record is at 0/8E9161A8;
undo record is at 0/0;
shutdown TRUE
% LOG: next transaction ID: 174591; next OID: 621178
% LOG: next MultiXactId: 53; next MultiXactOffset: 115
% LOG: database system was not properly shut down;
automatic recovery in progress
%mydb FATAL: the database system is starting up
% LOG: record with zero length at 0/8E9161EC
% LOG: redo is not required
% LOG: database system is ready
% LOG: transaction ID wrap limit is 1073790580,
limited by database "postgres"

Thanks for the help Tom.

Regards,

Richard Broersma Jr.


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 02:46:32
Message-ID: 20061027024632.10511.qmail@web31811.mail.mud.yahoo.com
Lists: pgsql-general

> > Worst-case, you can probably fix things by dropping and
> > recreating the constraint or index ... but before you do that, I'd
> > urge
> > you to try to get as much info as you can about the nature of the
> > catalog corruption. If there's a bug here, as opposed to random
> > cosmic-ray damage, we can't fix it without more info.

I excluded the offending index with this query, which then succeeds:

select pg_get_indexdef(indexrelid)
from pg_index
where indexrelid <> 604251 -- this is the index with the problem
order by indexrelid;

How do I go about determining if the crash is caused by faulty hardware or a possible bug?

Regards,

Richard Broersma Jr.


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 04:00:30
Message-ID: 20061027040030.27878.qmail@web31803.mail.mud.yahoo.com
Lists: pgsql-general

> > > but before you do that, I'd urge
> > > you to try to get as much info as you can about the nature of the
> > > catalog corruption. If there's a bug here, as opposed to random
> > > cosmic-ray damage, we can't fix it without more info.
>
> I excluded the offending index with this query, which then succeeds:
>
> select pg_get_indexdef(indexrelid)
> from pg_index
> where indexrelid <> 604251 -- this is the index with the problem
> order by indexrelid;
>
> How do I go about determining if the crash is caused by faulty hardware or a possible bug?

I finally narrowed the search down to the offending column and row that cause the crash. The
following two queries work as long as I don't combine indexname = 'index_daily' and indexdef.

select *
from pg_indexes
where indexname <> 'index_daily';
....
returns all rows except the affected row
....

select schemaname,
tablename,
indexname,
tablespace -- returns all columns except the affected column
from pg_indexes
where indexname = 'index_daily';
schemaname | tablename | indexname | tablespace
------------+-----------+-------------+------------
public | process | index_daily |

and finally, to show that it crashes:

select indexdef
from pg_indexes
where indexname = 'index_daily';

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Regards,

Richard Broersma Jr.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 04:41:28
Message-ID: 29612.1161924088@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
> mydb=# select pg_get_indexdef(indexrelid) from pg_index where indrelid = '16737';
> server closed the connection unexpectedly

So what do you get from 'select * from pg_index where indrelid = 16737' ?
If that crashes, which individual columns can you select from the row?

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 13:25:05
Message-ID: 865044.52015.qm@web31814.mail.mud.yahoo.com
Lists: pgsql-general

> > mydb=# select pg_get_indexdef(indexrelid) from pg_index where indrelid = '16737';
> > server closed the connection unexpectedly
>
> So what do you get from 'select * from pg_index where indrelid = 16737' ?
> If that crashes, which individual columns can you select from the row?

I found the exact row and column that causes the server to crash.

http://archives.postgresql.org/pgsql-general/2006-10/msg01320.php

I believe the corrupted value should be:

CREATE INDEX index_daily ON process USING btree (date_trunc('day'::text, tstamp))

I tried to:

select * from process;
select * from process limit 1;
explain analyze select * from process;

but they also crash the server.

Regards,

Richard Broersma Jr.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 14:46:40
Message-ID: 5562.1161960400@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
>> So what do you get from 'select * from pg_index where indrelid = 16737' ?
>> If that crashes, which individual columns can you select from the row?

> I found the exact row and column that causes the server to crash.
> http://archives.postgresql.org/pgsql-general/2006-10/msg01320.php

That didn't answer my question.

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 14:55:32
Message-ID: 20061027145532.78470.qmail@web31812.mail.mud.yahoo.com
Lists: pgsql-general

> >> So what do you get from 'select * from pg_index where indrelid = 16737' ?
> >> If that crashes, which individual columns can you select from the row?
>
> > I found the exact row and column that causes the server to crash.
> > http://archives.postgresql.org/pgsql-general/2006-10/msg01320.php
>
> That didn't answer my question.

mydb=# \x
Expanded display is on.
mydb=# select * from pg_index where indrelid = 16737;
-[ RECORD 1 ]--+--------------------------
indexrelid | 604243
indrelid | 16737
indnatts | 1
indisunique | t
indisprimary | t
indisclustered | t
indkey | 1
indclass | 2039
indexprs |
indpred |
-[ RECORD 2 ]--+--------------------------
indexrelid | 604251
indrelid | 16737
indnatts | 1
indisunique | f
indisprimary | f
indisclustered | f
indkey | 0
indclass | 2039
indexprs | ({FUNCEXPR :funcid 2020 :funcresulttype 1114 :funcretset false :funcformat 0
:args ({CONST :consttype 25 :constlen -1 :constbyval false :constisnull false :constvalue 7 [ 7 0
0 0 100 97 121 ]} {VAR :varno 1 :varattno
indpred |
-[ RECORD 3 ]--+--------------------------
indexrelid | 604252
indrelid | 16737
indnatts | 1
indisunique | f
indisprimary | f
indisclustered | f
indkey | 0
indclass | 2039
indexprs | ({FUNCEXPR :funcid 2020 :funcresulttype 1114 :funcretset false :funcformat 0
:args ({CONST :consttype 25 :constlen -1 :constbyval false :constisnull false :constvalue 8 [ 8 0
0 0 104 111 117 114 ]} {VAR :varno 1 :varattno 1 :vartype 1114 :vartypmod -1 :varlevelsup 0
:varnoold 1 :varoattno 1})})
indpred |
-[ RECORD 4 ]--+--------------------------
indexrelid | 604253
indrelid | 16737
indnatts | 1
indisunique | f
indisprimary | f
indisclustered | f
indkey | 0
indclass | 2039
indexprs | ({FUNCEXPR :funcid 2020 :funcresulttype 1114 :funcretset false :funcformat 0
:args ({CONST :consttype 25 :constlen -1 :constbyval false :constisnull false :constvalue 10 [ 10
0 0 0 109 105 110 117 116 101 ]} {VAR :varno 1 :varattno 1 :vartype 1114 :vartypmod -1
:varlevelsup 0 :varnoold 1 :varoattno 1})})
indpred |

Sorry about that,

Regards,

Richard Broersma Jr.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 15:35:25
Message-ID: 6114.1161963325@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
> -[ RECORD 2 ]--+--------------------------
> indexrelid | 604251
> indrelid | 16737
> indnatts | 1
> indisunique | f
> indisprimary | f
> indisclustered | f
> indkey | 0
> indclass | 2039
> indexprs | ({FUNCEXPR :funcid 2020 :funcresulttype 1114 :funcretset false :funcformat 0
> :args ({CONST :consttype 25 :constlen -1 :constbyval false :constisnull false :constvalue 7 [ 7 0
> 0 0 100 97 121 ]} {VAR :varno 1 :varattno
> indpred |

OK, there's your problem: somehow the value of 'indexprs' for this entry
has gotten truncated. readfuncs.c is expecting another integer after
":varattno" and isn't checking for the possibility that pg_strtok()
returns NULL. It would be interesting to find out exactly what happened
to the value, but I'm not sure there's anything much we can do to find
it out. Are you interested in dumping out the page of pg_index that
contains this record, using pg_filedump?

In any case, my advice for recovering from the problem once you get
tired of investigating is to drop and recreate this index. (If you're
not sure which one this is, "select 604251::regclass" would tell you.)
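
Given the CREATE INDEX statement you posted earlier in the thread, the recovery would look
roughly like this (run in psql against mydb):

    SELECT 604251::regclass;   -- should report index_daily
    DROP INDEX index_daily;
    CREATE INDEX index_daily ON process
        USING btree (date_trunc('day'::text, tstamp));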

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 15:43:42
Message-ID: 36971.73065.qm@web31808.mail.mud.yahoo.com
Lists: pgsql-general

> Are you interested in dumping out the page of pg_index that
> contains this record, using pg_filedump?

Sure, only how do I use pg_filedump?

> In any case, my advice for recovering from the problem once you get
> tired of investigating is to drop and recreate this index. (If you're
> not sure which one this is, "select 604251::regclass" would tell you.)

I can hold off recreating the index for a while. But I will probably want to get a backup of
this database in the next couple of weeks.

Regards,

Richard Broersma Jr.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 15:52:36
Message-ID: 6330.1161964356@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
>> Are you interested in dumping out the page of pg_index that
>> contains this record, using pg_filedump?

> Sure, only how do I use pg_filedump?

Find out the ctid of the busted pg_index record; the first part of it is
the page number. Then

pg_filedump -i -f -R page_number pg_index_filename >resultfile

The filename is whatever "select relfilenode from pg_class where relname
= 'pg_index'" tells you, in whatever subdirectory of $PGDATA/base
"select oid from pg_database where datname = 'yourdb'" tells you.

Oh, if you don't have pg_filedump, get it from
http://sources.redhat.com/rhdb/
Be sure to get the one matching your PG version.

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 21:05:43
Message-ID: 20061027210543.77996.qmail@web31802.mail.mud.yahoo.com
Lists: pgsql-general

> >> Are you interested in dumping out the page of pg_index that
> >> contains this record, using pg_filedump?
>
> > Sure, only how do I use pg_filedump?
>
> Find out the ctid of the busted pg_index record; the first part of it is
> the page number. Then
>
> pg_filedump -i -f -R page_number pg_index_filename >resultfile
>
> The filename is whatever "select relfilenode from pg_class where relname
> = 'pg_index'" tells you, in whatever subdirectory of $PGDATA/base
> "select oid from pg_database where datname = 'yourdb'" tells you.

mydb=# select ctid from pg_index where indrelid = 16737 and indexrelid = 604251;
ctid
--------
(1,11)
(1 row)
mydb=# select relfilenode from pg_class where relname = 'pg_index';
relfilenode
-------------
2610
(1 row)
mydb=# select oid from pg_database where datname = 'mydb';
oid
-------
16393
(1 row)

postgres@db_server01 ~ $ ./pg_filedump -i -f -R 1 /var/lib/pgsql/data/base/16393/2610 >
./results.log

I've attached the results.log to the email.

Regards,

Richard Broersma Jr.

Attachment Content-Type Size
results.log application/octet-stream 23.3 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 21:20:31
Message-ID: 10314.1161984031@sss.pgh.pa.us
Lists: pgsql-general

Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
>> Are you interested in dumping out the page of pg_index that
>> contains this record, using pg_filedump?

> I've attached the results.log to the email.

Well, there's our smoking gun:

Item 11 -- Length: 398 Offset: 6844 (0x1abc) Flags: USED
XMIN: 940 CMIN: 0 XMAX: 0 CMAX|XVAC: 0
Block Id: 1 linp Index: 11 Attributes: 10 Size: 32
infomask: 0x0903 (HASNULL|HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
t_bits: [0]: 0xff [1]: 0x01

1abc: ac030000 00000000 00000000 00000000 ................
1acc: 00000100 0b000a00 030920ff 01000000 .......... .....
1adc: 5b380900 61410000 01000000 00000000 [8..aA..........
1aec: 1a000000 01000000 00000000 15000000 ................
1afc: 01000000 00000000 00000000 1c000000 ................
1b0c: 01000000 00000000 1a000000 01000000 ................
1b1c: 00000000 f7070000 26010000 287b4655 ........&...({FU
1b2c: 4e434558 5052203a 66756e63 69642032 NCEXPR :funcid 2
1b3c: 30323020 3a66756e 63726573 756c7474 020 :funcresultt
1b4c: 79706520 31313134 203a6675 6e637265 ype 1114 :funcre
1b5c: 74736574 2066616c 7365203a 66756e63 tset false :func
1b6c: 666f726d 61742030 203a6172 67732028 format 0 :args (
1b7c: 7b434f4e 5354203a 636f6e73 74747970 {CONST :consttyp
1b8c: 65203235 203a636f 6e73746c 656e202d e 25 :constlen -
1b9c: 31203a63 6f6e7374 62797661 6c206661 1 :constbyval fa
1bac: 6c736520 3a636f6e 73746973 6e756c6c lse :constisnull
1bbc: 2066616c 7365203a 636f6e73 7476616c false :constval
1bcc: 75652037 205b2037 20302030 20302031 ue 7 [ 7 0 0 0 1
1bdc: 30302039 37203132 31205d7d 207b5641 00 97 121 ]} {VA
1bec: 52203a76 61726e6f 2031203a 76617261 R :varno 1 :vara
1bfc: 74746e6f 00000000 00000000 1c000000 ttno............
^^^^^^^^^^^^^^^^^^^^^^^^^^
1c0c: 01000000 00000000 1a000000 6d6f6420 ............mod
^^^^^^^^^^^^^^^^^^^^^^^^^^
1c1c: 2d31203a 7661726c 6576656c 73757020 -1 :varlevelsup
1c2c: 30203a76 61726e6f 6f6c6420 31203a76 0 :varnoold 1 :v
1c3c: 61726f61 74746e6f 20317d29 7d29 aroattno 1})})

The underlined stuff has overwritten what should be expression dump
text.

Seeing that the corruption begins at page offset 1c00 --- that is,
exactly on a 1K boundary --- I'm a bit inclined to suspect a disk
I/O glitch. You might want to run some hardware diagnostics on
your machine.
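
Something along these lines would be a reasonable start (tool choices and device names are
only suggestions/illustrative, not specific to your box):

    $ smartctl -a /dev/sda      # SMART health and error counters (smartmontools)
    $ badblocks -sv /dev/sda    # read-only surface scan; best run while the device is idle
    # plus a memtest86+ pass from boot media to rule out bad RAM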

regards, tom lane


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_dumpall failing from possible corrupted shared memory
Date: 2006-10-27 21:28:08
Message-ID: 728090.94812.qm@web31813.mail.mud.yahoo.com
Lists: pgsql-general

> Richard Broersma Jr <rabroersma(at)yahoo(dot)com> writes:
> >> Are you interested in dumping out the page of pg_index that
> >> contains this record, using pg_filedump?
>
> > I've attached the results.log to the email.
>
> Well, there's our smoking gun:
> 1bfc: 74746e6f 00000000 00000000 1c000000 ttno............
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> 1c0c: 01000000 00000000 1a000000 6d6f6420 ............mod
> ^^^^^^^^^^^^^^^^^^^^^^^^^^

> The underlined stuff has overwritten what should be expression dump
> text.
>
> Seeing that the corruption begins at page offset 1c00 --- that is,
> exactly on a 1K boundary --- I'm a bit inclined to suspect a disk
> I/O glitch. You might want to run some hardware diagnostics on
> your machine.

Thanks, Tom, for helping me through this mental exercise. Yes, I would agree my server hardware is
suspect and should be replaced, considering the above evidence.

Regards,

Richard Broersma Jr.