From: Joe Conway <mail(at)joeconway(dot)com>
To: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: production server down
Date: 2004-12-15 03:11:56
Message-ID: 41BFAB7C.5040108@joeconway.com
Lists: pgsql-hackers

I've got a down production server (will not restart) with the following
tail to its log file:

2004-12-13 15:05:52 LOG: recycled transaction log file "000001650000004C"
2004-12-13 15:26:01 LOG: recycled transaction log file "000001650000004D"
2004-12-13 16:39:55 LOG: database system was shut down at 2004-11-02
17:05:33 PST
2004-12-13 16:39:55 LOG: checkpoint record is at 0/9B0B8C
2004-12-13 16:39:55 LOG: redo record is at 0/9B0B8C; undo record is at
0/0; shutdown TRUE
2004-12-13 16:39:55 LOG: next transaction ID: 536; next OID: 17142
2004-12-13 16:39:55 LOG: database system is ready
2004-12-14 15:36:20 FATAL: IDENT authentication failed for user "colprod"
2004-12-14 15:36:58 FATAL: IDENT authentication failed for user "colprod"
2004-12-14 15:39:26 LOG: received smart shutdown request
2004-12-14 15:39:26 LOG: shutting down
2004-12-14 15:39:28 PANIC: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:28 LOG: shutdown process (PID 23202) was terminated by
signal 6
2004-12-14 15:39:39 LOG: database system shutdown was interrupted at
2004-12-14 15:39:26 PST
2004-12-14 15:39:39 LOG: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:39 LOG: invalid primary checkpoint record
2004-12-14 15:39:39 LOG: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:39 LOG: invalid secondary checkpoint record
2004-12-14 15:39:39 PANIC: could not locate a valid checkpoint record
2004-12-14 15:39:39 LOG: startup process (PID 23298) was terminated by
signal 6
2004-12-14 15:39:39 LOG: aborting startup due to startup process failure

This is a SuSE 9, 8-way Xeon IBM x445, with nfs mounted Network
Appliance for database storage, postgresql-7.4.5-36.4.

The server experienced a hang (as yet unexplained) yesterday and was
restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
network admin that there was a problem with the network card on restart,
so the nfs mount most probably disappeared and then reappeared
underneath a quiescent postgresql at some point between 2004-12-13
16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
latter).

Any help would be much appreciated. Is our only option pg_resetxlog?

Thanks,

Joe


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 03:42:50
Message-ID: 200412150342.iBF3got12800@candle.pha.pa.us
Lists: pgsql-hackers

Joe Conway wrote:
> This is a SuSE 9, 8-way Xeon IBM x445, with nfs mounted Network
> Appliance for database storage, postgresql-7.4.5-36.4.
>
> The server experienced a hang (as yet unexplained) yesterday and was
> restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
> network admin that there was a problem with the network card on restart,
> so the nfs mount most probably disappeared and then reappeared
> underneath a quiescent postgresql at some point between 2004-12-13
> 16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
> latter).

Well, my first reaction is that if the file system storage was not
always 100% reliable, then there is no way to know the data is correct
except by restoring from backup. The startup failure indicates that
there were surely storage problems in the past. There is no way to know
how far that corruption goes.

You can use pg_resetxlog to clear it out and look to see how accurate it
is, but there is no way to be sure. I would back up the file system
with the server down in case you want to do some more serious recovery
attempts later though.
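
(E.g., with the postmaster stopped, something as simple as

# tar cf /some/other/volume/pgdata-backup.tar -C /replica pgdata

is enough to preserve the current state for later forensics -- the backup
destination path here is just a placeholder.)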

The Freenode IRC channel can probably walk you through more details of
the recovery process.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 04:42:03
Message-ID: 26597.1103085723@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> I've got a down production server (will not restart) with the following
> tail to its log file:

Please show the output of pg_controldata, or a hex dump of pg_control
if pg_controldata fails.
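
(Roughly, adjusting the data directory path to yours:

# pg_controldata /replica/pgdata

or, if that errors out,

# od -x /replica/pgdata/global/pg_control

since 7.4 keeps pg_control under $PGDATA/global.)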

> The server experienced a hang (as yet unexplained) yesterday and was
> restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
> network admin that there was a problem with the network card on restart,
> so the nfs mount most probably disappeared and then reappeared
> underneath a quiescent postgresql at some point between 2004-12-13
> 16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
> latter).

I've always felt that running a database across NFS was a Bad Idea ;-)

> Any help would be much appreciated. Is our only option pg_resetxlog?

Possibly, but let's try to dig first. I suppose the DB is too large
to save an image aside for forensics later?

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 05:14:06
Message-ID: 41BFC81E.3050706@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Joe Conway <mail(at)joeconway(dot)com> writes:
>
>>I've got a down production server (will not restart) with the following
>>tail to its log file:
>
> Please show the output of pg_controldata, or a hex dump of pg_control
> if pg_controldata fails.

OK, will do shortly.

>
>>The server experienced a hang (as yet unexplained) yesterday and was
>>restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
>>network admin that there was a problem with the network card on restart,
>>so the nfs mount most probably disappeared and then reappeared
>>underneath a quiescent postgresql at some point between 2004-12-13
>>16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
>>latter).
>
> I've always felt that running a database across NFS was a Bad Idea ;-)

Yeah, I knew I had that coming :-)

>>Any help would be much appreciated. Is our only option pg_resetxlog?
>
> Possibly, but let's try to dig first. I suppose the DB is too large
> to save an image aside for forensics later?
>

Actually, although the database is about 400 GB, we do have room and are
in the process of saving an image now.

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 05:22:42
Message-ID: 41BFCA22.5040807@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Joe Conway <mail(at)joeconway(dot)com> writes:
>>I've got a down production server (will not restart) with the following
>>tail to its log file:
>
> Please show the output of pg_controldata, or a hex dump of pg_control
> if pg_controldata fails.
>

OK, here it is:

# pg_controldata /replica/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Dec 14 15:39:26 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 05:30:39
Message-ID: 26976.1103088639@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Tom Lane wrote:
>> Please show the output of pg_controldata, or a hex dump of pg_control
>> if pg_controldata fails.

> OK, here it is:

> ...
> pg_control last modified: Tue Dec 14 15:39:26 2004
> ...
> Time of latest checkpoint: Tue Nov 2 17:05:32 2004

[ blink... ] That seems like an unreasonable gap between checkpoints,
especially for a production server. Can you see an explanation?

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 05:50:02
Message-ID: 41BFD08A.5000501@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
>>...
>>pg_control last modified: Tue Dec 14 15:39:26 2004
>>...
>>Time of latest checkpoint: Tue Nov 2 17:05:32 2004
>
> [ blink... ] That seems like an unreasonable gap between checkpoints,
> especially for a production server. Can you see an explanation?

Hmmm, this is even more scary. We have two database clusters on this
server, one on /replica/pgdata, and one on /production/pgdata (ignore
the names -- /replica is actually the "production" instance at the moment).

# pg_controldata /replica/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Dec 14 15:39:26 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

# pg_controldata /production/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Nov 2 21:57:49 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

I have no idea how this happened, but those look too similar except for
the "last modified" date. The space used is quite what I'd expect:

# du -h --max-depth=1 /replica
403G /replica/pgdata

# du -h --max-depth=1 /production
201G /production/pgdata

The "/production/pgdata" cluster has not been in use since Nov 2. But
we've been loading data aggressively into "/replica/pgdata".

Any theories on how we screwed up?

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 06:10:21
Message-ID: 27366.1103091021@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Any theories on how we screwed up?

I hesitate to suggest this, but maybe a cron job blindly copying data
from point A to point B?

I'm not sure that that could entirely explain the facts. My
recollection of the xlog.c logic is that the pg_control file is read
into shared memory during postmaster boot, and after that it's
write-only: at checkpoint times we update the file image in shared
memory and then write it out to pg_control.

Offhand my bets would revolve around (a) multiple postmasters trying to
run the same PGDATA directory (we have interlocks to protect against
this, but I have no faith that they work against an NFS-mounted data
directory), or (b) you somehow wiped a PGDATA directory and restored it
from backup tapes underneath a running postmaster.

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 06:25:15
Message-ID: 41BFD8CB.6030902@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Joe Conway <mail(at)joeconway(dot)com> writes:
>> Any theories on how we screwed up?
>
> I hesitate to suggest this, but maybe a cron job blindly copying data
> from point A to point B?

Not likely, but I'll check.

> Offhand my bets would revolve around (a) multiple postmasters trying
> to run the same PGDATA directory (we have interlocks to protect
> against this, but I have no faith that they work against an
> NFS-mounted data directory)

This might be possible I suppose. I know we have two init scripts.
Perhaps there is an error in them that caused both postmasters to point
to the same place when the server was rebooted. I'll look them over.

> or (b) you somehow wiped a PGDATA directory and restored it from
> backup tapes underneath a running postmaster.

This seems highly unlikely because our *nix admin would have had to
deliberately do it, and I don't think he'd fail to tell me about
something like that. But all the same, I'll ask him tomorrow.

Assuming the only real problem here is the control data (long shot, I
know), and the actual database files and transaction logs are OK, is
there any reasonable way to reconstruct the correct control data? Or is
that the point at which you use pg_resetxlog?

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 06:32:43
Message-ID: 27529.1103092363@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Assuming the only real problem here is the control data (long shot, I
> know), and the actual database files and transaction logs are OK, is
> there any reasonable way to reconstruct the correct control data? Or is
> that the point at which you use pg_resetxlog?

Well, the problem is that if you can't trust pg_control you don't know
what you can trust.

My advice is to backup the $PGDATA tree (which you said was in
progress), then pg_resetxlog, then cross-check the hell out of the data
you see. Only if you can detect some data problems can we guess at
something else to do ...

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 07:07:51
Message-ID: 41BFE2C7.7070007@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> My advice is to backup the $PGDATA tree (which you said was in
> progress), then pg_resetxlog, then cross-check the hell out of the data
> you see. Only if you can detect some data problems can we guess at
> something else to do ...
>

OK. I plan to gather the usual suspects and try to get an accurate
picture of the chain of events first thing tomorrow. Then we'll likely
proceed as you suggest.

Thinking about your comments and reading xlog.c, it almost seems as
though the mount points were momentarily reversed between /replica and
/production. I.e. that the /production mount point was used near the
beginning of StartupXLOG() for ReadControlFile(), and the /replica mount
point was used at the end of StartupXLOG() for UpdateControlFile(). But
I have no idea how that could happen.

Thanks,

Joe


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 15:34:03
Message-ID: 20041215153403.GA5075@dcc.uchile.cl
Lists: pgsql-hackers

On Tue, Dec 14, 2004 at 09:22:42PM -0800, Joe Conway wrote:

> # pg_controldata /replica/pgdata

> Current log file ID: 0
> Next log file segment: 1
> Latest checkpoint location: 0/9B0B8C
> Prior checkpoint location: 0/9AA1B4
> Latest checkpoint's REDO location: 0/9B0B8C
> Latest checkpoint's UNDO location: 0/0
> Latest checkpoint's StartUpID: 12
> Latest checkpoint's NextXID: 536
> Latest checkpoint's NextOID: 17142

Isn't it strange that these values are so close to the values found in a
just-initdb'd cluster?

--
Alvaro Herrera (<alvherre[(at)]dcc(dot)uchile(dot)cl>)
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 16:56:19
Message-ID: 41C06CB3.6010107@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> My advice is to backup the $PGDATA tree (which you said was in
> progress), then pg_resetxlog, then cross-check the hell out of the data
> you see. Only if you can detect some data problems can we guess at
> something else to do ...
>

Before running pg_resetxlog, a couple of questions:

1. Since it appears that pg_control is suspect, should I force it to be
rebuilt, and if so, how?

2. At the end of GuessControlValues is this comment:
/*
* XXX eventually, should try to grovel through old XLOG to develop
* more accurate values for startupid, nextXID, and nextOID.
*/
What would be involved in doing this, and do you think it would be
worth it?

Thanks,

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 17:05:54
Message-ID: 2608.1103130354@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Before running pg_resetxlog, a couple of questions:

> 1. Since it appears that pg_control is suspect, should I force it to be
> rebuilt, and if so, how?

pg_resetxlog will rebuild it in any case. However it will re-use the
existing contents as much as it can (if you don't use any of the command
line options to override values). Given Alvaro's observation that the
existing file looks suspiciously close to a freshly-initdb'd one, I
don't think you want to trust the existing contents.

> 2. At the end of GuessControlValues is this comment:
> /*
> * XXX eventually, should try to grovel through old XLOG to develop
> * more accurate values for startupid, nextXID, and nextOID.
> */
> What would be involved in doing this, and do you think it would be
> worth it?

What if anything have you got in $PGDATA/pg_xlog?

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 17:12:21
Message-ID: 41C07075.6020508@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> pg_resetxlog will rebuild it in any case. However it will re-use the
> existing contents as much as it can (if you don't use any of the command
> line options to override values). Given Alvaro's observation that the
> existing file looks suspiciously close to a freshly-initdb'd one, I
> don't think you want to trust the existing contents.

I don't trust it at all. So does that imply that I should override next
transaction id and WAL starting address per the manpage?

>
> What if anything have you got in $PGDATA/pg_xlog?
>
# pwd
/replica/pgdata/pg_xlog
# ll
total 688836
drwx------ 2 postgres postgres 32768 Dec 13 15:47 .
drwx------ 6 postgres postgres 4096 Dec 14 17:45 ..
-rw------- 1 postgres postgres 16777216 Dec 13 16:02 000001650000004E
-rw------- 1 postgres postgres 16777216 Dec 13 06:42 000001650000004F
-rw------- 1 postgres postgres 16777216 Dec 13 06:55 0000016500000050
-rw------- 1 postgres postgres 16777216 Dec 13 07:21 0000016500000051
-rw------- 1 postgres postgres 16777216 Dec 13 07:41 0000016500000052
-rw------- 1 postgres postgres 16777216 Dec 13 07:57 0000016500000053
-rw------- 1 postgres postgres 16777216 Dec 13 08:00 0000016500000054
-rw------- 1 postgres postgres 16777216 Dec 13 08:04 0000016500000055
-rw------- 1 postgres postgres 16777216 Dec 13 08:09 0000016500000056
-rw------- 1 postgres postgres 16777216 Dec 13 08:13 0000016500000057
-rw------- 1 postgres postgres 16777216 Dec 13 08:26 0000016500000058
-rw------- 1 postgres postgres 16777216 Dec 13 08:42 0000016500000059
-rw------- 1 postgres postgres 16777216 Dec 13 09:09 000001650000005A
-rw------- 1 postgres postgres 16777216 Dec 13 09:23 000001650000005B
-rw------- 1 postgres postgres 16777216 Dec 13 09:40 000001650000005C
-rw------- 1 postgres postgres 16777216 Dec 13 09:51 000001650000005D
-rw------- 1 postgres postgres 16777216 Dec 13 09:58 000001650000005E
-rw------- 1 postgres postgres 16777216 Dec 13 10:03 000001650000005F
-rw------- 1 postgres postgres 16777216 Dec 13 10:09 0000016500000060
-rw------- 1 postgres postgres 16777216 Dec 13 10:24 0000016500000061
-rw------- 1 postgres postgres 16777216 Dec 13 10:37 0000016500000062
-rw------- 1 postgres postgres 16777216 Dec 13 10:56 0000016500000063
-rw------- 1 postgres postgres 16777216 Dec 13 11:11 0000016500000064
-rw------- 1 postgres postgres 16777216 Dec 13 11:38 0000016500000065
-rw------- 1 postgres postgres 16777216 Dec 13 11:52 0000016500000066
-rw------- 1 postgres postgres 16777216 Dec 13 11:56 0000016500000067
-rw------- 1 postgres postgres 16777216 Dec 13 12:04 0000016500000068
-rw------- 1 postgres postgres 16777216 Dec 13 12:07 0000016500000069
-rw------- 1 postgres postgres 16777216 Dec 13 12:17 000001650000006A
-rw------- 1 postgres postgres 16777216 Dec 13 12:29 000001650000006B
-rw------- 1 postgres postgres 16777216 Dec 13 12:52 000001650000006C
-rw------- 1 postgres postgres 16777216 Dec 13 13:15 000001650000006D
-rw------- 1 postgres postgres 16777216 Dec 13 13:36 000001650000006E
-rw------- 1 postgres postgres 16777216 Dec 13 13:51 000001650000006F
-rw------- 1 postgres postgres 16777216 Dec 13 13:59 0000016500000070
-rw------- 1 postgres postgres 16777216 Dec 13 14:06 0000016500000071
-rw------- 1 postgres postgres 16777216 Dec 13 14:10 0000016500000072
-rw------- 1 postgres postgres 16777216 Dec 13 14:15 0000016500000073
-rw------- 1 postgres postgres 16777216 Dec 13 14:37 0000016500000074
-rw------- 1 postgres postgres 16777216 Dec 13 14:51 0000016500000075
-rw------- 1 postgres postgres 16777216 Dec 13 15:17 0000016500000076
-rw------- 1 postgres postgres 16777216 Dec 13 15:39 0000016500000077

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 17:17:37
Message-ID: 2756.1103131057@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> I don't trust it at all. So does that imply that I should override next
> transaction id and WAL starting address per the manpage?

Yes, override everything there's a switch for. Also check that the
other values shown by pg_controldata look reasonable (the locale
settings are probably the only ones you might get burned on).
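
(With 7.4's pg_resetxlog that boils down to something along the lines of

# pg_resetxlog -f -x <nextxid> -o <nextoid> -l <fileid>,<seg> $PGDATA

once you've settled on values for the placeholders.)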

>> What if anything have you got in $PGDATA/pg_xlog?
>>
> -rw------- 1 postgres postgres 16777216 Dec 13 15:39 0000016500000077

Um. That's so far from the values shown in pg_control that it's not funny.

This is 7.4, right? I have a crude xlog dump tool that I'll send you
off-list. We should be able to identify the latest checkpoint in the
existing XLOG files, and that will give you something to work with.

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 17:24:57
Message-ID: 41C07369.6030405@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Joe Conway <mail(at)joeconway(dot)com> writes:
>> I don't trust it at all. So does that imply that I should override
>> next transaction id and WAL starting address per the manpage?
>
> Yes, override everything there's a switch for. Also check that the
> other values shown by pg_controldata look reasonable (the locale
> settings are probably the only ones you might get burned on).

OK

>>> What if anything have you got in $PGDATA/pg_xlog?
>>
>> -rw------- 1 postgres postgres 16777216 Dec 13 15:39
>> 0000016500000077
>
> Um. That's so far from the values shown in pg_control that it's not
> funny.
>
> This is 7.4, right?

Correct.

> I have a crude xlog dump tool that I'll send you off-list. We should
> be able to identify the latest checkpoint in the existing XLOG files,
> and that will give you something to work with.

Thanks,

Joe


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: production server down
Date: 2004-12-15 17:28:36
Message-ID: 87zn0fqxmj.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> > The server experienced a hang (as yet unexplained) yesterday and was
> > restarted at 2004-12-13 16:38:49 according to syslog. I'm told by the
> > network admin that there was a problem with the network card on restart,
> > so the nfs mount most probably disappeared and then reappeared
> > underneath a quiescent postgresql at some point between 2004-12-13
> > 16:39:55 and 2004-12-14 15:36:20 (but much closer to the former than the
> > latter).
>
> I've always felt that running a database across NFS was a Bad Idea ;-)

Well not that I disagree with that sentiment, but NFS was specifically
designed to handle this particular scenario. *UNLESS* you use the "soft"
option. As popular as it is, this is precisely the scenario where it causes
problems.

(The "intr" option as well, but I don't think that would be relevant for
postgres).

--
greg


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 19:41:02
Message-ID: 41C0934E.8000300@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Yes, override everything there's a switch for. Also check that the
> other values shown by pg_controldata look reasonable (the locale
> settings are probably the only ones you might get burned on).
>
>
>>>What if anything have you got in $PGDATA/pg_xlog?
>>
>>-rw------- 1 postgres postgres 16777216 Dec 13 15:39 0000016500000077
>
> Um. That's so far from the values shown in pg_control that it's not funny.
>
> This is 7.4, right? I have a crude xlog dump tool that I'll send you
> off-list. We should be able to identify the latest checkpoint in the
> existing XLOG files, and that will give you something to work with.
>

Just wanted to close the loop for the sake of the list archives. With
Tom's xlog dump tool I was able (with a bunch of his help off-list) to
identify the needed parameters for pg_resetxlog. Running pg_resetxlog
got us back a running database. We're now involved in checking the data.

Thank you to everyone for your help -- especially Tom!

Joe


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-15 21:16:48
Message-ID: 20041215211648.GA27182@winnie.fuhr.org
Lists: pgsql-hackers

On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote:

> Just wanted to close the loop for the sake of the list archives. With
> Tom's xlog dump tool I was able (with a bunch of his help off-list) to
> identify the needed parameters for pg_resetxlog. Running pg_resetxlog
> got us back a running database. We're now involved in checking the data.

Any chance you could write up a summary of the thread: what caused
the problem, how you diagnosed it, how you fixed it, and how to
avoid it? Might make a useful "lessons learned" document.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/


From: Joe Conway <mail(at)joeconway(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: production server down
Date: 2004-12-15 22:22:48
Message-ID: 41C0B938.9030006@joeconway.com
Lists: pgsql-hackers

Greg Stark wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>I've always felt that running a database across NFS was a Bad Idea ;-)
>
> Well not that I disagree with that sentiment, but NFS was specifically
> designed to handle this particular scenario. *UNLESS* you use the "soft"
> option. As popular as it is, this is precisely the scenario where it causes
> problems.
>
> (The "intr" option as well, but I don't think that would be relevant for
> postgres).

I checked; doesn't seem to be an issue:

# cat /etc/fstab
csddata7-vlan35:/vol/vol0/replica /replica nfs
proto=tcp,suid,rw,vers=3,proto=tcp,timeo=600,retrans=2,hard,fg,rsize=8192,wsize=8192
0 0

I'd be interested in any feedback on the settings. The connection
between server and NetApp is on a private vlan on a separate network
interface from the rest of the LAN. I'm not positive, but I think we
have jumbo frames (9K) enabled on that interface ... looks like we do:

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:09:6B:E6:33:B7
[...]
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1

Thanks,

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To: Michael Fuhr <mike(at)fuhr(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-18 22:28:51
Message-ID: 41C4AF23.9070406@joeconway.com
Lists: pgsql-hackers

Michael Fuhr wrote:
> On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote:
>
>>Just wanted to close the loop for the sake of the list archives. With
>>Tom's xlog dump tool I was able (with a bunch of his help off-list) to
>>identify the needed parameters for pg_resetxlog. Running pg_resetxlog
>>got us back a running database. We're now involved in checking the data.
>
> Any chance you could write up a summary of the thread: what caused
> the problem, how you diagnosed it, how you fixed it, and how to
> avoid it? Might make a useful "lessons learned" document.

Sorry for the delay -- been a busy week. See a summary below. Hope
someone finds this useful. Warning -- this is a bit long...

-----------
Background:
-----------
The server in question is an IBM x445, 8-way Xeon, 8 GB RAM. We're
running SuSE 9 with the postgresql-7.4.5-36.4 RPM. The database is just
over 400GB in size at this point, and resides on a dedicated NFS mounted
Network Appliance volume (~6 TB). The server has 2 network interfaces,
both gigabit ethernet. One interface (eth0) is dedicated to the NFS
mounted storage. It is on a private storage subsystem vlan, running with
"jumbo frames" (9K):

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:09:6B:E6:33:B7
[...]
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1

The data volume is mounted thus:

# cat /etc/fstab
csddata7-vlan35:/vol/vol0/replica /replica nfs
proto=tcp,suid,rw,vers=3,proto=tcp,timeo=600,retrans=2,hard,fg,rsize=8192,wsize=8192
0 0

We have been continuously and aggressively bulk loading parametric data
collected from our company's equipment in the field (i.e. in use at our
customers) for the past several weeks. This is part of a push to get
"caught up" with available data, after which we expect bulk loading to
take a few hours each evening. The server had been up since November 2,
2004.

On December 13 the server experienced a complete hang, requiring our
unix admin to go into the datacenter and physically cycle power. We
don't know the exact cause of that hang, but we have recently
experienced a similar hang on two similar servers (both IBM x445, both
running SuSE 8.x, one running Oracle 9i, the other an application server).

<aside>
In both of those cases we were advised to increase the size of our
swap partition -- for some reason either SuSE's installation defaults,
or the admin doing the installation (not sure which), had decided that
1GB swap was sufficient for machines with 8GB of RAM. Novell's support
advised to increase swap to >= physical RAM. We had made that change
on the two other servers, but not on the Postgres server because it
was busy loading data at the time. We've now made that change, applied
all IBM firmware updates, and are waiting to see if the problem
repeats.
</aside>

Upon server restart, there was instability noted with the network
interfaces, and so they were reset (not exactly sure how the unix admin
determined that, but that is what I was told). Unfortunately, the init.d
script was in the process of starting postgresql while the eth0
interface (and thus the NFS mount and PGDATA) was "unstable".

--------------------------------
The PostgreSQL Specific Problem:
--------------------------------
We decided to wait until the 14th before resuming data loading, because
the developer who wrote the loading scripts was due back in the office
that day after taking vacation. On the 14th, he noted that Postgres was
not running, and further, would not start. Here is the snippet from the
logs:

2004-12-13 15:05:52 LOG: recycled transaction log file "000001650000004C"
2004-12-13 15:26:01 LOG: recycled transaction log file "000001650000004D"
2004-12-13 16:39:55 LOG: database system was shut down at 2004-11-02
17:05:33 PST
2004-12-13 16:39:55 LOG: checkpoint record is at 0/9B0B8C
2004-12-13 16:39:55 LOG: redo record is at 0/9B0B8C; undo record is at
0/0; shutdown TRUE
2004-12-13 16:39:55 LOG: next transaction ID: 536; next OID: 17142
2004-12-13 16:39:55 LOG: database system is ready
2004-12-14 15:36:20 FATAL: IDENT authentication failed for user "colprod"
2004-12-14 15:36:58 FATAL: IDENT authentication failed for user "colprod"
2004-12-14 15:39:26 LOG: received smart shutdown request
2004-12-14 15:39:26 LOG: shutting down
2004-12-14 15:39:28 PANIC: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:28 LOG: shutdown process (PID 23202) was terminated by
signal 6
2004-12-14 15:39:39 LOG: database system shutdown was interrupted at
2004-12-14 15:39:26 PST
2004-12-14 15:39:39 LOG: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:39 LOG: invalid primary checkpoint record
2004-12-14 15:39:39 LOG: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory
2004-12-14 15:39:39 LOG: invalid secondary checkpoint record
2004-12-14 15:39:39 PANIC: could not locate a valid checkpoint record
2004-12-14 15:39:39 LOG: startup process (PID 23298) was terminated by
signal 6
2004-12-14 15:39:39 LOG: aborting startup due to startup process failure

Apparently, either because of the server hang, or because of the flakey
eth0 interface on reboot, pg_control had become "corrupt". However, it
was not corrupt in the sense that it contained impossibly invalid data.
In fact, as pointed out by Alvaro, it had values that all look close to
those one would find in a recently initdb'd pg_control file, except the
last modified date:

# pg_controldata /replica/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: shutting down
pg_control last modified: Tue Dec 14 15:39:26 2004
Current log file ID: 0
Next log file segment: 1
Latest checkpoint location: 0/9B0B8C
Prior checkpoint location: 0/9AA1B4
Latest checkpoint's REDO location: 0/9B0B8C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 12
Latest checkpoint's NextXID: 536
Latest checkpoint's NextOID: 17142
Time of latest checkpoint: Tue Nov 2 17:05:32 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

We still do not know exactly how the pg_control file came to look like
this. I suspect the NFS mount instability during server start somehow
caused it, but I can't see anything obvious in xlog.c that would support
that theory.

-----------
The Repair:
-----------
Based on Tom's advice, we set out to repair pg_control using
pg_resetxlog (but first thing, of course, we saved a copy of $PGDATA in
its present, non-working, state).

The manpage for pg_resetxlog gives a general idea of how it is used, and
a way to estimate the next transaction ID and WAL segment. It does not
have a way to recover the next OID, but points out that the value of next
OID isn't really all that critical. I was interested in whether we could do
better than estimates, based on this comment in pg_resetxlog.c:

/*
* XXX eventually, should try to grovel through old XLOG to develop
* more accurate values for startupid, nextXID, and nextOID.
*/

Based on the fact that we had an intact-looking set of files in
$PGDATA/pg_xlog, and that we were running 7.4, Tom offered to send a
crude xlog dump tool that he had written for his own use (aside: this
worked really well in my case; should it be part of the distribution?).

After trading a few off-list emails with Tom, I was able to find the
location of the last good checkpoint in the xlog files:

357/4EBFD094: prv 357/4EBFD04C; xprv 357/4EBFD04C; xid 3877222; RM 10
info 00 len 34
insert: tblSpace 17144 rel 22084 block 557795 off 89
357/4EBFD0D8: prv 357/4EBFD094; xprv 357/4EBFD094; xid 3877222; RM 11
info 00 len 38
357/4EBFD120: prv 357/4EBFD0D8; xprv 357/4EBFD0D8; xid 3877222; RM 10
info 00 len 34
insert: tblSpace 17144 rel 22084 block 557795 off 90
357/4EBFD164: prv 357/4EBFD120; xprv 357/4EBFD120; xid 3877222; RM 11
info 00 len 38
357/4EBFD1AC: prv 357/4EBFD164; xprv 0/00000000; xid 0; RM 0 info 10 len 32
checkpoint: redo 357/4EBFD1AC; undo 0/00000000; sui 58;
nextxid 3904963; nextoid 785429799; online at 2004-12-13 15:41:03 PST
ReadRecord: record with zero len at 357/4EBFD1EC
Unexpected page info flags 0001 at offset BFE000
Skipping unexpected continuation record at offset BFE000
357/4EBFE024: prv 357/24BFDFC8(?); xprv 357/24BFDFC8; xid 3877261; RM 10
info 00 len 45
insert: tblSpace 17144 rel 22114 block 2599 off 97
357/4EBFE074: prv 357/24BFE024(?); xprv 357/24BFE024; xid 3877261; RM 11
info 00 len 38
357/4EBFE0BC: prv 357/24BFE074(?); xprv 357/24BFE074; xid 3877261; RM 10
info 00 len 44
insert: tblSpace 17144 rel 22114 block 2599 off 98

On seeing the above, Tom said this:
"Yeah, this is exactly what we want: the "record with zero len" is the
actual end of WAL, and the complaints after that betray the fact that
the next page isn't in sequence. So it looks like you want to use
those nextxid and nextoid values, plus make sure that the starting
WAL filename is > 000001650000004E. I believe "-l 0x165,0x4F" should
handle that."

Based on that advice, here's what we ran:
pg_resetxlog -f -o 786000000 -x 0x3b95c3 -l 0x165,0x4f /replica/pgdata

We bumped the next OID up a bit just "for good measure", knowing that the
specific value is less critical than the others. The pg_resetxlog
command runs in about a second or two. Then we attempted to start
Postgres -- and it started up fine -- whew!
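
(To spell out where those switch values came from: -x 0x3b95c3 is just
3904963 in hex, the nextxid reported by the last checkpoint record in the
dump; -l 0x165,0x4f starts the new WAL in the segment after 0x165/0x4E,
the segment that held that checkpoint at 357/4EBFD1AC; and -o 786000000
is a round number comfortably above the dump's nextoid of 785429799.)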

Here's what the resulting pg_control file looked like:

# pg_controldata /replica/pgdata
pg_control version number: 72
Catalog version number: 200310211
Database cluster state: in production
pg_control last modified: Sat Dec 18 13:18:08 2004
Current log file ID: 369
Next log file segment: 94
Latest checkpoint location: 171/5D48F21C
Prior checkpoint location: 171/5AECF030
Latest checkpoint's REDO location: 171/5D48F21C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 14
Latest checkpoint's NextXID: 3955133
Latest checkpoint's NextOID: 797476992
Time of latest checkpoint: Sat Dec 18 13:18:06 2004
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

We then spent most of the next 24 hours reviewing the recovered
database. The bulk data loading process was well instrumented, so we
knew exactly which data should have been committed prior to the server
hang, and which files were in process (we had been doing 10 loads in
parallel) at the time of the hang. The results of the investigation
indicated complete recovery, with no missing or unwanted (i.e.
uncommitted records looking committed) data.

-------------------
Corrective Actions:
-------------------
As mentioned earlier, one action was to increase our swap partition to
10 GB, based on Novell's advice. We also updated to the latest and
greatest of SuSE 9 rpms, and IBM firmware. I believe we were already
running the latest Broadcom (network card) drivers, but I'm sure we
upgraded those also if they weren't. This will all hopefully address the
issue that caused the server hang in the first place. Won't know for
sure until many months go by without incident.

Second change was to run:
chkconfig postgresql off
This sets Postgres to *not* automatically restart on a server restart.
Our DBA pointed out that we do not allow Oracle to restart automatically
precisely because it is on NFS mounted storage. A manual restart
potentially requires a warm body in the middle of the night, but it
ensures that we can check that the server and the NFS mount are stable
before starting the database.
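
A manual start with that sanity check could look something like this
(rough, untested sketch; paths per our setup):

#!/bin/sh
# refuse to start postgres unless the NFS volume is actually mounted
PGDATA=/replica/pgdata
if ! grep -q " /replica nfs " /proc/mounts; then
    echo "/replica is not mounted -- not starting postgres" >&2
    exit 1
fi
if [ ! -f $PGDATA/PG_VERSION ]; then
    echo "$PGDATA does not look like a valid cluster" >&2
    exit 1
fi
su - postgres -c "pg_ctl -D $PGDATA -l $PGDATA/serverlog start"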

Third change we decided to make was to reduce the number of parallel
bulk data loads. We have 8 CPUs, so probably something less than 8 is a
good number to use. Of course I just now looked, and it appears that we
have resumed loading with 10 processes, so I guess I'll have to follow
up on that next Monday ;-).

Finally, we are running hyperthreaded, and I know others have reported
that performance is actually better when hyperthreading is disabled. No
firm decision has been made on that, but I also don't believe it is
related to the problem we experienced. Of course, I could be wrong.

-------------------------------

That's it -- any comments, advice, or thoughts are welcome.

Joe


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-18 22:42:29
Message-ID: 200412182242.iBIMgTm24674@candle.pha.pa.us
Lists: pgsql-hackers

Joe Conway wrote:
> We then spent most of the next 24 hours reviewing the recovered
> database. The bulk data loading process was well instrumented, so we
> knew exactly which data should have been committed prior to the server
> hang, and which files were in process (we had been doing 10 loads in
> parallel) at the time of the hang. The results of the investigation
> indicated complete recovery, with no missing or unwanted (i.e.
> uncommitted records looking committed) data.

Good analysis. Seems you were lucky in that only WAL was corrupt and
not the heap or index files.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-18 23:27:30
Message-ID: 20041218232730.GA22503@dcc.uchile.cl
Lists: pgsql-hackers

On Sat, Dec 18, 2004 at 02:28:51PM -0800, Joe Conway wrote:

Hi,

> Apparently, either because of the server hang, or because of the flakey
> eth0 interface on reboot, pg_control had become "corrupt". However, it
> was not corrupt in the sense that it contained impossibly invalid data.
> In fact, as pointed out by Alvaro, it had values that all look close to
> those one would find in a recently initdb'd pg_control file, except the
> last modified date:

I can't help remembering the fact that the init script executes an
initdb automatically if it finds an empty data directory (the ones I
know of at least -- does the one you are running?). Maybe what happened
was that it found the empty mount point, executed an initdb, and then
the NFS drive came online. Later, the pg_control file was sync'ed to
the "empty database" settings. It'd be interesting to know if the
mount point does have some files on it.

These values (from the corrupt pg_control file) are strange:

> pg_control last modified: Tue Dec 14 15:39:26 2004
> Time of latest checkpoint: Tue Nov 2 17:05:32 2004

Maybe the latest checkpoint date has some interesting bit pattern that
could explain it somehow.

--
Alvaro Herrera (<alvherre[(at)]dcc(dot)uchile(dot)cl>)
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hernández-Novich)


From: Joe Conway <mail(at)joeconway(dot)com>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-18 23:53:06
Message-ID: 41C4C2E2.7090401@joeconway.com
Lists: pgsql-hackers

Alvaro Herrera wrote:
> I can't help remembering the fact that the init script executes an
> initdb automatically if it finds an empty data directory (the ones I
> know of at least -- does the one you are running?). Maybe what happened
> was that it found the empty mount point, executed an initdb, and then
> the NFS drive came online. Later, the pg_control file was sync'ed to
> the "empty database" settings. It'd be interesting to know if the
> mount point does have some files on it.

Good point! I'll take a look at the first opportunity.

> These values (from the corrupt pg_control file) are strange:
>
>>pg_control last modified: Tue Dec 14 15:39:26 2004
>>Time of latest checkpoint: Tue Nov 2 17:05:32 2004
>
> Maybe the latest checkpoint date has some interesting bit pattern that
> could explain it somehow.
>

The last modified corresponds to just prior to the PANIC. See the logs:

2004-12-14 15:39:26 LOG: received smart shutdown request
2004-12-14 15:39:26 LOG: shutting down
2004-12-14 15:39:28 PANIC: could not open file
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No
such file or directory

The Tue Nov 2 17:05:32 2004 seems to be related to the *previous*
restart; from /var/log/messages:

8<----------------------------------
...
Nov 2 17:04:20 csdfds1 syslogd 1.4.1: restart.
...
Nov 2 17:05:22 csdfds1 su: pam_unix2: session started for user
postgres, service su

...
Nov 2 17:05:33 csdfds1 su: (to postgres) root on /dev/pts/5
Nov 2 17:05:33 csdfds1 su: pam_unix2: session started for user
postgres, service su
Nov 2 17:05:33 csdfds1 su: pam_unix2: session finished for user
postgres, service su
...
8<----------------------------------

Can you make any sense out of that?

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: Joe Conway <mail(at)joeconway(dot)com>, Michael Fuhr <mike(at)fuhr(dot)org>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-19 00:01:22
Message-ID: 7174.1103414482@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl> writes:
> These values (from the corrupt pg_control file) are strange:

>> pg_control last modified: Tue Dec 14 15:39:26 2004
>> Time of latest checkpoint: Tue Nov 2 17:05:32 2004

The "last modified" date doesn't prove a lot because it would have
been updated when we set the "state" to "shutting down", just before
the panic occurred when we noticed there wasn't any WAL segment file
where pg_control said there should be one. The "latest checkpoint"
is mighty interesting though.

I think Alvaro's idea that this copy of pg_control got created when the
NFS mount was offline is a real good theory. However, it would seem
that that was quite some time ago (Nov 2 if not earlier), which would
suggest that the mount instability problem has been around longer than
Joe realizes :-(

If the bogus copy is indeed hiding underneath the mount point, then the
sequence of events last week is easy to explain:
* system boots
* NFS mount takes awhile to come online
* Postgres starts and reads the bogus pg_control into memory;
then it just sits there since they didn't try to start any
data loading tasks right away
* eventually NFS mount comes online
* next day, admin decides to shut down Postgres
* Postgres changes last-mod date and state in its in-memory
pg_control, and writes it out, overwriting the "good" copy
on the NFS server
* Postgres then panics because there's no WAL file where
pg_control indicates the shutdown checkpoint WAL record
should go
* and now we're in the state Joe documented

So one thing I'd strongly suggest is stopping Postgres and dismounting
the NFS server to see what's under there. If there is a valid-looking
PGDATA directory under there, you definitely want to get rid of it to
reduce the risk of this happening again.
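
Something like (roughly, with the postmaster already stopped):

# umount /replica
# ls -la /replica/pgdata
# pg_controldata /replica/pgdata

should show pretty quickly whether a stray initdb'd tree is hiding under
the mount point.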

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Michael Fuhr <mike(at)fuhr(dot)org>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-19 00:12:31
Message-ID: 7244.1103415151@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> The Tue Nov 2 17:05:32 2004 seems to be related to the *previous*
> restart; from /var/log/messages:

> Nov 2 17:04:20 csdfds1 syslogd 1.4.1: restart.
> ...
> Nov 2 17:05:22 csdfds1 su: pam_unix2: session started for user
> postgres, service su

> ...
> Nov 2 17:05:33 csdfds1 su: (to postgres) root on /dev/pts/5
> Nov 2 17:05:33 csdfds1 su: pam_unix2: session started for user
> postgres, service su
> Nov 2 17:05:33 csdfds1 su: pam_unix2: session finished for user
> postgres, service su

I'm betting that the "su" at :33 is the invocation of the postmaster.
The fact that it took the script 11 seconds to get to that step is
suggestive to say the least. Are you using one of the scripts that
does an auto initdb if it doesn't see a valid PGDATA? 11 seconds might
be about right for that.

One problem with this theory is how come you didn't get screwed during
*that* boot cycle. It seems to require assuming that the NFS mount came
online just after the initdb finished (else initdb would have
overwritten the on-NFS pg_control) but before the regular postmaster
started (else this same scenario would have played out then). That's
not a very wide window.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: pg_resetxlog for 8.0 (was Re: production server down)
Date: 2004-12-19 00:31:21
Message-ID: 7337.1103416281@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> The manpage for pg_resetxlog gives some general idea how it is used, and
> a way to estimate the next transaction id and wal segment.

I had forgotten that that text was in there. It needs to be updated for
8.0 because WAL segment file names are now three-parters. I think
there's also a "MUST FIX FOR 8.0" to-do item here: there needs to be
a way to specify the timeline ID to use. In prior versions we just
allowed pg_resetxlog to set startup ID to 1 all the time, because it
wasn't really being used for anything. But as of 8.0 that's a
significant value and so you'd better be able to set it.

What I'm inclined to do is just widen the "-l" option to take three
numbers instead of 2. That will keep it in sync with what people will
see when they look at WAL file names. Any objections?

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Michael Fuhr <mike(at)fuhr(dot)org>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-19 02:51:56
Message-ID: 41C4ECCC.20300@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> I think Alvaro's idea that this copy of pg_control got created when the
> NFS mount was offline is a real good theory. However, it would seem
> that that was quite some time ago (Nov 2 if not earlier), which would
> suggest that the mount instability problem has been around longer than
> Joe realizes :-(

I'm starting to wonder if this has somehow happened once or even twice
before, each time the server was restarted. The timing of services
starting on boot might have been biting us all along.

> If the bogus copy is indeed hiding underneath the mount point, then the
> sequence of events last week is easy to explain:
> * system boots
> * NFS mount takes awhile to come online
> * Postgres starts and reads the bogus pg_control into memory;
> then it just sits there since they didn't try to start any
> data loading tasks right away
> * eventually NFS mount comes online
> * next day, admin decides to shut down Postgres
> * Postgres changes last-mod date and state in its in-memory
> pg_control, and writes it out, overwriting the "good" copy
> on the NFS server
> * Postgres then panics because there's no WAL file where
> pg_control indicates the shutdown checkpoint WAL record
> should go
> * and now we're in the state Joe documented
>
> So one thing I'd strongly suggest is stopping Postgres and dismounting
> the NFS server to see what's under there. If there is a valid-looking
> PGDATA directory under there, you definitely want to get rid of it to
> reduce the risk of this happening again.
>

Perhaps we should purposefully place a root-owned placeholder file there
-- that way Postgres would refuse to start at all in this scenario.

BTW, the init script is indeed the one which automatically does initdb:

[...]
case "$1" in
    start)
        touch $LOGFILE
        chown postgres:postgres $LOGFILE
        chmod 0600 $LOGFILE
        if [ ! -f $DATADIR/PG_VERSION ]; then
            echo -n "Initializing the PostgreSQL database at location ${DATADIR}"
            LANG_SYSCONFIG=/etc/sysconfig/language
            test -f "$LANG_SYSCONFIG" && . $LANG_SYSCONFIG
            LANG=${POSTGRES_LANG:-$RC_LANG}
            install -d -o postgres -g daemon -m 700 ${DATADIR} &&
                su - postgres -c "env -i LANG=$LANG initdb $DATADIR &> initlog" || rc_failed
[...]

Joe


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Michael Fuhr <mike(at)fuhr(dot)org>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-19 03:11:10
Message-ID: 41C4F14E.9050108@dunslane.net
Lists: pgsql-hackers

Joe Conway wrote:

>>
>>
>> So one thing I'd strongly suggest is stopping Postgres and dismounting
>> the NFS server to see what's under there. If there is a valid-looking
>> PGDATA directory under there, you definitely want to get rid of it to
>> reduce the risk of this happening again.
>>
>
> Perhaps we should purposefully place a root owned placeholder file
> there -- that way Postgres would refuse to start at all in this scenario.
>
> BTW, the init script is indeed the one which automatically does initdb:
>
>

ISTM that this should ideally be a sysconfig setting that is picked up
by the init script.

In the absence of that, in your case, certainly the root-owned
placeholder is a good idea - it seems nicer than disabling on-boot
startup altogether if you can avoid that.

cheers

andrew


From: Joe Conway <mail(at)joeconway(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Michael Fuhr <mike(at)fuhr(dot)org>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-19 03:34:02
Message-ID: 41C4F6AA.9030002@joeconway.com
Lists: pgsql-hackers

Andrew Dunstan wrote:
> In the absence of that, in your case, certainly the root-owned
> placeholder is a good idea - it seems nicer than disabling on-boot
> startup altogether if you can avoid that.

I'm pretty well convinced at this point that a start-on-boot init script
is inappropriate when working with NFS-attached storage. We really do
need to be sure the mount is OK before starting the database. And we
have the watchdogs in place, and people on call at all times,
specifically to deal with "service down" events.

And while we could always modify the init script to rip out the
"helpful" initdb part, I think the under-the-mount safety file is a
must, unless of course someone thinks of an alternate scenario where it
will do more harm than good ;-).

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Michael Fuhr <mike(at)fuhr(dot)org>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: production server down
Date: 2004-12-27 18:08:36
Message-ID: 41D04FA4.7010402@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Are you using one of the scripts that
> does an auto initdb if it doesn't see a valid PGDATA? 11 seconds might
> be about right for that.
>
> One problem with this theory is how come you didn't get screwed during
> *that* boot cycle. It seems to require assuming that the NFS mount came
> online just after the initdb finished (else initdb would have
> overwritten the on-NFS pg_control) but before the regular postmaster
> started (else this same scenario would have played out then). That's
> not a very wide window.

[followup]
We've now had a chance to bring Postgres down and check under the mount
point. There *is* indeed a newly initdb'd cluster under there. FWIW the
control file is corrupt:

# pg_controldata /home/jconway/pgsql/fds/replica/pgdata
WARNING: Calculated CRC checksum does not match value stored in file.
Either the file is corrupt, or it has a different layout than this program
is expecting. The results below are untrustworthy.

pg_control version number: 72
Catalog version number: 200310211
Database cluster state: in production
pg_control last modified: Sat Feb 6 22:28:16 2106
Current log file ID: 0
Next log file segment: 10161036
Latest checkpoint location: 0/9AA1B4
Prior checkpoint location: 0/9B0B8C
Latest checkpoint's REDO location: 0/0
Latest checkpoint's UNDO location: C/218
Latest checkpoint's StartUpID: 17142
Latest checkpoint's NextXID: 1099443932
Latest checkpoint's NextOID: 8192
Time of latest checkpoint: Wed Apr 8 07:05:36 6325
Database block size: 1
Blocks per segment of large relation: 128
Maximum length of identifiers: 67
Maximum number of function arguments: 0
Date/time type storage: floating-point numbers
Maximum length of locale name: 0
LC_COLLATE:
LC_CTYPE:

I have a tarred copy of the under-the-mount PGDATA if anyone is
interested in examining it.

BTW, there was another Postgres cluster on this same server which we had
not used since the November 2 reboot -- it was corrupt in pretty much
the same way and also had an initdb'd cluster under its mount.

So it looks like using an auto initdb startup script is a very bad idea
when using an NFS mounted PGDATA. We left the under-mount structure in
place and did "chown root:root" and "chmod 000" on it. And, as mentioned
in an earlier post, we now rely on the dba to start postgres manually
after a server restart.
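
(Concretely, with the NFS volume unmounted, that was roughly:

# chown root:root /replica/pgdata
# chmod 000 /replica/pgdata

so a future boot with the mount missing can neither initdb into that
directory nor start a postmaster against it.)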

Joe