Hot strandby fails to restart after pg_subtrans corruption (?)

From: Paul Guyot <paulguyot(at)ieee(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: Hot strandby fails to restart after pg_subtrans corruption (?)
Date: 2012-05-10 16:22:34
Message-ID: FCCF0B65-62EA-4E4C-9E99-10C47FC339A4@ieee.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

After a hardware reboot, a hot standby server fails to start.
The log mentions pg_subtrans files. While the machine got a hard reboot, underlying filesystem is ZFS which did not detect or report any corruption.

For the first restart after the reboot, the server (then 9.1.0) complained that a file did not exist.
May 10 14:16:49 citrinus postgres[1914]: [6-1] LOG: file "pg_subtrans/B7E5" doesn't exist, reading as zeroes

Considering this may be related to this thread :
http://archives.postgresql.org/pgsql-general/2011-12/msg00291.php

postgres was cleanly restarted during the recovery :
May 10 14:22:30 citrinus postgres[1489]: [1-1] LOG: received fast shutdown request
May 10 14:22:30 citrinus postgres[1916]: [1-1] LOG: shutting down
May 10 14:22:31 citrinus postgres[1916]: [2-1] LOG: database system is shut down

and then updated to 9.1.3.

Now, the server just fails later on the same file.

----
May 10 15:48:05 citrinus postgres[39203]: [1-1] LOG: database system was interrupted while in recovery at log time 2012-04-25 07:53:33 UTC
May 10 15:48:05 citrinus postgres[39203]: [1-2] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
May 10 15:48:05 citrinus postgres[39203]: [2-1] LOG: entering standby mode
May 10 15:48:07 citrinus postgres[39203]: [3-1] LOG: restored log file "00000001000006D600000017" from archive
May 10 15:48:10 citrinus postgres[39203]: [4-1] LOG: restored log file "00000001000006D500000068" from archive
May 10 15:48:10 citrinus postgres[39203]: [5-1] LOG: redo starts at 6D5/684B13E8
May 10 15:49:47 citrinus postgres[39203]: [6-1] LOG: restored log file "00000001000006D500000069" from archive
May 10 15:50:37 citrinus postgres[39203]: [7-1] LOG: restored log file "00000001000006D50000006A" from archive
May 10 15:51:19 citrinus postgres[39203]: [8-1] LOG: restored log file "00000001000006D50000006B" from archive
May 10 15:52:05 citrinus postgres[39203]: [9-1] LOG: restored log file "00000001000006D50000006C" from archive
May 10 15:53:06 citrinus postgres[39203]: [10-1] LOG: restored log file "00000001000006D50000006D" from archive
May 10 15:53:24 citrinus postgres[39203]: [11-1] FATAL: could not access status of transaction 3085299721
May 10 15:53:24 citrinus postgres[39203]: [11-2] DETAIL: Could not read from file "pg_subtrans/B7E5" at offset 245760: No error: 0.
May 10 15:53:24 citrinus postgres[39203]: [11-3] CONTEXT: xlog redo xid assignment xtop 3085293107: subxacts: 3085299449 3085299451 3085299462 3085299465 3085299469 3085299473 3085299478 3085299485 3085299487 3085299489 3085299491 3085299494 3085299497 3085299498 3085299543 3085299545 3085299547 3085299549 3085299551 3085299553 3085299555 3085299557 3085299558 3085299559 3085299560 3085299561 3085299562 3085299563 3085299564 3085299567 3085299568 3085299569 3085299570 3085299571 3085299573 3085299574 3085299576 3085299578 3085299580 3085299582 3085299594 3085299596 3085299607 3085299608 3085299611 3085299635 3085299646 3085299648 3085299650 3085299652 3085299654 3085299656 3085299658 3085299660 3085299663 3085299665 3085299667 3085299669 3085299671 3085299673 3085299681 3085299683 3085299687 3085299721
May 10 15:53:24 citrinus postgres[39202]: [1-1] LOG: startup process (PID 39203) exited with exit code 1
May 10 15:53:24 citrinus postgres[39202]: [2-1] LOG: terminating any other active server processes
----

File B7E5 has exactly 245760 bytes.

If I remove the B7E5 file as well as file B7F6 which was created during recovery, and retry, it fails later on like this:

----
May 10 16:15:53 citrinus postgres[41142]: [75-1] LOG: file "pg_subtrans/B7E6" doesn't exist, reading as zeroes
May 10 16:15:53 citrinus postgres[41142]: [75-2] CONTEXT: xlog redo xid assignment xtop 3085346439: subxacts: 3085364307 3085364352 3085364355 3085364358 3085364363 3085364404 3085364407 3085364409 3085364496 3085364518 3085364532 3085364568 3085364571 3085364611 3085364625 3085364660 3085364663 3085364668 3085364671 3085364715 3085364719 3085364727 3085364738 3085364740 3085364768 3085364770 3085364773 3085364792 3085364800 3085364803 3085364806 3085364817 3085364819 3085364829 3085364830 3085364832 3085364854 3085364874 3085364877 3085364885 3085364886 3085364889 3085364890 3085364891 3085364911 3085364914 3085364918 3085364920 3085364964 3085364988 3085364998 3085365001 3085365013 3085365014 3085365015 3085365016 3085365068 3085365069 3085365071 3085365106 3085365183 3085365284 3085365287 3085365288
May 10 16:16:10 citrinus postgres[41142]: [76-1] LOG: restored log file "00000001000006D500000077" from archive
May 10 16:16:13 citrinus postgres[41142]: [77-1] FATAL: could not access status of transaction 3085367299
May 10 16:16:13 citrinus postgres[41142]: [77-2] DETAIL: Could not read from file "pg_subtrans/B7E6" at offset 253952: No error: 0.
May 10 16:16:13 citrinus postgres[41142]: [77-3] CONTEXT: xlog redo xid assignment xtop 3085346439: subxacts: 3085366820 3085366821 3085366822 3085366823 3085366824 3085366825 3085366826 3085366827 3085366828 3085366829 3085366830 3085366831 3085366832 3085366833 3085366834 3085366835 3085366836 3085366837 3085366838 3085366860 3085366905 3085366919 3085366920 3085366949 3085366965 3085367040 3085367057 3085367059 3085367062 3085367076 3085367168 3085367170 3085367174 3085367177 3085367180 3085367182 3085367183 3085367184 3085367185 3085367202 3085367205 3085367208 3085367212 3085367215 3085367264 3085367266 3085367269 3085367272 3085367274 3085367277 3085367299 3085367301 3085367305 3085367309 3085367311 3085367331 3085367334 3085367337 3085367338 3085367339 3085367341 3085367343 3085367345 3085367351
May 10 16:16:13 citrinus postgres[41141]: [1-1] LOG: startup process (PID 41142) exited with exit code 1
May 10 16:16:13 citrinus postgres[41141]: [2-1] LOG: terminating any other active server processes
----

At this point, file B7E6 was created and is 8KB long.
-rw------- 1 pgsql wheel 8192 May 10 16:16 B7E6

What are the steps to recover the standby?

Paul
--
Semiocast http://semiocast.com/
Semioboard http://semioboard.com/
+33.183627948 - 20 rue Lacaze, 75014 Paris

Browse pgsql-general by date

  From Date Subject
Next Message Evan Carroll 2012-05-10 16:32:52 Custom Domain; migration from 8.4 to 9.1 and COLLATE
Previous Message Horaci Macias 2012-05-10 16:08:06 Re: vacuum, vacuum full and problems releasing disk space