Re: Serious problem: media recovery fails after system or PostgreSQL crash

Lists: pgsql-hackers
From: "MauMau" <maumau307(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-06 14:41:39
Message-ID: A70482CA20CD460CB1053F2177CD7789@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Although this may have to be posted to pgsql-bugs or pgsql-general, let me
ask you here because the problem probably needs PostgreSQL's code fix.

[Problem]
I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem that
media recovery failed showing the following message:

FATAL: archive file "000000010000008000000028" has wrong size: 7340032
instead of 16777216

I'm using normal cp command to archive WAL files. That is:

archive_command = '/path/to/my_script.sh "%p" "/backup/archive_log/%f"'

<<my_script.sh>>
--------------------------------------------------
#!/bin/sh
some processing...
cp "$1" "$2"
other processing...
--------------------------------------------------

The media recovery was triggered by power failure. The disk drive that
stored $PGDATA failed after a power failure. So I replaced the failed disk,
and performed media recovery by creating recovery.conf and running pg_ctl
start. However, pg_ctl failed with the above error message.

[Cause]
The cause is clear from the message. PostgreSQL refuses to continue media
recovery when it finds an archived WAL file whose size is not 16 MB. The
relevant code is in src/backend/access/transam/xlog.c:

--------------------------------------------------
if (expectedSize > 0 && stat_buf.st_size != expectedSize)
{
int elevel;

/*
* If we find a partial file in standby mode, we assume it's
* because it's just being copied to the archive, and keep
* trying.
*
* Otherwise treat a wrong-sized file as FATAL to ensure the
* DBA would notice it, but is that too strong? We could try
* to plow ahead with a local copy of the file ... but the
* problem is that there probably isn't one, and we'd
* incorrectly conclude we've reached the end of WAL and we're
* done recovering ...
*/
if (StandbyMode && stat_buf.st_size < expectedSize)
elevel = DEBUG1;
else
elevel = FATAL;
ereport(elevel,
(errmsg("archive file \"%s\" has wrong size: %lu instead of %lu",
xlogfname,
(unsigned long) stat_buf.st_size,
(unsigned long) expectedSize)));
return false;
}
--------------------------------------------------

[How to fix]
Archived files can become smaller than their expected sizes for some
reasons:

1. The power fails while archive_command is copying files (as in my case).
2. Immediate shutdown (pg_ctl stop -mi) is performed while archive_command
is copying files. In this case, cp or equivalent copying command is
cancelled by SIGQUIT sent by postmaster.

Therefore, I think postgres must continue recovery by fetching files from
pg_xlog/ when it encounters a partially filled archive files. In addition,
it may be necessary to remove the partially filled archived files, because
they might prevent media recovery in the future (is this true?). I mean we
need the following fix. What do you think?

--------------------------------------------------
if (expectedSize > 0 && stat_buf.st_size != expectedSize)
{
int elevel;
...
if (StandbyMode && stat_buf.st_size < expectedSize)
elevel = DEBUG1;
else
{
elevel = LOG;
unlink(xlogpath);
}
ereport(elevel,
(errmsg("archive file \"%s\" has wrong size: %lu instead of %lu",
xlogfname,
(unsigned long) stat_buf.st_size,
(unsigned long) expectedSize)));
return false;
}
--------------------------------------------------

I've heard that the next minor release is scheduled during this weekend. I
really wish this problem will be fixed in that release. If you wish, I'll
post the patch tomorrow or the next day. Could you include the fix in the
weekend release?

Regards
MauMau


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "MauMau" <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-06 17:33:23
Message-ID: 5696.1354815203@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"MauMau" <maumau307(at)gmail(dot)com> writes:
> I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem that
> media recovery failed showing the following message:
> FATAL: archive file "000000010000008000000028" has wrong size: 7340032
> instead of 16777216

Well, that's unfortunate, but it's not clear that automatic recovery is
possible. The only way out of it would be if an undamaged copy of the
segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
not even be trying to fetch from the archive if we had a local copy.

> Therefore, I think postgres must continue recovery by fetching files from
> pg_xlog/ when it encounters a partially filled archive files. In addition,
> it may be necessary to remove the partially filled archived files, because
> they might prevent media recovery in the future (is this true?). I mean we
> need the following fix. What do you think?

I think having PG automatically destroy archive files is bordering on
insane. It might be reasonable for the archiving process to do
something like this, if it has a full-size copy of the file available
to replace the damaged copy with. But otherwise you're just throwing
away what's probably the only copy of useful data.

> I've heard that the next minor release is scheduled during this weekend. I
> really wish this problem will be fixed in that release. If you wish, I'll
> post the patch tomorrow or the next day. Could you include the fix in the
> weekend release?

Even if this were a good and uncontroversial idea, I'm afraid you're
several days too late. The release is today, not next week.

regards, tom lane


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-06 22:45:14
Message-ID: 972E1E3640674093B1CF8C329D03F80F@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Well, that's unfortunate, but it's not clear that automatic recovery is
> possible. The only way out of it would be if an undamaged copy of the
> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
> not even be trying to fetch from the archive if we had a local copy.

No, PG will try to fetch the WAL file from pg_xlog when it cannot get it
from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains the
following description:

http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL

WAL segments that cannot be found in the archive will be sought in pg_xlog/;
this allows use of recent un-archived segments. However, segments that are
available from the archive will be used in preference to files in pg_xlog/.

So, continuing recovery by changing the emode to LOG would work. What do
you think about this fix?

> I think having PG automatically destroy archive files is bordering on
> insane.

I agree. Before that, PG cannot know the archive location, so PG cannot
delete the partially filled archive files.

Regards
MauMau


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-06 23:52:12
Message-ID: 50C12FAC.1010503@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 6.12.2012 23:45, MauMau wrote:
> From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> Well, that's unfortunate, but it's not clear that automatic recovery is
>> possible. The only way out of it would be if an undamaged copy of the
>> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
>> not even be trying to fetch from the archive if we had a local copy.
>
> No, PG will try to fetch the WAL file from pg_xlog when it cannot get it
> from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains
> the following description:
>
> http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL
>
>
> WAL segments that cannot be found in the archive will be sought in
> pg_xlog/; this allows use of recent un-archived segments. However,
> segments that are available from the archive will be used in preference
> to files in pg_xlog/.

So why don't you use an archive command that does not create such
incomplete files? I mean something like this:

archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'

Until the file is renamed, it's considered 'incomplete'.

Tomas


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-07 00:16:54
Message-ID: CAAZKuFY2vYRb=-CzGKPKoqSApwyhWqzg+Hs30Sbk-ueq-+zieA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 6, 2012 at 9:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "MauMau" <maumau307(at)gmail(dot)com> writes:
>> I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem that
>> media recovery failed showing the following message:
>> FATAL: archive file "000000010000008000000028" has wrong size: 7340032
>> instead of 16777216
>
> Well, that's unfortunate, but it's not clear that automatic recovery is
> possible. The only way out of it would be if an undamaged copy of the
> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
> not even be trying to fetch from the archive if we had a local copy.

I'm inclined to agree with this: I've had a case much like the
original poster (too-short WAL segments because of media issues),
except in my case the archiver had archived a bogus copy of the data
also (short length and all), so our attempt to recover from archives
on a brand new system was met with obscure failure[0]. And, rather
interestingly, the WAL disk was able to "write" bogusly without error
for many minutes, which made for a fairly exotic recovery based on
pg_resetxlog: I grabbed quite a few minutes of WAL of various sub-16MB
sizes to spot check the situation.

It never occurred to me there was a way to really fix this that still
involves the archiver reading from a file system: what can one do when
one no longer trusts read() to get what was write()d?

[0]: Postgres wasn't very good about reporting the failure: in the
case bogus files have been restored from archives, it seems to just
bounce through timelines it knows about searching for a WAL it likes,
without any real error messaging like got "corrupt wal from archive".
That could probably be fixed.

--
fdr


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-07 14:21:32
Message-ID: A78791C96DD04D6CBC81A1F44C9E15CE@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tomas wrote:
> So why don't you use an archive command that does not create such
> incomplete files? I mean something like this:
>
> archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'
>
> Until the file is renamed, it's considered 'incomplete'.

That is a good idea. This would surely solve my problem. Don't you think
this must be described in PG manual?

Now there are two solutions given. I think it would be better to implement
both of these.

1. Tomas's idea
Instruct the method like below in the PG manual.

archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'

[Merits]
The incomplete file will not eventually remain. The successful archival
erases the incomplete file created during the previous interrupted archival.

[Demerits]
The administrator or developer of existing systems or products using
PostgreSQL will not be likely to notice this change in the PG manual. So
they might face the media recovery failure.

2. My idea
PG continues media recovery when it encounters a partially filled WAL
segment file. It ereport(LOG)'s the unexpected size, and try to fetch the
WAL file from pg_xlog/. This is logically natural. If the archived WAL
file is smaller than the expected size, it (almost always) means that
archive processing was interrupted and the unarchived WAL file is in
pg_xlog/. So PG should instead the same WAL file from pg_xlog/ and continue
recovery. PG will then try to fetch the subsequent WAL files from archive
again, but will hardly find those files there and fetch them from pg_xlog/.
This behavior complies with the description in the PG manual.

I mean just changing FATAL to LOG in the code in RestoreArchivedFile():

--------------------------------------------------
if (StandbyMode && stat_buf.st_size < expectedSize)
elevel = DEBUG1;
else
elevel = FATAL;
ereport(elevel,
(errmsg("archive file \"%s\" has wrong size: %lu instead of %lu",
xlogfname,
(unsigned long) stat_buf.st_size,
(unsigned long) expectedSize)));
return false;
--------------------------------------------------

[Merits]
The administrator or developer of existing systems or products using
PostgreSQL will benefit from this fix.

[Demerits]
The incomplete archive file(s) might be left forever unless the DBA delete
them. That occurs if pg_xlog/ is lost and the incomplete archive files will
not be overwritten.

Could you give me opinions what to do? I'm willing to submit these fixes.

Regards
MauMau


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-08 02:08:08
Message-ID: CAMkU=1zr9UM=OQOMTrYo6qHbZRf57EnukMpp0PeKj-71XUf8+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 6, 2012 at 3:52 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> Hi,
>
> On 6.12.2012 23:45, MauMau wrote:
>> From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>>> Well, that's unfortunate, but it's not clear that automatic recovery is
>>> possible. The only way out of it would be if an undamaged copy of the
>>> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
>>> not even be trying to fetch from the archive if we had a local copy.
>>
>> No, PG will try to fetch the WAL file from pg_xlog when it cannot get it
>> from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains
>> the following description:
>>
>> http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL
>>
>>
>> WAL segments that cannot be found in the archive will be sought in
>> pg_xlog/; this allows use of recent un-archived segments. However,
>> segments that are available from the archive will be used in preference
>> to files in pg_xlog/.
>
> So why don't you use an archive command that does not create such
> incomplete files? I mean something like this:
>
> archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'
>
> Until the file is renamed, it's considered 'incomplete'.

Wouldn't having the incomplete file be preferable over having none of it at all?

It seems to me you need considerable expertise to figure out how to do
optimal recovery (i.e. losing the least transactions) in this
situation, and that that expertise cannot be automated. Do you trust
a partial file from a good hard drive, or a complete file from a
partially melted pg_xlog?

Cheers,

Jeff


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-16 16:38:09
Message-ID: 50CDF8F1.6040808@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8.12.2012 03:08, Jeff Janes wrote:
> On Thu, Dec 6, 2012 at 3:52 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> Hi,
>>
>> On 6.12.2012 23:45, MauMau wrote:
>>> From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>>>> Well, that's unfortunate, but it's not clear that automatic recovery is
>>>> possible. The only way out of it would be if an undamaged copy of the
>>>> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
>>>> not even be trying to fetch from the archive if we had a local copy.
>>>
>>> No, PG will try to fetch the WAL file from pg_xlog when it cannot get it
>>> from archive. XLogFileReadAnyTLI() does that. Also, PG manual contains
>>> the following description:
>>>
>>> http://www.postgresql.org/docs/9.1/static/continuous-archiving.html#BACKUP-ARCHIVING-WAL
>>>
>>>
>>> WAL segments that cannot be found in the archive will be sought in
>>> pg_xlog/; this allows use of recent un-archived segments. However,
>>> segments that are available from the archive will be used in preference
>>> to files in pg_xlog/.
>>
>> So why don't you use an archive command that does not create such
>> incomplete files? I mean something like this:
>>
>> archive_command = 'cp %p /arch/%f.tmp && mv /arch/%f.tmp /arch/%f'
>>
>> Until the file is renamed, it's considered 'incomplete'.
>
> Wouldn't having the incomplete file be preferable over having none of it at all?
>
> It seems to me you need considerable expertise to figure out how to do
> optimal recovery (i.e. losing the least transactions) in this
> situation, and that that expertise cannot be automated. Do you trust
> a partial file from a good hard drive, or a complete file from a
> partially melted pg_xlog?

It clearly is a rather complex issue, no doubt about that. And yes,
reliability of the devices with pg_xlog on them is an important detail.
Alghough if the WAL is not written in a reliable way, you're hosed
anyway I guess.

The recommended archive command is based on the assumption that the
local pg_xlog is intact (e.g. because it's located on a reliable RAID1
array), which seems to be the assumption of the OP too.

In my opinion it's more likely to meet an incomplete copy of WAL in the
archive than a corrupted local WAL. And if it really is corrupted, it
would be identified during replay.

Tomas


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-17 21:30:29
Message-ID: CAMkU=1yf5+vrO7gmBOsLLykKuDnN5naN=yPGtEs93NmJ7uhefg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 16, 2012 at 8:38 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> On 8.12.2012 03:08, Jeff Janes wrote:
>>
>> It seems to me you need considerable expertise to figure out how to do
>> optimal recovery (i.e. losing the least transactions) in this
>> situation, and that that expertise cannot be automated. Do you trust
>> a partial file from a good hard drive, or a complete file from a
>> partially melted pg_xlog?
>
> It clearly is a rather complex issue, no doubt about that. And yes,
> reliability of the devices with pg_xlog on them is an important detail.
> Alghough if the WAL is not written in a reliable way, you're hosed
> anyway I guess.
>
> The recommended archive command is based on the assumption that the
> local pg_xlog is intact (e.g. because it's located on a reliable RAID1
> array), which seems to be the assumption of the OP too.
>
> In my opinion it's more likely to meet an incomplete copy of WAL in the
> archive than a corrupted local WAL. And if it really is corrupted, it
> would be identified during replay.

Wouldn't the way it would be identified be for it to fail a checksum,
assume it was garbage left over from the previous WAL file which was
the process of being overwritten at the time of crash, and so
terminate recovery and open the database?

Assuming your goal is to recover all the transactions you possibly can
(rather than restore to a known point), I think you would want to try
recovery both ways and keep whichever one got the furthest.

Cheers,

Jeff