Re: 9.2.3 crashes during archive recovery

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, michael(dot)paquier(at)gmail(dot)com, ants(at)cybertec(dot)at, simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.2.3 crashes during archive recovery
Date: 2013-03-07 10:41:42
Message-ID: 51386EE6.6080901@vmware.com
Lists: pgsql-hackers

On 07.03.2013 10:05, KONDO Mitsumasa wrote:
> (2013/03/06 16:50), Heikki Linnakangas wrote:
>> Yeah. That fix isn't right, though; XLogPageRead() is supposed to
>> return true on success, and false on error, and the patch makes it
>> return 'true' on error, if archive recovery was requested but we're
>> still in crash recovery. The real issue here is that I missed the two
>> "return NULL;"s in ReadRecord(), so the code that I put in the
>> next_record_is_invalid codepath isn't run if XLogPageRead() doesn't
>> find the file at all. Attached patch is the proper fix for this.
>>
> Thanks for creating the patch! I tested your patch in 9.2_STABLE, but it does
> not use promote command...
> When XLogPageRead() returns false, it means the end of the standby loop,
> crash recovery loop, and archive recovery loop.
> Your patch is not good for promoting Standby to Master. It does not come
> off standby loop.
>
> So I made a new patch which is based on Heikki's and Horiguchi's patches.

Ah, I see. I committed a slightly modified version of this.

>>> I also found a bug in latest 9.2_stable. It does not get latest timeline
>>> and
>>> recovery history file in archive recovery when master and standby
>>> timeline is different.
>>
>> Works for me.. Can you create a test script for that? Remember to set
>> "recovery_target_timeline='latest'".
> ...
> It can be reproduced in my test script, too.

I see the problem now, with that script. So what happens is that the
startup process first scans the timeline history files to choose the
recovery target timeline. For that scan, I temporarily set
InArchiveRecovery=true, in readRecoveryCommandFile. However, after
readRecoveryCommandFile returns, we then try to read the timeline
history file corresponding to the chosen recovery target timeline, but
InArchiveRecovery is no longer set, so we don't fetch the file from
archive, and return a "dummy" history, with just the target timeline in
it. That doesn't contain the older timeline, so you get an error at
recovery.

Fixed per your patch to check for ArchiveRecoveryRequested instead of
InArchiveRecovery, when reading timeline history files. This also makes
it unnecessary to temporarily set InArchiveRecovery=true, so removed that.
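
To illustrate the difference between the two checks, here is a rough toy
model in C. It is not the actual xlog.c code: read_history_model(), the
two wrapper functions, and the archive contents (timeline 2 descending
from timeline 1) are made up for illustration. Only the two flag names
come from the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TimeLineID;

/* Simplified stand-ins for the two startup-process flags discussed above. */
static bool ArchiveRecoveryRequested = false; /* recovery.conf asks for archive recovery */
static bool InArchiveRecovery = false;        /* archive recovery phase has been entered */

/* Hypothetical archive contents: timeline 2's history lists its parent, timeline 1. */
static const TimeLineID archived_history_tli2[] = {2, 1};

/*
 * Toy model of reading a timeline history (assumption, not the real function).
 * If the archive may be consulted, return the full ancestry; otherwise return
 * a "dummy" history holding only the target timeline.
 * Returns the number of entries written to out (at most 2).
 */
size_t
read_history_model(TimeLineID target, bool may_use_archive, TimeLineID *out)
{
    assert(out != NULL);

    if (may_use_archive && target == 2)
    {
        out[0] = archived_history_tli2[0];
        out[1] = archived_history_tli2[1];
        return 2;
    }

    out[0] = target;            /* dummy history: just the target timeline */
    return 1;
}

/* Before the fix: gated on the phase we are currently in. */
size_t
history_before_fix(TimeLineID target, TimeLineID *out)
{
    return read_history_model(target, InArchiveRecovery, out);
}

/* After the fix: gated on what recovery.conf requested. */
size_t
history_after_fix(TimeLineID target, TimeLineID *out)
{
    return read_history_model(target, ArchiveRecoveryRequested, out);
}
```

With ArchiveRecoveryRequested=true and InArchiveRecovery=false (the state
right after readRecoveryCommandFile returns), history_before_fix(2, ...)
yields only the target timeline, while history_after_fix(2, ...) also
yields the older timeline 1.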

Committed both fixes. Please confirm that this fixed the problem in your
test environment. Many thanks for the testing and the patches!

- Heikki
