Re: Switching timeline over streaming replication

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: Switching timeline over streaming replication
Date: 2012-12-23 14:37:04
Message-ID: CAHGQGwF9QMX9tu2hQByycNVVe1o6gdAgjVbbO52vmwj5pgLR=g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 21, 2012 at 1:48 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sat, Dec 15, 2012 at 9:36 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Sat, Dec 8, 2012 at 12:51 AM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>> On 06.12.2012 15:39, Amit Kapila wrote:
>>>>
>>>> On Thursday, December 06, 2012 12:53 AM Heikki Linnakangas wrote:
>>>>>
>>>>> On 05.12.2012 14:32, Amit Kapila wrote:
>>>>>>
>>>>>> On Tuesday, December 04, 2012 10:01 PM Heikki Linnakangas wrote:
>>>>>>>
>>>>>>> After some diversions to fix bugs and refactor existing code, I've
>>>>>>> committed a couple of small parts of this patch, which just add some
>>>>>>> sanity checks to notice incorrect PITR scenarios. Here's a new
>>>>>>> version of the main patch based on current HEAD.
>>>>>>
>>>>>>
>>>>>> After testing with the new patch, the following problems are observed.
>>>>>>
>>>>>> Defect - 1:
>>>>>>
>>>>>> 1. start primary A
>>>>>> 2. start standby B following A
>>>>>> 3. start cascade standby C following B.
>>>>>> 4. start another standby D following C.
>>>>>> 5. Promote standby B.
>>>>>> 6. After successful time line switch in cascade standby C & D,
>>>>>>    stop D.
>>>>>>
>>>>>> 7. Restart D, Startup is successful and connecting to standby C.
>>>>>> 8. Stop C.
>>>>>> 9. Restart C, startup is failing.
>>>>>
>>>>>
>>>>> Ok, the error I get in that scenario is:
>>>>>
>>>>> C 2012-12-05 19:55:43.840 EET 9283 FATAL: requested timeline 2 does not contain minimum recovery point 0/3023F08 on timeline 1
>>>>> C 2012-12-05 19:55:43.841 EET 9282 LOG: startup process (PID 9283) exited with exit code 1
>>>>> C 2012-12-05 19:55:43.841 EET 9282 LOG: aborting startup due to startup process failure
>>>>>
>>>>
>>>>>
>>>>> That mismatch causes the error. I'd like to fix this by always treating
>>>>> the checkpoint record to be part of the new timeline. That feels more
>>>>> correct. The most straightforward way to implement that would be to peek
>>>>> at the xlog record before updating replayEndRecPtr and replayEndTLI. If
>>>>> it's a checkpoint record that changes TLI, set replayEndTLI to the new
>>>>> timeline before calling the redo-function. But it's a bit of a
>>>>> modularity violation to peek into the record like that.
>>>>>
>>>>> Or we could just revert the sanity check at beginning of recovery that
>>>>> throws the "requested timeline 2 does not contain minimum recovery point
>>>>> 0/3023F08 on timeline 1" error. The error I added to redo of checkpoint
>>>>> record that says "unexpected timeline ID %u in checkpoint record, before
>>>>> reaching minimum recovery point %X/%X on timeline %u" checks basically
>>>>> the same thing, but at a later stage. However, the way
>>>>> minRecoveryPointTLI is updated still seems wrong to me, so I'd like to
>>>>> fix that.
>>>>>
>>>>> I'm thinking of something like the attached (with some more comments
>>>>> before committing). Thoughts?
>>>>
>>>>
>>>> This has fixed the problem reported.
>>>> However, I am not able to think will there be any problem if we remove
>>>> check "requested timeline 2 does not contain minimum recovery point
>>>> 0/3023F08 on timeline 1" at beginning of recovery and just update
>>>> replayEndTLI with ThisTimeLineID?
>>>
>>>
>>> Well, it seems wrong for the control file to contain a situation like this:
>>>
>>> pg_control version number: 932
>>> Catalog version number: 201211281
>>> Database system identifier: 5819228770976387006
>>> Database cluster state: shut down in recovery
>>> pg_control last modified: pe 7. joulukuuta 2012 17.39.57
>>> Latest checkpoint location: 0/3023EA8
>>> Prior checkpoint location: 0/2000060
>>> Latest checkpoint's REDO location: 0/3023EA8
>>> Latest checkpoint's REDO WAL file: 000000020000000000000003
>>> Latest checkpoint's TimeLineID: 2
>>> ...
>>> Time of latest checkpoint: pe 7. joulukuuta 2012 17.39.49
>>> Min recovery ending location: 0/3023F08
>>> Min recovery ending loc's timeline: 1
>>>
>>> Note the latest checkpoint location and its TimelineID, and compare them
>>> with the min recovery ending location. The min recovery ending location is
>>> ahead of latest checkpoint's location; the min recovery ending location
>>> actually points to the end of the checkpoint record. But how come the min
>>> recovery ending location's timeline is 1, while the checkpoint record's
>>> timeline is 2?
>>>
>>> Now maybe that would happen to work if we remove the sanity check, but it
>>> still seems horribly confusing. I'm afraid that discrepancy will come back
>>> to haunt us later if we leave it like that. So I'd like to fix that.
>>>
>>> Mulling over this for some more, I propose the attached patch. With the
>>> patch, we peek into the checkpoint record, and actually perform the timeline
>>> switch (by changing ThisTimeLineID) before replaying it. That way the
>>> checkpoint record is really considered to be on the new timeline for all
>>> purposes. At the moment, the only difference that makes in practice is that
>>> we set replayEndTLI, and thus minRecoveryPointTLI, to the new TLI, but it
>>> feels logically more correct to do it that way.
>>
>> This patch has already been included in HEAD. Right?
>>
>> I found another "requested timeline does not contain minimum recovery point"
>> error scenario in HEAD:
>>
>> 1. Set up the master 'M', one standby 'S1', and one cascade standby 'S2'.
>> 2. Shutdown the master 'M' and promote the standby 'S1', and wait for 'S2'
>> to reconnect to 'S1'.
>> 3. Set up new cascade standby 'S3' connecting to 'S2'.
>> Then 'S3' fails to start the recovery because of the following error:
>>
>> FATAL: requested timeline 2 does not contain minimum recovery
>> point 0/3000000 on timeline 1
>> LOG: startup process (PID 33104) exited with exit code 1
>> LOG: aborting startup due to startup process failure
>>
>> The result of pg_controldata of 'S3' is:
>>
>> Latest checkpoint location: 0/3000088
>> Prior checkpoint location: 0/2000060
>> Latest checkpoint's REDO location: 0/3000088
>> Latest checkpoint's REDO WAL file: 000000020000000000000003
>> Latest checkpoint's TimeLineID: 2
>> <snip>
>> Min recovery ending location: 0/3000000
>> Min recovery ending loc's timeline: 1
>> Backup start location: 0/0
>> Backup end location: 0/0
>>
>> The content of the timeline history file '00000002.history' is:
>>
>> 1 0/3000088 no recovery target specified
>
> I still could reproduce this problem. Attached is the shell script
> which reproduces the problem.

This problem happens when a new standby starts up from a backup
taken from another standby, and its recovery starts from the shutdown
checkpoint record that causes the timeline switch. In this case,
the timeline of the minimum recovery point can differ from that of
the latest checkpoint (i.e., the shutdown checkpoint). But the following
check in StartupXLOG() wrongly assumes that they are always the same,
so startup fails with the error above.

    /*
     * The min recovery point should be part of the requested timeline's
     * history, too.
     */
    if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint) &&
        tliOfPointInHistory(ControlFile->minRecoveryPoint - 1, expectedTLEs) !=
            ControlFile->minRecoveryPointTLI)
        ereport(FATAL,
                (errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u",
                        recoveryTargetTLI,
                        (uint32) (ControlFile->minRecoveryPoint >> 32),
                        (uint32) ControlFile->minRecoveryPoint,
                        ControlFile->minRecoveryPointTLI)));

Without this check, a later phase of recovery initializes the minimum
recovery point to the latest checkpoint location, as shown below. This
suggests to me that the timeline of the minimum recovery point should
be checked only after it has been initialized. So ISTM that the right
fix for the problem is to move the above check after the following
initialization. Thoughts?

    /* initialize minRecoveryPoint if not set yet */
    if (XLByteLT(ControlFile->minRecoveryPoint, checkPoint.redo))
    {
        ControlFile->minRecoveryPoint = checkPoint.redo;
        ControlFile->minRecoveryPointTLI = checkPoint.ThisTimeLineID;
    }

Regards,

--
Fujii Masao
