Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave
Date: 2013-01-22 00:25:05
Message-ID: CAB7nPqTGnonaydRDx2KQoLAt+AM_nMFqeR6inYZZAo8EeHKwfw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 22, 2013 at 9:06 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:

>
>
> On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> Hmm, so it's the same issue I thought I fixed yesterday. My patch only
>> fixed it for the case that the timeline switch is in the first page of the
>> segment. When it's not, you still get two calls for a WAL record, first one
>> for the first page in the segment, to verify that, and then the page that
>> actually contains the record. The first call leads XLogPageRead to think it
>> needs to read from the old timeline.
>>
>> We didn't have this problem before the xlogreader refactoring because
>> XLogPageRead() was always called with the RecPtr of the record, even when
>> we actually read the segment header from the file first. We'll have to
>> somehow get that same information, the RecPtr of the record we're actually
>> interested in, to XLogPageRead(). We could add a new argument to the
>> callback for that, or we could keep xlogreader.c as it is and pass it
>> through from ReadRecord to XLogPageRead() in the private struct.
>>
>> An explicit argument to the callback is probably best. That's
>> straightforward, and it might be useful for the callback to know the actual
>> WAL position that xlogreader.c is interested in anyway. See attached.
>>
> Just to let you know that I am still getting the error even after commit
> 2ff6555.
> With the same scenario:
> 1) Start a master with 2 slaves
> 2) Kill/Stop slave
> 3) Promote slave 1, it switches to timeline 2
> Log on slave 1
>
> LOG: selected new timeline ID: 2
> 4) Reconnect slave 2 to slave 1, slave 2 remains stuck in timeline 1 even
> if recovery_target_timeline is set to latest
> Log on slave 1 at this moment:
> DEBUG: received replication command: IDENTIFY_SYSTEM
> DEBUG: received replication command: TIMELINE_HISTORY 2
> DEBUG: received replication command: START_REPLICATION 0/5000000 TIMELINE 1
> Slave 1 receives the command to start replication on timeline 1, while it
> is already on timeline 2.
> Log on slave 2 at this moment:
> LOG: restarted WAL streaming at 0/5000000 on timeline 1
>
> LOG: replication terminated by primary server
> DETAIL: End of WAL reached on timeline 1 at 0/5014200
> DEBUG: walreceiver ended streaming and awaits new instructions
>
> The timeline history file is the same for both nodes:
> $ cat 00000002.history
> 1 0/5014200 no recovery target specified
>
> I might be wrong, but shouldn't there be also an entry for timeline 2 in
> this file?
>
> Am I missing something?
>
Sorry, there is no problem after all.
I simply forgot to set recovery_target_timeline to 'latest' in
recovery.conf on slave 2.
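
For the record, the missing setting belongs in recovery.conf on slave 2. A
minimal sketch, assuming slave 1 listens on localhost port 5433 (the
connection values are placeholders, not the ones from my test setup):

```
# recovery.conf on slave 2 (sketch; host and port are hypothetical)
standby_mode = 'on'
primary_conninfo = 'host=localhost port=5433'
recovery_target_timeline = 'latest'
```

With this in place, slave 2 should follow the switch to timeline 2 instead of
requesting START_REPLICATION on timeline 1 again.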
--
Michael Paquier
http://michael.otacoo.com
