Re: Switching timeline over streaming replication

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: 'PostgreSQL-development' <pgsql-hackers(at)postgreSQL(dot)org>, Thom Brown <thom(at)linux(dot)com>
Subject: Re: Switching timeline over streaming replication
Date: 2012-11-19 17:23:39
Message-ID: 50AA6B1B.7020501@vmware.com
Lists: pgsql-hackers

On 10.10.2012 17:54, Thom Brown wrote:
> Hmm... I get something different. When I promote standby B, standby
> C's log shows:
>
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: re-handshaking at position 0/4000000 on tli 1
> LOG: fetching timeline history file for timeline 2 from primary server
> LOG: walreceiver ended streaming and awaits new instructions
> LOG: new target timeline is 2
>
> Then when I stop then start standby C I get:
>
> FATAL: timeline history was not contiguous
> LOG: startup process (PID 22986) exited with exit code 1
> LOG: aborting startup due to startup process failure

Found & fixed this one. A paren was misplaced in the
tliOfPointInHistory() function.
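
For readers following along, here is a minimal sketch of the kind of
lookup such a function performs: given a WAL position, find which
timeline in the history contains it. This is not the code from the
patch. The HistoryEntry struct, the tli_of_point() name and the
"end == 0 means still open" convention are assumptions made for
illustration. It does show why parenthesization matters when a range
test mixes && and ||.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL position as a byte offset */
typedef uint32_t TimeLineID;

/*
 * Hypothetical history entry: timeline 'tli' covers [begin, end).
 * By assumption in this sketch, end == 0 means the timeline is still open.
 */
typedef struct
{
    TimeLineID  tli;
    XLogRecPtr  begin;
    XLogRecPtr  end;
} HistoryEntry;

/* Return the timeline that contains 'ptr', or 0 if no entry matches. */
static TimeLineID
tli_of_point(const HistoryEntry *history, size_t n, XLogRecPtr ptr)
{
    assert(history != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        const HistoryEntry *e = &history[i];

        /*
         * The parentheses around the '||' are essential. Without them,
         * '&&' binds tighter and the test becomes
         * (begin <= ptr && end == 0) || ptr < end, which accepts
         * positions that lie before the entry's start.
         */
        if (e->begin <= ptr && (e->end == 0 || ptr < e->end))
            return e->tli;
    }
    return 0;
}
```

With a two-entry history where timeline 1 ends at 0/4000000 and
timeline 2 starts there, position 0/30143A0 resolves to timeline 1 and
0/4000000 resolves to timeline 2.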

On 16.11.2012 16:01, Amit Kapila wrote:
> The following problems are observed while testing of the patch.
> Defect-1:
>
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B.
> 4. Promote standby B.
> 5. After successful time line switch in cascade standby C, stop C.
> 6. Restart C, startup is failing with the following error.
>
> LOG: database system was shut down in recovery at 2012-11-16
> 16:26:29 IST
> FATAL: requested timeline 2 does not contain minimum recovery point
> 0/30143A0 on timeline 1
> LOG: startup process (PID 415) exited with exit code 1
> LOG: aborting startup due to startup process failure
>
> The above defect is already discussed in the following link.
> http://archives.postgresql.org/message-id/00a801cda6f3$4aba27b0$e02e7710$@kapila(at)huawei(dot)com

Fixed now, sorry for neglecting this earlier. The problem was that if
the primary switched to a new timeline at position X, and the standby
followed that switch, on restart it would set minRecoveryPoint to X, and
the new

> Defect-2:
>
> 1. start primary A
> 2. start standby B following A
> 3. start cascade standby C following B with 'recovery_target_timeline'
> option in
> recovery.conf is disabled.
> 4. Promote standby B.
> 5. Cascade Standby C is not able to follow the new master B because of
> timeline difference.
> 6. Try to stop the cascade standby C (which is failing and the
> server is not stopping,
> observations are as WAL Receiver process is still running and
> clients are not allowing to connect).
>
> The defect-2 is happened only once in my test environment, I will try to
> reproduce it.

Found it. When restarting the streaming, I reused the WALRCV_STARTING
state. But if you then exited recovery, WalRcvRunning() would think that
the walreceiver was stuck starting up, because more than 10 seconds had
passed since it was launched and it was still in WALRCV_STARTING state,
so it put the walreceiver into WALRCV_STOPPED state. And the walreceiver
didn't expect to be put into STOPPED state after having already started
up successfully.

I added a new explicit WALRCV_RESTARTING state to handle that.
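
To make the failure mode concrete, here is a rough sketch of the state
logic described above. It is not the patch itself. The WalRcv struct,
the function names, and the WALRCV_STREAMING and WALRCV_WAITING values
are assumptions for illustration; only WALRCV_STARTING, WALRCV_STOPPED,
WALRCV_RESTARTING and the 10-second limit come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STARTUP_TIMEOUT_SECS 10     /* the 10-second limit mentioned above */

typedef enum
{
    WALRCV_STOPPED,
    WALRCV_STARTING,    /* launched, has not finished starting up yet */
    WALRCV_STREAMING,   /* assumed name: actively receiving WAL */
    WALRCV_WAITING,     /* assumed name: idle, awaiting instructions */
    WALRCV_RESTARTING   /* asked to resume streaming after being idle */
} WalRcvState;

/* Hypothetical shared state for this sketch. */
typedef struct
{
    WalRcvState state;
    int64_t     start_time;     /* launch time, in seconds */
} WalRcv;

/*
 * Sketch of a WalRcvRunning()-style check: a receiver that stays in
 * STARTING for too long is presumed dead and marked STOPPED.
 */
static bool
walrcv_running(WalRcv *w, int64_t now)
{
    assert(w != NULL);

    if (w->state == WALRCV_STARTING &&
        now - w->start_time > STARTUP_TIMEOUT_SECS)
        w->state = WALRCV_STOPPED;

    return w->state != WALRCV_STOPPED;
}

/*
 * Resume streaming from idle. Using RESTARTING instead of STARTING keeps
 * an already-started receiver out of reach of the startup timeout.
 */
static void
walrcv_request_restart(WalRcv *w)
{
    assert(w != NULL);

    if (w->state == WALRCV_WAITING)
        w->state = WALRCV_RESTARTING;
}
```

In this sketch, a receiver launched long ago that is asked to restart
stays "running" no matter how old its start_time is, whereas reusing
WALRCV_STARTING would trip the timeout and force it into
WALRCV_STOPPED.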

In addition to the above bug fixes, there are some small changes since
the last patch version:

* I changed the LOG messages printed in various stages a bit, hopefully
making it easier to follow what's happening. Feedback is welcome on when
and how we should log, and whether some error messages need clarification.

* The 'ps' display is updated when the walreceiver enters and exits idle
mode.

* Updated pg_controldata and pg_resetxlog to handle the new
minRecoveryPointTLI field I added to the control file.

* The startup process wakes up walsenders at the end of recovery, so that
cascading standbys are notified immediately when the timeline changes.
That removes some of the delay in the process.

- Heikki

Attachment Content-Type Size
streaming-tli-switch-7.patch.gz application/x-gzip 33.2 KB
