Testing Cascading Replication

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Testing Cascading Replication
Date: 2013-06-26 22:42:56
Message-ID: 51CB6E70.7040201@agliodbs.com
Lists: pgsql-hackers

Folks,

Wanted to give you the below testing emails from DHAVAL JAISWAL. He's
been testing 9.3's streaming-only cascading replication, and so far it
works as advertised. What he found in his tests was:

a) he could not remaster to a former replica which was behind the replica
he was trying to remaster

b) when servers were correctly caught up, remastering worked correctly

So, all good so far.

Text follows

======================

TEST 1: remastering failure due to picking the wrong replica

I have tested the cascading replication scenario below on the PostgreSQL 9.3
beta version.

A

B.....................E
C...D

1) A is the master,

B & E are pointing to A,

C & D are pointing to B.

Tested scenarios are as follows:

a) When (A) failed, we were able to promote B or E as the master. If we
promoted B, C & D continued to talk with B as usual. If we promoted E, I
changed recovery.conf on C & D, replacing the port and IP so that they point
to E. After restarting, C & D started to talk with E.

b) When (B) failed, I changed recovery.conf on C & D, replacing the port
and IP so that they point to E. After restarting, C & D started to talk
with E. In the end A is still the master, E points to A, and C & D point
to E.
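The recovery.conf used in these tests was not posted, so the following is only a minimal sketch of what repointing a standby might look like. The helper name render_recovery_conf and the address 192.0.2.15 are hypothetical; standby_mode, primary_conninfo and recovery_target_timeline are regular recovery.conf settings in 9.3, but whether the tests set recovery_target_timeline is an assumption.

```python
def render_recovery_conf(host: str, port: int, follow_latest_timeline: bool = True) -> str:
    """Build recovery.conf text that points a standby at a new upstream server.

    Hypothetical helper for illustration; the real file from the tests was not shown.
    """
    lines = [
        "standby_mode = 'on'",
        f"primary_conninfo = 'host={host} port={port}'",
    ]
    if follow_latest_timeline:
        # Assumption: lets the standby follow a timeline switch after a promotion.
        lines.append("recovery_target_timeline = 'latest'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 192.0.2.15 is a documentation-only address standing in for node E.
    print(render_recovery_conf("192.0.2.15", 5432), end="")
```

The standby has to be restarted after the file is changed, as described in a) and b).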

Now, in scenario a), when we promote B as the master after A fails, C & D
continue to talk with B. However, when I change recovery.conf on E to use
the port and IP of B, it throws the following errors.

cp: cannot stat `/usr/local/arch/00000002.history': No such file or
directory

cp: cannot stat `/usr/local/arch/00000003.history': No such file or
directory

LOG: entering standby mode

cp: cannot stat `/usr/local/arch/00000002.history': No such file or
directory

cp: cannot stat `/usr/local/arch/000000020000000000000027': No such file or
directory

cp: cannot stat `/usr/local/arch/000000010000000000000027': No such file or
directory

cp: cannot stat `/usr/local/arch/00000002.history': No such file or
directory

FATAL: requested timeline 2 is not a child of this server's history

DETAIL: Latest checkpoint is at 0/272DE57C on timeline 1, but in the
history of the requested timeline, the server forked off from that timeline
at 0/272DC548

LOG: startup process (PID 6155) exited with exit code 1
LOG: aborting startup due to startup process failure
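The two WAL locations in the DETAIL line can be compared directly. The sketch below is a simplified illustration, not the server's actual check: it treats a standby as able to follow a new timeline only if it has not moved past the point where that timeline forked off. The function names parse_lsn and can_follow are hypothetical.

```python
def parse_lsn(text: str) -> int:
    """Convert a WAL location such as '0/272DE57C' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def can_follow(standby_lsn: str, fork_lsn: str) -> bool:
    """Simplified rule: the standby must not be past the fork point."""
    return parse_lsn(standby_lsn) <= parse_lsn(fork_lsn)


if __name__ == "__main__":
    checkpoint = "0/272DE57C"   # latest checkpoint reported on E (timeline 1)
    fork_point = "0/272DC548"   # where timeline 2 forked off, per the DETAIL line
    print(can_follow(checkpoint, fork_point))                # False
    print(parse_lsn(checkpoint) - parse_lsn(fork_point))     # 8244
```

With the values from the log, E's checkpoint is 8244 bytes past the fork point, which is consistent with the FATAL message above.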

======================

TEST 2: Remastering success

Structure would be

A (Master)

(Slave1) B........................................E (Slave2)

(Slave3) C.....D (Slave4)

(1) Stopped node (A)

(2) Following are the snapshots of slave1 & slave2 after stopping node (A)

slave 1

postgres=# select pg_last_xact_replay_timestamp();
pg_last_xact_replay_timestamp
----------------------------------
2013-06-26 12:13:54.056954+05:30 ---------------> timing
(1 row)

postgres=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
0/3E000084 ----------------> received wal
(1 row)

slave 2

postgres=# select pg_last_xact_replay_timestamp();
pg_last_xact_replay_timestamp
----------------------------------
2013-06-26 12:13:54.056954+05:30 ---------------> timing
(1 row)

postgres=# select pg_last_xlog_receive_location();
pg_last_xlog_receive_location
-------------------------------
0/3E000084 ----------------> received wal
(1 row)
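Comparing the received WAL locations before choosing which standby to promote can be sketched as below. This is only an illustration using the values shown above; the function names and the idea of collecting the locations into a dictionary are assumptions, not part of the original test procedure.

```python
from typing import Dict, List


def parse_lsn(text: str) -> int:
    """Convert a WAL location such as '0/3E000084' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced(receive_locations: Dict[str, str]) -> List[str]:
    """Return the standbys that share the highest received WAL location."""
    best = max(parse_lsn(lsn) for lsn in receive_locations.values())
    return sorted(
        name for name, lsn in receive_locations.items() if parse_lsn(lsn) == best
    )


if __name__ == "__main__":
    # Values taken from pg_last_xlog_receive_location() on each standby.
    locations = {"slave1": "0/3E000084", "slave2": "0/3E000084"}
    print(most_advanced(locations))  # ['slave1', 'slave2']
```

Here both standbys report the same location, so either one is an equally good promotion candidate.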

(3) Following are the logs on slave1 after node (A) was stopped

FATAL: could not connect to the primary server: could not connect to
server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 5432?

(4) Following are the logs on slave2 after node (A) was stopped

FATAL: could not connect to the primary server: could not connect to
server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 5432?

(5) Below are the logs of slave1 when slave1 was promoted to master.

LOG: received promote request
LOG: redo done at 0/3E000024
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

(6) Below are the logs of slave2 after its recovery.conf was changed to
point to slave1 and it was restarted.

LOG: database system was shut down in recovery at 2013-06-26 12:28:49 IST
LOG: entering standby mode
LOG: consistent recovery state reached at 0/3E000084
LOG: invalid record length at 0/3E000084
LOG: database system is ready to accept read only connections
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/3E000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3E000084
LOG: new target timeline is 2
LOG: restarted WAL streaming at 0/3E000000 on timeline 2
LOG: redo starts at 0/3E000084

At this point slave2 has successfully connected to the new master and
replication is working again.
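The locations and file names in these logs are related by simple arithmetic. The sketch below assumes the default 16 MB WAL segment size (the build used in the tests is not stated) and uses hypothetical helper names. It shows why streaming starts at 0/3E000000 while redo starts at 0/3E000084, and how a name like the 000000020000000000000027 seen in TEST 1 is formed.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(text: str) -> int:
    """Convert a WAL location such as '0/3E000084' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(lsn: int) -> str:
    """Format a 64-bit WAL position the way the server logs print it."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def segment_start(lsn_text: str) -> str:
    """Round a WAL location down to the start of its segment."""
    lsn = parse_lsn(lsn_text)
    return format_lsn(lsn - lsn % WAL_SEGMENT_SIZE)


def wal_file_name(timeline: int, lsn_text: str) -> str:
    """Name of the WAL segment file holding the location on the given timeline."""
    segno = parse_lsn(lsn_text) // WAL_SEGMENT_SIZE
    segs_per_id = 0x100000000 // WAL_SEGMENT_SIZE
    return f"{timeline:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


if __name__ == "__main__":
    print(segment_start("0/3E000084"))      # 0/3E000000
    print(wal_file_name(2, "0/272DC548"))   # 000000020000000000000027
```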

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
