Standby promotion does not work

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Standby promotion does not work
Date: 2011-04-10 20:48:09
Message-ID: 4DA21789.2090403@agliodbs.com
Lists: pgsql-hackers

All,

So I've finally been able to do some testing, and I'll report that
currently there is no way I've found to get existing standbys to subscribe
to a new master.

No matter what I do in recovery.conf, it results in errors and failure
to replicate.

Test setup:
hosts: master1, master2, replica1
replica1 and master2 are subscribed to master1

First, master1 is shut down.
Second, master2 is promoted via "pg_ctl promote".

So, original recovery.conf on replica1:

#autogenerated recovery.conf file. do not edit

standby_mode = 'on'
primary_conninfo = 'host=master1 port=5432 user=replication'
trigger_file = '/var/log/pgpool/trigger/trigger_file1'
restore_command = 'scp master1:/usr/local/pgsql/wal_share/%f %p'
recovery_target_timeline = 'latest'

This is changed to:

#autogenerated recovery.conf file. do not edit

standby_mode = 'on'
primary_conninfo = 'host=master2 port=5432 user=replication'
trigger_file = '/var/log/pgpool/trigger/trigger_file1'
restore_command = 'scp master1:/usr/local/pgsql/wal_share/%f %p'
recovery_target_timeline = 'latest'

On restart of replica1, I get the following error:

2011-04-10 13:27:24.766 PDT,,,2867,,4da212ac.b33,1,,2011-04-10 13:27:24 PDT,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""
2011-04-10 13:27:29.875 PDT,,,2878,,4da212b1.b3e,1,,2011-04-10 13:27:29 PDT,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""

If I try to manually change the timeline in recovery.conf to '2', I get:

2011-04-10 13:23:05.115 PDT,,,2834,,4da211a9.b12,2,,2011-04-10 13:23:05 PDT,,0,FATAL,XX000,"recovery target timeline 2 does not exist",,,,,,,,,""
2011-04-10 13:23:05.116 PDT,,,2832,,4da211a8.b10,1,,2011-04-10 13:23:04 PDT,,0,LOG,00000,"startup process (PID 2834) exited with exit code 1",,,,,,,,,""
2011-04-10 13:23:05.116 PDT,,,2832,,4da211a8.b10,2,,2011-04-10 13:23:04 PDT,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
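One way to narrow down the second error: as far as I understand the 9.x recovery code, when recovery_target_timeline names a timeline other than 1, the standby needs that timeline's history file (the timeline ID as eight hex digits plus ".history", e.g. 00000002.history), and it fetches it through restore_command. In the config above, restore_command still pulls from master1's archive. The sketch below is a hypothetical diagnostic, assuming a local directory that mirrors whatever restore_command reads; the helper names are illustrative only and not part of PostgreSQL.

```python
import os


def history_file_name(timeline: int) -> str:
    """Timeline history file name: timeline ID as 8 uppercase hex digits."""
    return "%08X.history" % timeline


def timeline_visible(archive_dir: str, timeline: int) -> bool:
    """Return True if the history file for `timeline` exists in `archive_dir`.

    archive_dir is a hypothetical local copy of the archive that
    restore_command reads from. Timeline 1 never has a history file,
    so it is always treated as visible.
    """
    if timeline == 1:
        return True
    return os.path.exists(os.path.join(archive_dir, history_file_name(timeline)))
```

For example, timeline_visible("/usr/local/pgsql/wal_share", 2) run against master1's archive would be expected to return False if master2 never archived its 00000002.history there.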

Receive location on master2:
0/93000078

Receive location on replica1:
0/93000000

... and in any case, this is a test system with no activity, so there's
no way replica1 can be ahead.
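The two locations above can be compared directly. A WAL location printed as "X/Y" is two hex numbers, the high and low 32 bits of a 64-bit byte position, so a minimal sketch (the helper name is illustrative) shows replica1 sitting 0x78 bytes behind master2, not ahead of it:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a WAL location printed as 'X/Y' (hex) to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


master2 = parse_lsn("0/93000078")   # receive location reported on master2
replica1 = parse_lsn("0/93000000")  # receive location reported on replica1
print(master2 - replica1)           # 120 bytes, i.e. replica1 is behind
```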

So it seems like we still don't have any way to promote an existing
standby to a new master. Is this fixable?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
