Re: pg_standby replication problem

Lists: pgsql-general
From: Khangelani Gama <kgama(at)argility(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_standby replication problem
Date: 2014-06-09 14:28:53
Message-ID: ce3ab4298e3cc6f2751653d6f50f0342@mail.gmail.com

Please please help

*From:* Khangelani Gama [mailto:kgama(at)argility(dot)com]
*Sent:* Monday, June 09, 2014 1:42 PM
*To:* pgsql-general(at)postgresql(dot)org
*Subject:* pg_standby replication problem

Please help me with this: my secondary server shows a replication problem.
It stopped at the file called 0000000500004BAF000000AF … and from then on the
primary server kept sending WAL files until they used up the disc space in
the data directory. How do I fix this problem? It’s Postgres 9.1.2.

The Postgres log file Postgres-2014-06-08_000000.log has the following
details:

2014-06-08 00:15:54 SAST LOG: restored log file "0000000500004BAF000000AF" from archive
Trigger file: /tmp/recovery.pgsql.trigger.5432
Waiting for WAL file: 0000000500004BAF000000B0
WAL file path: /pgsql2/walfiles/0000000500004BAF000000B0
Restoring to: pg_xlog/RECOVERYXLOG
Sleep interval: 2 seconds
Max wait interval: 0 forever
Command for restore: cp "/pgsql2/walfiles/0000000500004BAF000000B0" "pg_xlog/RECOVERYXLOG"
Keep archive history: 0000000500004BAE000000F7 and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...

CONFIDENTIALITY NOTICE
The contents of and attachments to this e-mail are intended for the addressee only, and may contain the confidential
information of Argility (Proprietary) Limited and/or its subsidiaries. Any review, use or dissemination thereof by anyone
other than the intended addressee is prohibited.If you are not the intended addressee please notify the writer immediately
and destroy the e-mail. Argility (Proprietary) Limited and its subsidiaries distance themselves from and accept no liability
for unauthorised use of their e-mail facilities or e-mails sent other than strictly for business purposes.


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Khangelani Gama <kgama(at)argility(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_standby replication problem
Date: 2014-06-09 14:41:28
Message-ID: 5395C798.3060204@aklaver.com
Lists: pgsql-general

On 06/09/2014 07:28 AM, Khangelani Gama wrote:
> Please please help

Before anyone can help, you will need to provide more information on what
your archiving and replication setup is. To begin:

1) Are you doing both archiving and streaming replication?

2) What are the settings in the configuration files for those operations?

3) What is the layout for archiving, in other words do the archived
files get copied remotely to a third site or some other arrangement?

4) What caused the trigger file to be set?

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Alan Hodgson <ahodgson(at)simkin(dot)ca>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_standby replication problem
Date: 2014-06-09 14:51:05
Message-ID: 10631191.DSiINcxZsJ@skynet.simkin.ca
Lists: pgsql-general

On Monday, June 09, 2014 04:28:53 PM Khangelani Gama wrote:
> Please help me with this, my secondary server shows a replication problem.
> It stopped at the file called *0000000500004BAF000000AF …*then from here
> primary server kept on sending walfiles, until the walfiles used up the
> disc space in the data directory. How do I fix this problem. It’s postgres
> 9.1.2.
>

It looks to me like your archive_command is probably failing on the primary
server. If that fails, the logs will build up and fill up your disk as
described. And they wouldn't be available to the slave to find.
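
One hedged way to check this on a 9.1 primary is to count the `.ready` marker files in `pg_xlog/archive_status`: PostgreSQL keeps one per WAL segment that is still waiting to be archived, so a steadily growing count suggests archive_command is not succeeding. The sketch below is only an illustration; `count_ready` is a hypothetical helper name and the data directory path in the usage comment is an assumption.

```shell
#!/bin/sh
# count_ready DIR: print how many *.ready markers DIR contains.
# Each marker stands for one WAL segment that has not been archived yet.
count_ready() {
    find "$1" -type f -name '*.ready' | wc -l | tr -d ' '
}

# Usage (assumed data directory, adjust to the real one):
#   count_ready /pgsql2/data/pg_xlog/archive_status
```

A count that stays near zero would point away from a persistently failing archive_command and toward a problem on the receiving side instead.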


From: Khangelani Gama <kgama(at)argility(dot)com>
To: Alan Hodgson <ahodgson(at)simkin(dot)ca>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_standby replication problem
Date: 2014-06-09 15:06:03
Message-ID: 36e864716fcb063194f5f95e5fc0b35c@mail.gmail.com
Lists: pgsql-general

On Monday, June 09, 2014 4:51 PM, Alan Hodgson wrote:
> It looks to me like your archive_command is probably failing on the primary
> server. If that fails, the logs will build up and fill up your disk as
> described. And they wouldn't be available to the slave to find.

I am sorry, I am still trying to understand all the settings; the person who
set up the servers has left the company.

On the primary server, postgresql.conf shows the following:

# WRITE AHEAD LOG
#------------------------------------------------------------------------------

# - Settings -

wal_level = archive
# - Checkpoints -

checkpoint_segments = 128
checkpoint_timeout = 15min
checkpoint_warning = 885s
# - Archiving -

archive_mode = on
#archive_mode = off # allows archiving to be done
archive_command = '/home/cdbs/bin/run_replication.sh %p %f'

# REPLICATION
#------------------------------------------------------------------------------

# - Master Server -

# These settings are ignored on a standby server

max_wal_senders = 3

The archive_command setting points to a script, and the variables %p and %f
are passed to it.
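
For illustration, PostgreSQL replaces %p with the path of the segment to archive (relative to the data directory) and %f with the bare file name before it runs archive_command, so the script receives them as ordinary positional arguments. The sketch below is a hypothetical stand-in for the first lines of such a script, not the actual contents of run_replication.sh; the function name `log_start` and the message format are assumptions.

```shell
#!/bin/sh
# Hypothetical stand-in for the start of an archive script.
log_start() {
    src=$1     # from %p: segment path, relative to the data directory
    dest=$2    # from %f: segment file name only
    echo "replication started: $(date) source: ${src}, dest: ${dest}"
}

# Example call with the values PostgreSQL would substitute for one segment:
log_start pg_xlog/0000000500004BAF000000AF 0000000500004BAF000000AF
```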

The replication script running on the primary server contains the
following:

while [ $test = "false" ]
do
    rsync -a /pgsql2/data/${src} postgres@10.58.101.10:/pgsql2/walfiles/${dest} >> /tmp/run_replication.sh.out 2>> /tmp/run_replication.sh.out
    test=`ssh AB_CDS3 "if [ -f /pgsql2/walfiles/${dest} ];then echo 'true' ;else echo 'false';fi"`
    if [ ${test} = "false" ]
    then
        echo "Test is false for CDS3, sleeping 10" >> /tmp/run_replication.sh.out
        sleep 10
        cnt=$(( $cnt + 1 ))
        if [ ${cnt} -ge 60 ]
        then
            message="Replication ERROR: Unable to send WAL file(${desc}) from CDS to CDS3"
            echo "`date` : ${message}" >> /tmp/run_replication.sh.out
            sendsms
        fi
    fi
done
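
A few things in this fragment may be worth checking, with the caveat that the rest of the script is not shown. If the ssh check itself cannot connect, `test` ends up empty, the unquoted `[ $test = "false" ]` comparisons fail with a syntax error, and the loop ends as though the file had arrived. The script may then return success, and PostgreSQL treats a zero exit status from archive_command as "segment safely archived". `${desc}` also looks like a typo for `${dest}`, and reaching 60 attempts only sends an SMS without stopping the loop. The following is a minimal sketch of a wrapper that retries a bounded number of times and otherwise returns non-zero, so that PostgreSQL keeps the segment and retries it later. `MAX_TRIES`, `SLEEP_SECS`, `transfer` and `ship_wal` are hypothetical names, and the rsync target is copied from the fragment above.

```shell
#!/bin/sh
# Sketch only: bounded retries, non-zero status when the segment was not delivered.
MAX_TRIES=${MAX_TRIES:-5}       # assumption: illustrative retry limit
SLEEP_SECS=${SLEEP_SECS:-10}    # same pause as the original fragment

# transfer SRC DEST: hypothetical hook that must return 0 only on success.
transfer() {
    rsync -a "/pgsql2/data/$1" "postgres@10.58.101.10:/pgsql2/walfiles/$2"
}

# ship_wal SRC DEST: try the transfer up to MAX_TRIES times.
ship_wal() {
    src=$1
    dest=$2
    try=1
    while [ "$try" -le "$MAX_TRIES" ]; do
        if transfer "$src" "$dest"; then
            return 0
        fi
        echo "$(date): attempt $try failed for $dest" >&2
        try=$((try + 1))
        sleep "$SLEEP_SECS"
    done
    return 1    # tell the caller, and thus PostgreSQL, that archiving failed
}

# When used as archive_command, the script would end with:
#   ship_wal "$1" "$2" || exit 1
```

Quoting every variable and relying on the exit status of the transfer itself, rather than on the text printed by a second ssh call, avoids the empty-string case described above.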



From: Khangelani Gama <kgama(at)argility(dot)com>
To: Alan Hodgson <ahodgson(at)simkin(dot)ca>, pgsql-general(at)postgresql(dot)org
Subject: Re: pg_standby replication problem
Date: 2014-06-09 15:25:51
Message-ID: 00af6ac344633d45a78ec724ca7ff85e@mail.gmail.com
Lists: pgsql-general

I just got this from the primary server (/tmp/run_replication.sh.out); the
secondary server's IP is 10.58.101.10.

replication started: Sun Jun 8 00:05:26 SAST 2014 source: pg_xlog/0000000500004BAF000000AF, dest: 0000000500004BAF000000AF
replication finished: Sun Jun 8 00:05:33 SAST 2014
replication started: Sun Jun 8 00:05:33 SAST 2014 source: pg_xlog/0000000500004BAF000000B0, dest: 0000000500004BAF000000B0
ssh: connect to host 10.58.101.10 port 22: Connection timed out^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
replication finished: Sun Jun 8 00:07:41 SAST 2014
replication started: Sun Jun 8 00:07:41 SAST 2014 source: pg_xlog/0000000500004BAF000000B1, dest: 0000000500004BAF000000B1
replication finished: Sun Jun 8 00:07:53 SAST 2014
replication started: Sun Jun 8 00:07:53 SAST 2014 source: pg_xlog/0000000500004BAF000000B2, dest: 0000000500004BAF000000B2
replication finished: Sun Jun 8 00:07:57 SAST 2014
replication started: Sun Jun 8 00:07:58 SAST 2014 source: pg_xlog/0000000500004BAF000000B3, dest: 0000000500004BAF000000B3
replication finished: Sun Jun 8 00:08:06 SAST 2014
replication started: Sun Jun 8 00:08:06 SAST 2014 source: pg_xlog/0000000500004BAF000000B4, dest: 0000000500004BAF000000B4
replication finished: Sun Jun 8 00:08:11 SAST 2014
replication started: Sun Jun 8 00:08:11 SAST 2014 source: pg_xlog/0000000500004BAF000000B5, dest: 0000000500004BAF000000B5
replication finished: Sun Jun 8 00:08:16 SAST 2014
replication started: Sun Jun 8 00:08:16 SAST 2014 source: pg_xlog/0000000500004BAF000000B6, dest: 0000000500004BAF000000B6
replication finished: Sun Jun 8 00:08:22 SAST 2014
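
The segment whose transfer timed out here, 0000000500004BAF000000B0, is the same one the standby log reports it is still waiting for, while the later segments went through. To look for other segments that may have been skipped the same way, the output file can be scanned for error lines that follow a "dest:" entry. This is a rough sketch that assumes the message format shown above; `failed_segments` is a hypothetical helper name.

```shell
#!/bin/sh
# failed_segments FILE: print each segment whose transfer logged an ssh/rsync error.
failed_segments() {
    awk '
        /dest: /                { sub(/.*dest: /, ""); cur = $1 }
        /replication finished/  { cur = "" }
        /rsync error|timed out/ {
            if (cur != "" && !(cur in seen)) { seen[cur] = 1; print cur }
        }
    ' "$1"
}

# Usage:
#   failed_segments /tmp/run_replication.sh.out
```

Any segment it lists would need to be confirmed as missing from /pgsql2/walfiles on the standby before drawing conclusions.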



From: Khangelani Gama <kgama(at)argility(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_standby replication problem
Date: 2014-06-09 16:16:55
Message-ID: 42f9b1d2f7af877cf4832db7cde87686@mail.gmail.com
Lists: pgsql-general

Since there was a connection timeout problem on the primary server, how can
I free up disc space on the secondary server so that replication can continue
from where it stopped? Do I remove WAL files from the secondary server?

Disc space Breakdown:

4.0K ./backup
12K ./copy
4.9T ./data
204K ./test
16K ./lost+found
361G ./walfiles
5.3T .
