Re: Some problems about cascading replication

Lists: pgsql-hackers
From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Some problems about cascading replication
Date: 2011-08-16 08:55:49
Message-ID: CAHGQGwFBc7WW+uOZJ8OGhCa_2obojUdgS=w9Eu-wW1hV4xTA9A@mail.gmail.com

Hi,

When I tested PITR on git master with max_wal_senders > 0,
I found that the following inappropriate log message was always
output even though cascading replication was not in progress. The
attached patch fixes this problem.

LOG: terminating all walsender processes to force cascaded
standby(s) to update timeline and reconnect

While making the patch, I found another problem with cascading
replication. When promoting a cascading standby, postmaster sends
SIGUSR2 to any cascading walsenders to kill them. But there is a
corner case where such a walsender fails to receive SIGUSR2 and
unexpectedly survives the standby promotion. This happens when
postmaster sends SIGUSR2 before the walsender marks itself as
a WAL sender, because postmaster sends SIGUSR2 only to the
processes marked as WAL senders.

To avoid the corner case, I changed walsender so that it checks
again whether recovery is in progress after marking itself
as a WAL sender. If recovery is not in progress even though the
walsender is a cascading one, it does the same thing as the SIGUSR2
signal handler does, and then exits later. The attached patch also
includes this fix.
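
The race and the re-check described above can be pictured with a small
simulation. This is only a sketch in Python, not the patch itself: the
names Walsender, Cluster, mark_as_walsender, promote and run_race are
hypothetical and do not correspond to identifiers in the PostgreSQL
source, and signal delivery is modeled as a plain method call.

```python
from dataclasses import dataclass, field


@dataclass
class Walsender:
    """Hypothetical stand-in for a cascading walsender process."""
    marked: bool = False        # has it marked itself as a WAL sender?
    got_sigusr2: bool = False   # was SIGUSR2 delivered to it?
    exiting: bool = False       # will it exit so the standby reconnects?

    def handle_sigusr2(self) -> None:
        # Models the SIGUSR2 handler: remember the signal and plan to exit.
        self.got_sigusr2 = True
        self.exiting = True

    def mark_as_walsender(self, cluster: "Cluster", recheck: bool) -> None:
        self.marked = True
        # The fix: look at the recovery state again after marking.
        if recheck and not cluster.in_recovery:
            self.exiting = True  # same outcome as the SIGUSR2 handler


@dataclass
class Cluster:
    """Hypothetical stand-in for postmaster plus shared recovery state."""
    in_recovery: bool = True
    walsenders: list = field(default_factory=list)

    def promote(self) -> None:
        self.in_recovery = False
        # Only processes already marked as WAL senders are signaled.
        for ws in self.walsenders:
            if ws.marked:
                ws.handle_sigusr2()


def run_race(recheck: bool) -> Walsender:
    """Promotion happens before the walsender has marked itself."""
    cluster = Cluster()
    ws = Walsender()
    cluster.walsenders.append(ws)
    cluster.promote()                       # SIGUSR2 misses the unmarked process
    ws.mark_as_walsender(cluster, recheck)  # marking happens too late
    return ws
```

In this model, run_race(recheck=False) leaves the walsender alive after
promotion, while run_race(recheck=True) makes it exit even though
SIGUSR2 never reached it.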

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
fix_some_problems_about_cascading_replication_v1.patch text/x-patch 1.8 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some problems about cascading replication
Date: 2011-08-16 13:25:15
Message-ID: CA+U5nMKtbD++BOma59KZqB6UMzxapj9VL7ZFd1kV7aP6KwAcoA@mail.gmail.com

On Tue, Aug 16, 2011 at 9:55 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> When I tested PITR on git master with max_wal_senders > 0,
> I found that the following inappropriate log message was always
> output even though cascading replication was not in progress. The
> attached patch fixes this problem.
>
>    LOG:  terminating all walsender processes to force cascaded
> standby(s) to update timeline and reconnect
>
> While making the patch, I found another problem with cascading
> replication. When promoting a cascading standby, postmaster sends
> SIGUSR2 to any cascading walsenders to kill them. But there is a
> corner case where such a walsender fails to receive SIGUSR2 and
> unexpectedly survives the standby promotion. This happens when
> postmaster sends SIGUSR2 before the walsender marks itself as
> a WAL sender, because postmaster sends SIGUSR2 only to the
> processes marked as WAL senders.
>
> To avoid the corner case, I changed walsender so that it checks
> again whether recovery is in progress after marking itself
> as a WAL sender. If recovery is not in progress even though the
> walsender is a cascading one, it does the same thing as the SIGUSR2
> signal handler does, and then exits later. The attached patch also
> includes this fix.

Looks like valid problems and appropriate fixes to me. Will commit.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some problems about cascading replication
Date: 2011-08-16 14:56:13
Message-ID: 4E4A850D.1000202@enterprisedb.com

On 16.08.2011 16:25, Simon Riggs wrote:
> On Tue, Aug 16, 2011 at 9:55 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
>> When I tested PITR on git master with max_wal_senders > 0,
>> I found that the following inappropriate log message was always
>> output even though cascading replication was not in progress. The
>> attached patch fixes this problem.
>>
>> LOG: terminating all walsender processes to force cascaded
>> standby(s) to update timeline and reconnect
>>
>> While making the patch, I found another problem with cascading
>> replication. When promoting a cascading standby, postmaster sends
>> SIGUSR2 to any cascading walsenders to kill them. But there is a
>> corner case where such a walsender fails to receive SIGUSR2 and
>> unexpectedly survives the standby promotion. This happens when
>> postmaster sends SIGUSR2 before the walsender marks itself as
>> a WAL sender, because postmaster sends SIGUSR2 only to the
>> processes marked as WAL senders.
>>
>> To avoid the corner case, I changed walsender so that it checks
>> again whether recovery is in progress after marking itself
>> as a WAL sender. If recovery is not in progress even though the
>> walsender is a cascading one, it does the same thing as the SIGUSR2
>> signal handler does, and then exits later. The attached patch also
>> includes this fix.
>
> Looks like valid problems and appropriate fixes to me. Will commit.

I think there's a race condition here. If a walsender is just starting
up, it might not have registered itself as a walsender yet. The race has
actually been there since before this patch to suppress the log message.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some problems about cascading replication
Date: 2011-08-17 04:27:11
Message-ID: CAHGQGwH29TPJ9Aq=r5Pm1MCjvwV8H3Q6DEs7+=YxqNo+RYzUTw@mail.gmail.com

On Tue, Aug 16, 2011 at 11:56 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I think there's a race condition here. If a walsender is just starting up,
> it might not have registered itself as a walsender yet. The race has
> actually been there since before this patch to suppress the log message.

Right. To address this problem, I changed the patch so that a "dead-end"
walsender (i.e., a cascading walsender that is running even though
recovery is not in progress) always emits the log message. This change
would cause duplicate log messages if the standby promotion is requested
while multiple walsenders, including a "dead-end" one, are running. But
since this is unlikely to happen, I don't think it's worth writing code
to suppress those duplicate log messages. Comments?

I attached the updated version of the patch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
fix_some_problems_about_cascading_replication_v2.patch text/x-patch 2.7 KB