Standby catch up state change

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Standby catch up state change
Date: 2013-10-15 10:21:46
Message-ID: CABOikdMqc7qdkFqKvNg4HTYb-QjnR3VwY-PdbPq=+q6chRbt4w@mail.gmail.com
Lists: pgsql-hackers

Hello,

I wonder if there is an issue with the way the state change happens
from WALSNDSTATE_CATCHUP to WALSNDSTATE_STREAMING. Please note that my
question is based solely on a strange behavior reported by a colleague and
my own limited code reading. The colleague is trying out replication with a
networking middleware and noticed that the master logs the debug message
about the standby catching up, but the write_location in the
pg_stat_replication view takes minutes to reflect the actual catch-up
location.

ISTM that the following code in walsender.c assumes that the standby has
caught up once the master has sent all the required WAL.

    /* Do we have any work to do? */
    Assert(sentPtr <= SendRqstPtr);
    if (SendRqstPtr <= sentPtr)
    {
        *caughtup = true;
        return;
    }

But what if the standby has not yet received all the WAL data sent by the
master? This can happen for various reasons, such as caching at the OS level
or in the network layer on the sender machine, or at any intermediate hop.

Should we not instead wait for the standby to have received all the WAL
before declaring that it has caught up? If a failure happens while the
data is still in the sender's buffer, the standby may not actually catch up
to the desired point, contrary to the LOG message displayed on the master.
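
To make the suggestion concrete, here is a minimal standalone sketch of the
kind of check I have in mind. It is not actual walsender.c code: the WalPos
type, the function name and its parameters are all hypothetical, and it
assumes the sender has access to the last write position reported back by
the standby.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a WAL position; an assumption for this sketch. */
typedef uint64_t WalPos;

/*
 * Hypothetical helper, not part of walsender.c.
 *
 * sent_ptr:      end of WAL already handed to the network layer
 * send_rqst_ptr: end of WAL that is currently available to send
 * standby_write: last write position the standby has reported back
 */
bool
standby_confirmed_caught_up(WalPos sent_ptr, WalPos send_rqst_ptr,
                            WalPos standby_write)
{
    assert(sent_ptr <= send_rqst_ptr);

    /* There is still WAL left to send, so the standby cannot be caught up. */
    if (sent_ptr < send_rqst_ptr)
        return false;

    /* Everything was sent; treat it as caught up only once it is confirmed. */
    return standby_write >= sent_ptr;
}
```

With a check like this, the case where everything has been sent but the
standby has only confirmed an earlier position would not yet count as
caught up.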

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
