Re: loss of transactions in streaming replication

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: loss of transactions in streaming replication
Date: 2011-10-19 06:31:10
Message-ID: CAHGQGwFqEvHEZjgbefNWrxs9WCVKP9OE8x8L+==PKKT-Xab7MA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 19, 2011 at 11:28 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Convince me.  :-)

OK, I'll try.

> My reading of the situation is that you're talking about a problem
> that will only occur if, while the master is in the process of
> shutting down, a network error occurs.

No. This happens even if a network error doesn't occur. I can
reproduce the issue by doing the following:

1. Set up a streaming replication master and standby with
archiving enabled.
2. Run pgbench -i
3. Shut down the master in fast mode.

Then I can see that the latest WAL file in the master's pg_xlog
doesn't exist in the standby's pg_xlog. The WAL record that was
lost is the shutdown checkpoint record.

When a smart or fast shutdown is requested, the master tries to
write and send the WAL switch record (if archiving is enabled)
and the shutdown checkpoint record. Because of the problem I
described, the WAL switch record arrives at the standby but the
shutdown checkpoint record does not.

> I am not sure it's a good idea
> to convolute the code to handle that case, because (1) there are going
> to be many similar situations where nothing within our power is
> sufficient to prevent WAL from failing to make it to the standby and

Shutting down the master is not a rare event, so I think it's
worth doing something about it.

> (2) for this marginal improvement, you're giving up including
> PQerrorMessage(streamConn) in the error message that ultimately gets
> omitted, which seems like a substantial regression as far as
> debuggability is concerned.

I think that it's possible to include PQerrorMessage() in the error
message. I'll change the patch accordingly.

> Even if we do decide that we want the
> change in behavior, I see no compelling reason to back-patch it.
> Stable releases are supposed to be stable, not change behavior because
> we thought of something we like better than what we originally
> released.

The original behavior, in 9.0, is that all outstanding WAL is
replicated to the standby when the master shuts down normally.
But ISTM that behavior was changed unintentionally in 9.1. So
I think the fix should be back-patched to 9.1 to restore the
original behavior.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
