Re: Synchronization levels in SR

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronization levels in SR
Date: 2010-05-26 09:52:10
Message-ID: AANLkTikGPW4jtmcZ1RynXakG9LyPTCBrAJJnk1xByGgb@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 26, 2010 at 5:02 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Everything I've said about "per-standby" settings applies here, which
> was based upon having just 2 settings: sync and async. If you have four
> settings instead, things get even more complex. If we were going to
> reduce complexity, it would be to reduce the number of options here to
> just offering option #2 in the first phase.
>
> AFAICS people would only ever select #2 or #4 anyway. IMHO #3 isn't
> likely to be selected on its own because it performs badly for no real
> benefit. Having two standbys, I might want to specify #2 to both, or if
> one is down then #3 to the remaining standby instead.

I don't think dropping support for #3 would reduce complexity, since the
code for #3 is almost the same as that for #2. In the #2 case walreceiver
sends the ACK after receiving the WAL; in the #3 case it only has to send
the same ACK after the WAL flush instead.
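
To illustrate how small that difference is, here is a minimal sketch in
Python (for illustration only, not walreceiver code). The names SyncLevel,
ACK_AFTER and process_wal are hypothetical, and the meanings given to #1
(async) and #4 (ACK after replay) are assumptions; only #2 and #3 are
described above.

```python
from enum import IntEnum


class SyncLevel(IntEnum):
    ASYNC = 1   # assumption: #1 means the master does not wait for an ACK
    RECV = 2    # #2: ACK once the WAL has been received
    FSYNC = 3   # #3: ACK once the WAL has been flushed
    REPLAY = 4  # assumption: #4 means ACK once the WAL has been replayed


# Steps the standby performs for each chunk of WAL, in order.
STEPS = ["receive", "flush", "replay"]

# The step after which the ACK is sent; None means no ACK is sent.
ACK_AFTER = {
    SyncLevel.ASYNC: None,
    SyncLevel.RECV: "receive",
    SyncLevel.FSYNC: "flush",
    SyncLevel.REPLAY: "replay",
}


def process_wal(level):
    """Return the ordered list of events for one WAL chunk at this level."""
    events = []
    for step in STEPS:
        events.append(step)
        if ACK_AFTER[level] == step:
            events.append("send_ack")
    return events


print(process_wal(SyncLevel.RECV))
print(process_wal(SyncLevel.FSYNC))
```

In this sketch the only thing that changes between #2 and #3 is the
position of "send_ack" in the event list.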

> Nobody else has yet tried to explain how we would specify what happens
> when one of the standbys is down, with per-standby settings. Failure
> modes are where the complexity is here. However we proceed, we must have
> a discussion about how we specify the failure modes. This is not
> something we should add on at the last minute, we should think about
> that now and address it openly.

>> Imagine having 2 standbys, 1 synch, 1 async. If the synch server goes
>> down, performance will improve and robustness will have been lost. What
>> good would that be?

Are you concerned about the case above, which you described in another
post? In that case, if you want to ensure robustness, you can specify #2,
#3 or #4 for both standbys. If one of the standbys is at a remote site,
we can additionally set max_synchronous_standbys to 1. If you don't want
to fail over to the remote standby when the master goes down, you can
specify #1 for the remote standby, so that the nearby standby is always
guaranteed to be in sync with the master.

> Oracle Data Guard is a great resource for what semantics we might need
> to cover, but its also a lesson in complexity from its per-standby
> settings. Please look at net_timeout and alternate options in
> particular. See how difficult it is to specify failure modes, even
> though Data Guard offers probably dozens of parameters and options - its
> orientation is per-standby not towards the transaction and the user.

Yeah, I'll research Oracle Data Guard.

> To summarise, I think we can get away with just 3 parameters:
> synchronous_replication = N     # similar in name to synchronous_commit
> synch_rep_timeout = T
> synch_rep_timeout_action = commit | abort

I agree with adding the latter two parameters, which are also listed in
my outline of SynchRep.
http://wiki.postgresql.org/wiki/Streaming_Replication#Synchronization_capability

> Conceptually, this is "I want at least N replica copies made of my
> database changes, I will wait for up to T milliseconds to get that
> otherwise I will do X". Very easy and clear for an application to
> understand what guarantees it is requesting. Also very easy for the
> administrator to understand the guarantees requested and how to
> provision for them: to deliver robustness they typically need N+1
> servers, or for even higher levels of robustness and performance N+2
> etc..
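
The "N copies, wait up to T, otherwise do X" semantics quoted above could
be sketched roughly as follows. This is only a model in Python under
stated assumptions, not a proposed implementation: wait_for_replicas and
its arguments are hypothetical names, and ACK arrival times are passed in
as plain numbers instead of being read from real standbys.

```python
def wait_for_replicas(ack_times_ms, n, timeout_ms, timeout_action):
    """Model the outcome of one commit.

    ack_times_ms   -- per standby, milliseconds until its ACK arrives,
                      or None if the standby never answers
    n              -- models synchronous_replication = N
    timeout_ms     -- models synch_rep_timeout = T
    timeout_action -- models synch_rep_timeout_action: "commit" or "abort"

    Returns (outcome, milliseconds the committing backend waited).
    """
    if n == 0:
        # No replica copies requested, so there is nothing to wait for.
        return ("commit", 0)
    arrived = sorted(
        t for t in ack_times_ms if t is not None and t <= timeout_ms
    )
    if len(arrived) >= n:
        # The commit returns as soon as the N-th ACK arrives.
        return ("commit", arrived[n - 1])
    # Fewer than N ACKs within T: apply the configured action.
    return (timeout_action, timeout_ms)


print(wait_for_replicas([5, 40], n=1, timeout_ms=100, timeout_action="abort"))
print(wait_for_replicas([5, None], n=2, timeout_ms=100, timeout_action="abort"))
```

Note that the model says nothing about which standbys sent the ACKs; it
only counts them.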

I don't think the "synchronous_replication" approach is intuitive for
the administrator. Even on this thread, some people seem to prefer the
"per-standby" setting.

Without a "per-standby" setting, when there are two standbys, one in a
nearby rack and the other at a remote site, "synchronous_replication=1"
cannot guarantee that the near standby is always in sync with the master.
So when the master goes down, we might unfortunately have to fail over to
the remote standby. OTOH, "synchronous_replication=2" degrades
performance on the master considerably, because every commit must then
wait for the remote standby. So the "synchronous_replication" approach
doesn't seem to cover this typical use case.
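
The near/remote case can be made concrete with a small Python model. It
is only a sketch: commit_point is a hypothetical name, the latencies are
made-up numbers, and it assumes that a commit returns as soon as any N
standbys have sent their ACK.

```python
def commit_point(ack_times_ms, n):
    """Return (time the commit returns, standbys that have ACKed by then).

    ack_times_ms -- dict mapping standby name to the milliseconds until
                    its ACK arrives
    n            -- models synchronous_replication = N; this sketch
                    requires 1 <= n <= number of standbys
    """
    ordered = sorted(ack_times_ms.items(), key=lambda kv: kv[1])
    commit_time = ordered[n - 1][1]
    holders = {name for name, t in ordered if t <= commit_time}
    return commit_time, holders


# Normal case with N=1: the near standby answers first.
print(commit_point({"near": 1, "remote": 30}, n=1))

# The near standby stalls for a moment: the commit is still acknowledged,
# but only the remote standby is known to have it.
print(commit_point({"near": 200, "remote": 30}, n=1))

# N=2: both standbys have the commit, but every commit waits for the
# remote round trip.
print(commit_point({"near": 1, "remote": 30}, n=2))
```

In the second call the commit returns without the near standby having
confirmed it, which is the failover problem described above.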

Also, when "synchronous_replication=1" and one of the synchronous
standbys goes down, how should the surviving standby catch up with the
master? That standby might be far behind the master. Should transaction
commits start waiting for the ACK from the lagging standby immediately,
even if there is a large gap? If so, "synch_rep_timeout" would easily
screw up the replication.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
