Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-06 22:51:56
Message-ID: E1PwMns-00047O-UU@gemulon.postgresql.org
Lists: pgsql-committers pgsql-hackers

Efficient transaction-controlled synchronous replication.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.

This version has very strict behaviour; more relaxed options
may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/a8a8a3e0965201df88bdfdff08f50e5c06c552b7

Modified Files
--------------
doc/src/sgml/config.sgml                      |  86 +++++++++++
doc/src/sgml/high-availability.sgml           | 203 +++++++++++++++++++++++++
doc/src/sgml/monitoring.sgml                  |   7 +-
src/backend/access/transam/twophase.c         |  25 +++
src/backend/access/transam/xact.c             |  11 ++-
src/backend/catalog/system_views.sql          |   4 +-
src/backend/postmaster/autovacuum.c           |   7 +
src/backend/postmaster/postmaster.c           |   3 +
src/backend/replication/Makefile              |   2 +-
src/backend/replication/walreceiver.c         |   9 +-
src/backend/replication/walsender.c           |  65 +++++++-
src/backend/storage/ipc/shmqueue.c            |  21 +++-
src/backend/storage/lmgr/proc.c               |  12 ++
src/backend/utils/misc/guc.c                  |  19 +++
src/backend/utils/misc/postgresql.conf.sample |  11 ++-
src/include/catalog/pg_proc.h                 |   2 +-
src/include/replication/walsender.h           |  22 +++
src/include/storage/lwlock.h                  |   1 +
src/include/storage/proc.h                    |  14 ++
src/include/storage/shmem.h                   |   3 +
src/test/regress/expected/rules.out           |   2 +-
21 files changed, 507 insertions(+), 22 deletions(-)

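The selection rule in the commit message ("the highest priority standby that is connected and ready to synchronize") can be illustrated with a small C sketch. It assumes that priority is the position of a standby's name in the configured list, which the message itself does not state. The Standby struct and choose_sync_standby() are hypothetical names for illustration, not the code in the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one standby; not a structure from the patch. */
typedef struct Standby
{
    const char *name;       /* name the standby reports to the master */
    bool        connected;  /* connected and ready to synchronize */
} Standby;

/*
 * Return the index (into standbys) of the standby to wait for, or -1 if
 * none qualifies. Assumption: an earlier position in sync_names means a
 * higher priority.
 */
int
choose_sync_standby(const char *const *sync_names, size_t n_names,
                    const Standby *standbys, size_t n_standbys)
{
    assert(sync_names != NULL || n_names == 0);

    for (size_t i = 0; i < n_names; i++)
    {
        for (size_t j = 0; j < n_standbys; j++)
        {
            if (standbys[j].connected &&
                strcmp(sync_names[i], standbys[j].name) == 0)
                return (int) j;
        }
    }
    return -1;
}
```

In this sketch a lower-priority standby is chosen only while every higher-priority one is disconnected, which matches the "take over in case of standby failure" wording.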

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-06 23:09:53
Message-ID: 4D741441.2090704@dunslane.net
Lists: pgsql-committers pgsql-hackers

On 03/06/2011 05:51 PM, Simon Riggs wrote:
> Efficient transaction-controlled synchronous replication.
>

I'm glad this is in, but I thought we agreed NOT to call it "synchronous
replication".

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-06 23:28:13
Message-ID: 22871.1299454093@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Efficient transaction-controlled synchronous replication.

This patch broke the build. Kindly fix or revert at once.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-06 23:28:45
Message-ID: 1299454125.1696.6138.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
>
> On 03/06/2011 05:51 PM, Simon Riggs wrote:
> > Efficient transaction-controlled synchronous replication.
> >
>
> I'm glad this is in, but I thought we agreed NOT to call it "synchronous
> replication".

The discussion on the thread was that it's not sync rep unless we have
the strictest guarantees. We have the strictest guarantees, so it
qualifies as sync rep.

Relaxations are possible and, to some people, desirable.

Perhaps there is a more marketable term, and if so, we can rebrand. It
wouldn't be the first time things got renamed in beta.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-06 23:36:30
Message-ID: AANLkTi=_s9y09mdGVLX+tng2CJOROfmk68vzw+wBn6pf@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Sun, Mar 6, 2011 at 6:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> Efficient transaction-controlled synchronous replication.
>
> This patch broke the build.  Kindly fix or revert at once.
>

Seems Simon forgot to include src/include/replication/syncrep.h on the commit

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-06 23:38:30
Message-ID: AANLkTi=icg5tORhth5U+89CStvZbkmz9yNVj_4iW1dDK@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Sun, Mar 6, 2011 at 6:36 PM, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
> On Sun, Mar 6, 2011 at 6:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>> Efficient transaction-controlled synchronous replication.
>>
>> This patch broke the build.  Kindly fix or revert at once.
>>
>
> Seems Simon forgot to include src/include/replication/syncrep.h on the commit
>

It doesn't have src/backend/replication/syncrep.c either

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-06 23:52:49
Message-ID: 1299455569.1696.6213.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Sun, 2011-03-06 at 18:28 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > Efficient transaction-controlled synchronous replication.
>
> This patch broke the build. Kindly fix or revert at once.

I think that's fixed it now. I was in the middle of doing that when your
last commit hit, so I had to rewind and try again.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 07:29:48
Message-ID: 4D74896C.5030402@enterprisedb.com
Lists: pgsql-committers pgsql-hackers

On 07.03.2011 01:28, Simon Riggs wrote:
> On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
>>
>> On 03/06/2011 05:51 PM, Simon Riggs wrote:
>>> Efficient transaction-controlled synchronous replication.
>>
>> I'm glad this is in, but I thought we agreed NOT to call it "synchronous
>> replication".
>
> The discussion on the thread was that its not sync rep unless we have
> the strictest guarantees. We have the strictest guarantees, so it
> qualifies as sync rep.

What do you mean by "strictest guarantees"?

I don't see allow_synchronous_standby setting in the committed patch. I
presume you didn't make allow_synchronous_standby=off the default
behavior. Also, the documentation that describes this as two-safe
replication and claims that "the only possibility that data can be lost
is if both the primary and the standby suffer crashes at the same time"
needs big fat caveats to clarify that this doesn't actually achieve
those guarantees.

Please change the name.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 07:48:11
Message-ID: 1299484091.1696.7650.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Mon, 2011-03-07 at 09:29 +0200, Heikki Linnakangas wrote:

> I presume you didn't make allow_synchronous_standby=off the default
> behavior.

You presume incorrectly.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 07:54:19
Message-ID: 4D748F2B.5010200@enterprisedb.com
Lists: pgsql-committers pgsql-hackers

On 07.03.2011 09:48, Simon Riggs wrote:
> On Mon, 2011-03-07 at 09:29 +0200, Heikki Linnakangas wrote:
>
>> I presume you didn't make allow_synchronous_standby=off the default
>> behavior.

Sorry, s/allow_synchronous_standby/allow_standalone_master

> You presume incorrectly.

Ok, ok then. Thank you! Looks like I need to git pull and get myself
up-to-speed with these latest developments :-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 08:27:16
Message-ID: AANLkTikJmP+bo1N-mFUWEpJiV6_OKisYw512OGeTJUbm@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Efficient transaction-controlled synchronous replication.
> If a standby is broadcasting reply messages and we have named
> one or more standbys in synchronous_standby_names then allow
> users who set synchronous_replication to wait for commit, which
> then provides strict data integrity guarantees. Design avoids
> sending and receiving transaction state information so minimises
> bookkeeping overheads. We synchronize with the highest priority
> standby that is connected and ready to synchronize. Other standbys
> can be defined to takeover in case of standby failure.
>
> This version has very strict behaviour; more relaxed options
> may be added at a later date.

Pretty cool! I very much appreciate your efforts and contributions.

And I found one bug ;) You seem to have wrongly removed the check
of max_wal_senders in SyncRepWaitForLSN. This can make the
backend wait for replication even if max_wal_senders = 0. I could reproduce
this problem on my machine. The attached patch fixes it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
syncrep_check_max_wal_senders_v1.patch application/octet-stream 2.7 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 08:44:52
Message-ID: AANLkTinbzFaJXkzwm2xEegfytK1LPw8odo61wgZkkGp=@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 5:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Efficient transaction-controlled synchronous replication.
>> If a standby is broadcasting reply messages and we have named
>> one or more standbys in synchronous_standby_names then allow
>> users who set synchronous_replication to wait for commit, which
>> then provides strict data integrity guarantees. Design avoids
>> sending and receiving transaction state information so minimises
>> bookkeeping overheads. We synchronize with the highest priority
>> standby that is connected and ready to synchronize. Other standbys
>> can be defined to takeover in case of standby failure.
>>
>> This version has very strict behaviour; more relaxed options
>> may be added at a later date.
>
> Pretty cool! I'd appreciate very much your efforts and contributions.
>
> And,, I found one bug ;) You seem to have wrongly removed the check
> of max_wal_senders in SyncRepWaitForLSN. This can make the
> backend wait for replication even if max_wal_senders = 0. I could produce
> this problematic situation in my machine. The attached patch fixes this problem.

    if (strlen(SyncRepStandbyNames) > 0 && max_wal_senders == 0)
        ereport(ERROR,
                (errmsg("Synchronous replication requires WAL streaming (max_wal_senders > 0)")));

Shouldn't the above check also be required after pg_ctl reload, since
synchronous_standby_names can be changed by SIGHUP?
Or how about just removing it? If the patch I submitted is
committed, empty synchronous_standby_names and max_wal_senders = 0
settings are no longer unsafe.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

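The per-commit guard under discussion can be sketched in C as follows. This is only an illustration of skipping the wait when no walsender can exist or no standby is named. It is not the attached patch and not the code in SyncRepWaitForLSN, and the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether a committing backend should wait for a standby.
 * Hypothetical helper: the parameters stand in for the
 * synchronous_replication, max_wal_senders and
 * synchronous_standby_names settings.
 */
bool
commit_must_wait(bool sync_rep_requested, int max_wal_senders,
                 const char *standby_names)
{
    assert(max_wal_senders >= 0);

    if (!sync_rep_requested)
        return false;           /* user did not ask to wait */
    if (max_wal_senders == 0)
        return false;           /* no walsender can ever reply */
    if (standby_names == NULL || standby_names[0] == '\0')
        return false;           /* no synchronous standby is named */
    return true;
}
```

With a guard of this shape, a backend never blocks on a configuration in which no standby could acknowledge the commit, whether that configuration was reached at startup or through a reload.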

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 09:20:07
Message-ID: 1299489607.1696.7933.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Mon, 2011-03-07 at 17:27 +0900, Fujii Masao wrote:
> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> And,, I found one bug ;) You seem to have wrongly removed the check
> of max_wal_senders in SyncRepWaitForLSN. This can make the
> backend wait for replication even if max_wal_senders = 0. I could produce
> this problematic situation in my machine. The attached patch fixes this problem.

There may be a bug, but that's not the fix.

I spotted that issue myself in testing. I put in a protection to stop
setting synchronous_standby_names if max_wal_senders is zero, with error
message.

Are you saying the committed version doesn't trigger that ERROR?

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 09:28:25
Message-ID: AANLkTimDNqnzPdMji20RW09UfA1KHpPEMdbCkE9jKW-C@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 6:20 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Mon, 2011-03-07 at 17:27 +0900, Fujii Masao wrote:
>> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>> And,, I found one bug ;) You seem to have wrongly removed the check
>> of max_wal_senders in SyncRepWaitForLSN. This can make the
>> backend wait for replication even if max_wal_senders = 0. I could produce
>> this problematic situation in my machine. The attached patch fixes this problem.
>
> There may be a bug, but that's not the fix.
>
> I spotted that issue myself in testing. I put in a protection to stop
> setting synchronous_standby_names if max_wal_senders is zero, with error
> message.
>
> Are you saying the committed version doesn't trigger that ERROR?

I changed synchronous_standby_names after startup and reloaded the
configuration file. So I didn't encounter such an error message.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 09:30:35
Message-ID: 1299490235.1696.7973.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Mon, 2011-03-07 at 17:44 +0900, Fujii Masao wrote:

> The above check should be required also after pg_ctl reload since
> synchronous_standby_names can be changed by SIGHUP?
> Or how about just removing that? If the patch I submitted is
> committed,empty synchronous_standby_names and max_wal_senders = 0
> settings is no longer unsafe.

Ah, on reload. I plugged the gap only at startup.

I'll fix it by changing assign_synchronous_standby_names(), rather than by
changing lots of other parts of the code and adding a runtime check to each COMMIT.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

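A validation step of the kind an assign hook could perform, covering both startup and reload, might look like the C sketch below. The enums and check_standby_names() are hypothetical and do not reproduce the server's GUC machinery or its error levels. How severely to report a bad value on reload is a design choice; this sketch assumes a warning, so that a reload does not stop a running server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real server has its own sources and levels. */
typedef enum { SRC_STARTUP, SRC_RELOAD } ConfigSource;
typedef enum { LVL_OK, LVL_WARNING, LVL_FATAL } Level;

/*
 * Check a proposed synchronous_standby_names value against
 * max_wal_senders. Returns the severity with which the value should be
 * rejected, or LVL_OK if it is acceptable.
 */
Level
check_standby_names(const char *names, int max_wal_senders,
                    ConfigSource source)
{
    bool names_set = (names != NULL && names[0] != '\0');

    assert(max_wal_senders >= 0);

    if (!names_set || max_wal_senders > 0)
        return LVL_OK;

    /* Names are set but WAL streaming is disabled. */
    return (source == SRC_STARTUP) ? LVL_FATAL : LVL_WARNING;
}
```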

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 09:47:20
Message-ID: AANLkTik4tuG2EA6oeiov1=DO6UcDoARP45Lk+KGfy7HC@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 6:30 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Mon, 2011-03-07 at 17:44 +0900, Fujii Masao wrote:
>
>> The above check should be required also after pg_ctl reload since
>> synchronous_standby_names can be changed by SIGHUP?
>> Or how about just removing that? If the patch I submitted is
>> committed,empty synchronous_standby_names and max_wal_senders = 0
>> settings is no longer unsafe.
>
> Ah, on reload. I plugged the gap only at startup.
>
> I'll fix by changing assign_synchronous_standby_names(), not by changing
> lots of other parts of code and making runtime check each COMMIT.

I don't think that checking a local variable on each COMMIT wastes many
cycles. Anyway, reloading the configuration file should not
cause the server to end unexpectedly. IOW, the GUC assign hook should
use GUC_complaint_elevel instead of FATAL in ereport. The attached
patch fixes that and includes two typo fixes.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
use_guc_complaint_elevel_v1.patch application/octet-stream 2.2 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 11:21:39
Message-ID: AANLkTinHrymKd56m5AfawCdujuNM6B2g_--9UiOSSKGx@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 5:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Efficient transaction-controlled synchronous replication.
>> If a standby is broadcasting reply messages and we have named
>> one or more standbys in synchronous_standby_names then allow
>> users who set synchronous_replication to wait for commit, which
>> then provides strict data integrity guarantees. Design avoids
>> sending and receiving transaction state information so minimises
>> bookkeeping overheads. We synchronize with the highest priority
>> standby that is connected and ready to synchronize. Other standbys
>> can be defined to takeover in case of standby failure.
>>
>> This version has very strict behaviour; more relaxed options
>> may be added at a later date.
>
> Pretty cool! I'd appreciate very much your efforts and contributions.

Here are some more comments:

    if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 ||
        SyncRepRequested())

Whenever synchronous_replication is TRUE, we disable synchronous_commit.
But before disabling it, shouldn't we also check max_wal_senders and
synchronous_standby_names? Otherwise, synchronous_commit can
be disabled unexpectedly even when replication is not in use.

- /* Let the master know that we received some data. */
- XLogWalRcvSendReply();
- XLogWalRcvSendHSFeedback();

This change completely eliminates the difference between write_location
and flush_location in pg_stat_replication. If this change is reasonable, we
should get rid of write_location from pg_stat_replication, since it would be
useless. If not, this change should be reverted. I'm not sure whether monitoring
the difference between the write and flush locations is useful, but I guess that
someone thought so, and that is why that code was added.

+ /*
+ * Current location of the head of the queue. All waiters should have
+ * a waitLSN that follows this value, or they are currently being woken
+ * to remove themselves from the queue. Protected by SyncRepLock.
+ */
+ XLogRecPtr lsn;

The comment ", or they are currently being woken to remove themselves
from the queue" is no longer required because the proc is now removed
by the walsender.

I found some typos. The attached patch fixes them.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
sync_rep_typo_fix_v1.patch application/octet-stream 2.0 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 13:30:34
Message-ID: 4D74DDFA.70800@dunslane.net
Lists: pgsql-committers pgsql-hackers

On 03/07/2011 02:29 AM, Heikki Linnakangas wrote:
> On 07.03.2011 01:28, Simon Riggs wrote:
>> On Sun, 2011-03-06 at 18:09 -0500, Andrew Dunstan wrote:
>>>
>>> On 03/06/2011 05:51 PM, Simon Riggs wrote:
>>>> Efficient transaction-controlled synchronous replication.
>>>
>>> I'm glad this is in, but I thought we agreed NOT to call it
>>> "synchronous
>>> replication".
>>
>> The discussion on the thread was that its not sync rep unless we have
>> the strictest guarantees. We have the strictest guarantees, so it
>> qualifies as sync rep.
>
> What do you mean by "strictes guarantees"?
>
> I don't see allow_synchronous_standby setting in the committed patch.
> I presume you didn't make allow_synchronous_standby=off the default
> behavior. Also, the documentation that describes this as two-safe
> replication and claims that "the only possibility that data can be
> lost is if both the primary and the standby suffer crashes at the same
> time" needs big fat caveats to clarify that this doesn't actually
> achieve those guarantees.
>
> Please change the name.
>

Previously, Simon said:

> Truly "synchronous" requires two-phase commit, which this never was.

So I too am confused about how it's now become "truly synchronous". Are
we saying this gives the same or better guarantees than a 2PC setup?

cheers

andrew


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 14:02:44
Message-ID: 4D74E584.7080409@enterprisedb.com
Lists: pgsql-committers pgsql-hackers

On 07.03.2011 15:30, Andrew Dunstan wrote:
> Previously, Simon said:
>
>> Truly "synchronous" requires two-phase commit, which this never was.
>
> So I too am confused about how it's now become "truly synchronous". Are
> we saying this give the same or better guarantees than a 2PC setup?

The guarantee we have now with synchronous_replication=on is that when
the server acknowledges a commit to the client (ie. when COMMIT command
returns), the transaction is safely flushed to disk on the master and at
least one synchronous standby server.

What you don't get is a guarantee on what happens to transactions that
were not acknowledged to the client. For example, if you pull the power
plug, the transaction that was just being committed might be committed
on the master, but not yet on the standby.

For me, that's enough to call it "synchronous replication". It provides
a useful guarantee to the client. But you could argue for an even
stricter definition, requiring atomicity so that if a transaction is not
successfully replicated for any reason, including crash, it is rolled
back in the master too. That would require 2PC.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

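The guarantee described in the message above, and the window it leaves open, can be expressed as two small predicates. This is a simplified C model, not the server's implementation: Lsn, ReplState and both function names are hypothetical, and a single standby stands in for "at least one synchronous standby".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* simplified stand-in for a WAL position */

/* Hypothetical snapshot of how far each side has flushed WAL to disk. */
typedef struct ReplState
{
    Lsn master_flushed;
    Lsn standby_flushed;
} ReplState;

/* COMMIT may return to the client only once both sides have flushed. */
bool
can_acknowledge(const ReplState *st, Lsn commit_lsn)
{
    assert(st != NULL);
    return st->master_flushed >= commit_lsn &&
           st->standby_flushed >= commit_lsn;
}

/*
 * The uncovered case: the commit is durable on the master but not yet on
 * the standby, and the client has not been told anything.
 */
bool
in_unacknowledged_window(const ReplState *st, Lsn commit_lsn)
{
    assert(st != NULL);
    return st->master_flushed >= commit_lsn &&
           st->standby_flushed < commit_lsn;
}
```

A crash while in_unacknowledged_window() holds leaves the transaction committed on the master only. Closing that window is what the stricter, 2PC-based definition would require.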

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 14:21:46
Message-ID: 4D74E9FA.7070202@dunslane.net
Lists: pgsql-committers pgsql-hackers

On 03/07/2011 09:02 AM, Heikki Linnakangas wrote:
> On 07.03.2011 15:30, Andrew Dunstan wrote:
>> Previously, Simon said:
>>
>>> Truly "synchronous" requires two-phase commit, which this never was.
>>
>> So I too am confused about how it's now become "truly synchronous". Are
>> we saying this give the same or better guarantees than a 2PC setup?
>
> The guarantee we have now with synchronous_replication=on is that when
> the server acknowledges a commit to the client (ie. when COMMIT
> command returns), the transaction is safely flushed to disk on the
> master and at least one synchronous standby server.
>
> What you don't get is a guarantee on what happens to transactions that
> were not acknowledged to the client. For example, if you pull the
> power plug, the transaction that was just being committed might be
> committed on the master, but not yet on the standby.
>
> For me, that's enough to call it "synchronous replication". It
> provides a useful guarantee to the client. But you could argue for an
> even stricter definition, requiring atomicity so that if a transaction
> is not successfully replicated for any reason, including crash, it is
> rolled back in the master too. That would require 2PC.
>

My worry is that the stricter definition is what many people will
expect, without reading the fine print.

cheers

andrew


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 14:29:04
Message-ID: AANLkTinkzJ=UrLbThZy2KeQg+SjwYqj2r0mDomMwirB7@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 2:21 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

>> For me, that's enough to call it "synchronous replication". It provides a
>> useful guarantee to the client. But you could argue for an even stricter
>> definition, requiring atomicity so that if a transaction is not successfully
>> replicated for any reason, including crash, it is rolled back in the master
>> too. That would require 2PC.
>>
>
> My worry is that the stricter definition is what many people will expect,
> without reading the fine print.

Then they are either already hosed or already using 2PC.

a.
--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 14:55:11
Message-ID: 4D749D6F020000250003B55C@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> if you pull the power plug, the transaction that was just being
> committed might be committed on the master, but not yet on the
> standby.

> For me, that's enough to call it "synchronous replication". It
> provides useful guarantee to the client.

I don't think most people would expect full 2PC behavior from
something called "synchronous replication" -- I agree that a
guarantee that a successful commit means it has been written to the
master and at least one replica is sufficient.

> you could argue for an even stricter definition, requiring
> atomicity so that if a transaction is not successfully replicated
> for any reason, including crash, it is rolled back in the master
> too. That would require 2PC.

I'm not sure you can say it breaks atomicity; if proper procedures
are followed on recovery, all servers will either reflect the
transaction or not, right? It seems to me what you lose is the
ability to know whether a transaction for which commit was requested
and for which there had not yet been a reply at the time of failure
is going to be in your recovered database. In this particular
regard it is no different from a standalone or async replication,
and you would need 2PC with a proper transaction manager to do
better.

Getting that additional guarantee may not be worth the performance
hit for most people. We train our users to save (or make) a paper
copy of what they were entering if a crash occurs (which, of course,
is very rare, but does happen), so they can check the state of it on
recovery. It is, of course, important for the programmers to use
appropriate database transaction boundaries so that the database is
always in a state with internal integrity and from which users can
determine the state and proceed on their own.

I think we should document the issues, of course.

If there is really a demand for a stricter "sync rep" feature, I
think it must be built on top of 2PC and some particular transaction
manager, which seems as though it would make it pgfoundry material.

-Kevin


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 15:03:10
Message-ID: 4D74F3AE.9000802@dunslane.net
Lists: pgsql-committers pgsql-hackers

On 03/07/2011 09:29 AM, Aidan Van Dyk wrote:
> On Mon, Mar 7, 2011 at 2:21 PM, Andrew Dunstan<andrew(at)dunslane(dot)net> wrote:
>
>>> For me, that's enough to call it "synchronous replication". It provides a
>>> useful guarantee to the client. But you could argue for an even stricter
>>> definition, requiring atomicity so that if a transaction is not successfully
>>> replicated for any reason, including crash, it is rolled back in the master
>>> too. That would require 2PC.
>>>
>> My worry is that the stricter definition is what many people will expect,
>> without reading the fine print.
> Then they are either already hosed or already using 2PC.
>
>

This is about expectations. The thing that worries me is that the use of
this term might cause some people NOT to use 2PC because they think they
are getting an equivalent guarantee, when in fact they are not. And
that's hardly unreasonable. Here for example is what wikipedia says
<http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:

Synchronous replication - guarantees "zero data loss" by the means
of atomic write operation, i.e. write either completes on both sides
or not at all. Write is not considered complete until
acknowledgement by both local and remote storage.

cheers

andrew


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 15:04:22
Message-ID: AANLkTin27HX9bK4CV87_qaVUtykPFi0KF9wC5nuO0UvT@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 2:29 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:

> Then they are either already hosed or already using 2PC.

Sorry, to expand on my all too brief comment, even *without*
replication, they are hosed.

Once you issue commit, you have no knowledge of whether the commit is
durable (or possibly even seen by someone else) until you get the
acknowledgement of the commit.

That's already a possibility with a single-machine database. Adding
replication to it just increases the period that window exists for
(and the possibility of something "Bad" hitting that window).
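
As a rough illustration of that window, here is a minimal sketch in C of
what a client can conclude about a commit. The type and function names
are hypothetical and are not part of PostgreSQL or libpq; the only point
is that "commit sent, no acknowledgement yet" is a distinct third state.

```c
#include <stdbool.h>

/* Hypothetical client-side view of a commit; not a PostgreSQL API. */
typedef enum ClientCommitState
{
    COMMIT_NOT_REQUESTED,   /* COMMIT was never sent */
    COMMIT_CONFIRMED,       /* acknowledgement received: commit is durable */
    COMMIT_UNKNOWN          /* COMMIT sent, no acknowledgement (the window) */
} ClientCommitState;

/* Classify what the client may assume about the transaction. */
ClientCommitState
client_commit_state(bool commit_sent, bool ack_received)
{
    if (!commit_sent)
        return COMMIT_NOT_REQUESTED;
    if (ack_received)
        return COMMIT_CONFIRMED;
    /* A crash here leaves the outcome to be checked after recovery. */
    return COMMIT_UNKNOWN;
}
```

Replication does not add a new state here; it only lengthens the time a
transaction can spend in COMMIT_UNKNOWN.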

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 15:13:46
Message-ID: 4D74A1CA020000250003B569@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> Synchronous replication - guarantees "zero data loss" by the
> means of atomic write operation, i.e. write either completes on
> both sides or not at all.

So far, so good.

> Write is not considered complete until acknowledgement by both
> local and remote storage.

OK, *if* we want to live up to this definition, we don't seem to
have that part covered. Of course, since the connection is broken
during the hypothetical crash, it seems hard to acknowledge it on
recovery, and short of 2PC I don't see how we roll it back. About
the best we could do is somehow have explicit logging of the
disposition of unacknowledged commit requests upon recovery, and
consider logging of success to be "acknowledgement". Is this
logging provided by other databases with "synchronous replication"
features?

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 15:46:52
Message-ID: 4D74FDEC.5040406@enterprisedb.com
Lists: pgsql-committers pgsql-hackers

On 07.03.2011 17:03, Andrew Dunstan wrote:
> This is about expectations. The thing that worries me is that the use of
> this term might cause some people NOT to use 2PC because they think they
> are getting an equivalent guarantee, when in fact they are not. And
> that's hardly unreasonable. Here for example is what wikipedia says
> <http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:
>
> Synchronous replication - guarantees "zero data loss" by the means
> of atomic write operation, i.e. write either completes on both sides
> or not at all. Write is not considered complete until
> acknowledgement by both local and remote storage.

Hmm, I've read that wikipedia definition before, but the "atomic" part
never caught my eye. You do get zero data loss with what we have; if a
meteor strikes the master, no acknowledged transaction is lost. I find
that definition a bit confusing.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 15:51:49
Message-ID: 4D74FF15.1070804@dunslane.net
Lists: pgsql-committers pgsql-hackers

On 03/07/2011 10:46 AM, Heikki Linnakangas wrote:
> On 07.03.2011 17:03, Andrew Dunstan wrote:
>> This is about expectations. The thing that worries me is that the use of
>> this term might cause some people NOT to use 2PC because they think they
>> are getting an equivalent guarantee, when in fact they are not. And
>> that's hardly unreasonable. Here for example is what wikipedia says
>> <http://en.wikipedia.org/wiki/Replication_%28computer_science%29>:
>>
>> Synchronous replication - guarantees "zero data loss" by the means
>> of atomic write operation, i.e. write either completes on both sides
>> or not at all. Write is not considered complete until
>> acknowledgement by both local and remote storage.
>
> Hmm, I've read that wikipedia definition before, but the "atomic" part
> never caught my eye. You do get zero data loss with what we have; if a
> meteor strikes the master, no acknowledged transaction is lost. I find
> that definition a bit confusing.

Maybe it is - I agree the difference might be small. I'm just trying to
make sure we don't use a term that could mislead reasonable people about
what we're providing. If we're satisfied that we aren't, then keep it.

cheers

andrew


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-07 16:09:49
Message-ID: 1299513979-sup-1987@alvh.no-ip.org
Lists: pgsql-committers pgsql-hackers

Excerpts from Andrew Dunstan's message of lun mar 07 12:51:49 -0300 2011:
>
> On 03/07/2011 10:46 AM, Heikki Linnakangas wrote:

> > Hmm, I've read that wikipedia definition before, but the "atomic" part
> > never caught my eye. You do get zero data loss with what we have; if a
> > meteor strikes the master, no acknowledged transaction is lost. I find
> > that definition a bit confusing.
>
> Maybe it is - I agree the difference might be small. I'm just trying to
> make sure we don't use a term that could mislead reasonable people about
> what we're providing. If we're satisfied that we aren't, then keep it.

I think these terms are used inconsistently enough across the industry
that what would make the most sense would be to use the common term and
document accurately what we mean by it, rather than relying on some
external entity's definition, which could change (like wikipedia's).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-10 20:04:57
Message-ID: AANLkTinbJDsrkF9rsy8Wh_hrrrPP7CQaNcND_1A-aLMe@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 6:21 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Mar 7, 2011 at 5:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> Efficient transaction-controlled synchronous replication.
>>> If a standby is broadcasting reply messages and we have named
>>> one or more standbys in synchronous_standby_names then allow
>>> users who set synchronous_replication to wait for commit, which
>>> then provides strict data integrity guarantees. Design avoids
>>> sending and receiving transaction state information so minimises
>>> bookkeeping overheads. We synchronize with the highest priority
>>> standby that is connected and ready to synchronize. Other standbys
>>> can be defined to takeover in case of standby failure.
>>>
>>> This version has very strict behaviour; more relaxed options
>>> may be added at a later date.
>>
>> Pretty cool! I'd appreciate very much your efforts and contributions.
>
> Here are some more comments:
>
>        if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 ||
> SyncRepRequested())
>
> Whenever synchronous_replication is TRUE, we disable synchronous_commit.
> But, before disabling that, we should check also max_wal_senders and
> synchronous_standby_names? Otherwise, synchronous_commit can
> be disabled unexpectedly even in non replication case.

Yeah, that's bad. At the risk of repeating myself, I don't think this
code should be checking SyncRepRequested() in the first place. If the
user has turned off synchronous_commit, then we should just commit
asynchronously, even if sync rep is otherwise in force. Otherwise,
this if statement is going to get really complicated. The logic is
already at least mildly wrong here anyway: clearly we do NOT need to
commit synchronously if the transaction has not written xlog, even if
sync rep is enabled.

> -                       /* Let the master know that we received some data. */
> -                       XLogWalRcvSendReply();
> -                       XLogWalRcvSendHSFeedback();
>
> This change completely eliminates the difference between write_location
> and flush_location in pg_stat_replication. If this change is reasonable, we
> should get rid of write_location from pg_stat_replication since it's useless.
> If not, this change should be reverted. I'm not sure whether monitoring
> the difference between write and flush locations is useful. But I guess that
> someone thought so and that code was added.

I could go either way on this but clearly we need to do one or the other.

> +       /*
> +        * Current location of the head of the queue. All waiters should have
> +        * a waitLSN that follows this value, or they are currently being woken
> +        * to remove themselves from the queue. Protected by SyncRepLock.
> +        */
> +       XLogRecPtr      lsn;
>
> The comment ", or they are currently being woken to remove themselves
> from the queue" is no longer required because the proc is currently removed
> by walsender.

Fixed.

> I found some typos. The attached patch fixes them.

Committed with minor changes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-10 20:44:33
Message-ID: AANLkTin=49_PyzXvujzRKbY9sxRZQZS9gXc_RePH33k6@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 3:27 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Efficient transaction-controlled synchronous replication.
>> If a standby is broadcasting reply messages and we have named
>> one or more standbys in synchronous_standby_names then allow
>> users who set synchronous_replication to wait for commit, which
>> then provides strict data integrity guarantees. Design avoids
>> sending and receiving transaction state information so minimises
>> bookkeeping overheads. We synchronize with the highest priority
>> standby that is connected and ready to synchronize. Other standbys
>> can be defined to takeover in case of standby failure.
>>
>> This version has very strict behaviour; more relaxed options
>> may be added at a later date.
>
> Pretty cool! I'd appreciate very much your efforts and contributions.
>
> And, I found one bug ;) You seem to have wrongly removed the check
> of max_wal_senders in SyncRepWaitForLSN. This can make the
> backend wait for replication even if max_wal_senders = 0. I could produce
> this problematic situation in my machine. The attached patch fixes this problem.

I committed a slightly different fix for this problem.
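
For readers following along, the kind of guard being discussed can be
sketched as below. This is only an illustration of the idea and not the
committed fix: the function name is hypothetical, and including the
synchronous_standby_names check is an assumption taken from Fujii's
related comment about the commit-time test.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not PostgreSQL source: decide whether a backend
 * should wait for a standby at all before it joins the wait queue.
 */
bool
sync_rep_should_wait(bool sync_rep_requested,
                     int max_wal_senders,
                     const char *synchronous_standby_names)
{
    /* The user did not ask for synchronous replication. */
    if (!sync_rep_requested)
        return false;

    /* No walsender can ever exist, so no standby could release us. */
    if (max_wal_senders <= 0)
        return false;

    /* Assumption: no named standbys means nobody to synchronize with. */
    if (synchronous_standby_names == NULL ||
        synchronous_standby_names[0] == '\0')
        return false;

    return true;
}
```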

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-10 21:28:12
Message-ID: AANLkTikmSJNsB=Y6c8t13TE63m_npNY0ktDtOs4ynZv0@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 4:47 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Anyway, the reload of the configuration file should not
> cause the server to end unexpectedly. IOW, GUC assign hook should
> use GUC_complaint_elevel instead of FATAL, in ereport. The attached
> patch fixes that, and includes two typo fixes.

Committed the typo fixes, and the GUC assign hook fix as a separate commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-11 10:46:03
Message-ID: AANLkTi=f+EtkPU+rRgz=Ay-X80OvCGvxK_O6LpL9_dss@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 11, 2011 at 5:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>        if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 ||
>> SyncRepRequested())
>>
>> Whenever synchronous_replication is TRUE, we disable synchronous_commit.
>> But, before disabling that, we should check also max_wal_senders and
>> synchronous_standby_names? Otherwise, synchronous_commit can
>> be disabled unexpectedly even in non replication case.
>
> Yeah, that's bad.  At the risk of repeating myself, I don't think this
> code should be checking SyncRepRequested() in the first place.  If the
> user has turned off synchronous_commit, then we should just commit
> asynchronously, even if sync rep is otherwise in force.  Otherwise,
> this if statement is going to get really complicated.   The logic is
> already at least mildly wrong here anyway: clearly we do NOT need to
> commit synchronously if the transaction has not written xlog, even if
> sync rep is enabled.

Yeah, not waiting for replication when synchronous_commit is disabled
seems more reasonable.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-17 17:46:37
Message-ID: AANLkTi=koHqna9WMm8_ATJN6c1GOLLp_0Tx6VswKhAdi@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 11, 2011 at 5:46 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Mar 11, 2011 at 5:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>        if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 ||
>>> SyncRepRequested())
>>>
>>> Whenever synchronous_replication is TRUE, we disable synchronous_commit.
>>> But, before disabling that, we should check also max_wal_senders and
>>> synchronous_standby_names? Otherwise, synchronous_commit can
>>> be disabled unexpectedly even in non replication case.
>>
>> Yeah, that's bad.  At the risk of repeating myself, I don't think this
>> code should be checking SyncRepRequested() in the first place.  If the
>> user has turned off synchronous_commit, then we should just commit
>> asynchronously, even if sync rep is otherwise in force.  Otherwise,
>> this if statement is going to get really complicated.   The logic is
>> already at least mildly wrong here anyway: clearly we do NOT need to
>> commit synchronously if the transaction has not written xlog, even if
>> sync rep is enabled.
>
> Yeah, not to wait for replication when synchronous_commit is disabled
> seems to be more reasonable.

On further review, I've changed my mind. Making synchronous_commit
trump synchronous_replication is appealing conceptually, but it's
going to lead to some weird corner cases. For example, a transaction
that drops a non-temporary relation always commits synchronously; and
2PC also ignores synchronous_commit. In the case where
synchronous_commit=off and synchronous_replication=on, we'd either
have to decide that these sorts of transactions aren't going to
replicate synchronously (which would give synchronous_commit a rather
long reach into areas it doesn't currently touch) or else that it's OK
for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE
foo requires sync commit AND sync rep. That's pretty weird.

What makes more sense to me after having thought about this more
carefully is to simply make a blanket rule that when
synchronous_replication=on, synchronous_commit has no effect. That is
easy to understand and document. I'm inclined to think it's OK to let
synchronous_replication have this effect even if max_wal_senders=0 or
synchronous_standby_names=''; you shouldn't turn
synchronous_replication on just for kicks, and I don't think we want
to complicate the test in RecordTransactionCommit() more than
necessary. We should, however, adjust the logic so that a transaction
which has not written WAL can still commit asynchronously, because
such a transaction has only touched temp or unlogged tables and so
it's not important for it to make it to the standby, where that data
doesn't exist anyway.
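
To make the difference concrete, here is a small sketch comparing the
test quoted earlier in this thread with one possible form of the
adjusted test. The struct and function names are hypothetical stand-ins
for the variables in RecordTransactionCommit(), and the second predicate
is one reading of the rule described above, not committed code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the values consulted at commit time. */
typedef struct CommitInputs
{
    bool wrote_xlog;            /* did the transaction write WAL? */
    bool sync_commit;           /* synchronous_commit setting */
    bool force_sync_commit;     /* forceSyncCommit */
    int  nrels;                 /* number of relations being dropped */
    bool sync_rep_requested;    /* synchronous_replication setting */
} CommitInputs;

/* The test as quoted in this thread. */
bool
flush_sync_as_quoted(const CommitInputs *in)
{
    return (in->wrote_xlog && in->sync_commit) ||
           in->force_sync_commit ||
           in->nrels > 0 ||
           in->sync_rep_requested;
}

/*
 * Sketch of the blanket rule: synchronous_replication=on overrides
 * synchronous_commit, but only for transactions that wrote WAL.
 */
bool
flush_sync_sketch(const CommitInputs *in)
{
    return (in->wrote_xlog &&
            (in->sync_commit || in->sync_rep_requested)) ||
           in->force_sync_commit ||
           in->nrels > 0;
}
```

The two differ only for a transaction that wrote no WAL while
synchronous_replication=on: the quoted test forces a synchronous commit,
and the sketch lets it commit asynchronously.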

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-17 17:52:56
Message-ID: AANLkTikP1TZVHp68nqk6p7aMJ4S0tUTxdEnbDdguzL_8@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> -                       /* Let the master know that we received some data. */
>> -                       XLogWalRcvSendReply();
>> -                       XLogWalRcvSendHSFeedback();
>>
>> This change completely eliminates the difference between write_location
>> and flush_location in pg_stat_replication. If this change is reasonable, we
>> should get rid of write_location from pg_stat_replication since it's useless.
>> If not, this change should be reverted. I'm not sure whether monitoring
>> the difference between write and flush locations is useful. But I guess that
>> someone thought so and that code was added.
>
> I could go either way on this but clearly we need to do one or the other.

I'm not really sure why this was part of the synchronous replication
patch, but after mulling it over I think it's probably right to rip
out write_location completely. There shouldn't ordinarily be much of
a gap between write location and flush location, so it's probably not
worth the extra network overhead to keep track of it. We might need
to re-add some form of this in the future if we have a version of
synchronous replication that only waits for confirmation of receipt
rather than for confirmation of flush, but we don't have that in 9.1,
so why bother?

Barring objections, I'll go do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-17 17:56:36
Message-ID: 1300384596.18619.1823.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Thu, 2011-03-17 at 13:46 -0400, Robert Haas wrote:
> On Fri, Mar 11, 2011 at 5:46 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > On Fri, Mar 11, 2011 at 5:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>> if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 ||
> >>> SyncRepRequested())
> >>>
> >>> Whenever synchronous_replication is TRUE, we disable synchronous_commit.
> >>> But, before disabling that, we should check also max_wal_senders and
> >>> synchronous_standby_names? Otherwise, synchronous_commit can
> >>> be disabled unexpectedly even in non replication case.
> >>
> >> Yeah, that's bad. At the risk of repeating myself, I don't think this
> >> code should be checking SyncRepRequested() in the first place. If the
> >> user has turned off synchronous_commit, then we should just commit
> >> asynchronously, even if sync rep is otherwise in force. Otherwise,
> >> this if statement is going to get really complicated. The logic is
> >> already at least mildly wrong here anyway: clearly we do NOT need to
> >> commit synchronously if the transaction has not written xlog, even if
> >> sync rep is enabled.
> >
> > Yeah, not to wait for replication when synchronous_commit is disabled
> > seems to be more reasonable.
>
> On further review, I've changed my mind. Making synchronous_commit
> trump synchronous_replication is appealing conceptually, but it's
> going to lead to some weird corner cases. For example, a transaction
> that drops a non-temporary relation always commits synchronously; and
> 2PC also ignores synchronous_commit. In the case where
> synchronous_commit=off and synchronous_replication=on, we'd either
> have to decide that these sorts of transactions aren't going to
> replicate synchronously (which would give synchronous_commit a rather
> long reach into areas it doesn't currently touch) or else that it's OK
> for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE
> foo requires sync commit AND sync rep. That's pretty weird.
>
> What makes more sense to me after having thought about this more
> carefully is to simply make a blanket rule that when
> synchronous_replication=on, synchronous_commit has no effect. That is
> easy to understand and document. I'm inclined to think it's OK to let
> synchronous_replication have this effect even if max_wal_senders=0 or
> synchronous_standby_names=''; you shouldn't turn
> synchronous_replication on just for kicks, and I don't think we want
> to complicate the test in RecordTransactionCommit() more than
> necessary. We should, however, adjust the logic so that a transaction
> which has not written WAL can still commit asynchronously, because
> such a transaction has only touched temp or unlogged tables and so
> it's not important for it to make it to the standby, where that data
> doesn't exist anyway.

Agree to that.

Not read your other stuff yet, will do that later.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 05:45:44
Message-ID: AANLkTimAcuYBQZpmAyrHzzSFKVNQJKTvPpC-LRNx8WYL@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 2:52 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> -                       /* Let the master know that we received some data. */
>>> -                       XLogWalRcvSendReply();
>>> -                       XLogWalRcvSendHSFeedback();
>>>
>>> This change completely eliminates the difference between write_location
>>> and flush_location in pg_stat_replication. If this change is reasonable, we
>>> should get rid of write_location from pg_stat_replication since it's useless.
>>> If not, this change should be reverted. I'm not sure whether monitoring
>>> the difference between write and flush locations is useful. But I guess that
>>> someone thought so and that code was added.
>>
>> I could go either way on this but clearly we need to do one or the other.
>
> I'm not really sure why this was part of the synchronous replication
> patch, but after mulling it over I think it's probably right to rip
> out write_location completely.  There shouldn't ordinarily be much of
> a gap between write location and flush location, so it's probably not
> worth the extra network overhead to keep track of it.  We might need
> to re-add some form of this in the future if we have a version of
> synchronous replication that only waits for confirmation of receipt
> rather than for confirmation of flush, but we don't have that in 9.1,
> so why bother?
>
> Barring objections, I'll go do that.

I agree with getting rid of write_location.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 06:25:06
Message-ID: AANLkTim0DA09ANoUh+xPt21_dHW6McNWnqXopwiNFYVe@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 2:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On further review, I've changed my mind.  Making synchronous_commit
> trump synchronous_replication is appealing conceptually, but it's
> going to lead to some weird corner cases.  For example, a transaction
> that drops a non-temporary relation always commits synchronously; and
> 2PC also ignores synchronous_commit.  In the case where
> synchronous_commit=off and synchronous_replication=on, we'd either
> have to decide that these sorts of transactions aren't going to
> replicate synchronously (which would give synchronous_commit a rather
> long reach into areas it doesn't currently touch) or else that it's OK
> for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE
> foo requires sync commit AND sync rep.  That's pretty weird.
>
> What makes more sense to me after having thought about this more
> carefully is to simply make a blanket rule that when
> synchronous_replication=on, synchronous_commit has no effect.  That is
> easy to understand and document.  I'm inclined to think it's OK to let
> synchronous_replication have this effect even if max_wal_senders=0 or
> synchronous_standby_names=''; you shouldn't turn
> synchronous_replication on just for kicks, and I don't think we want
> to complicate the test in RecordTransactionCommit() more than
> necessary.  We should, however, adjust the logic so that a transaction
> which has not written WAL can still commit asynchronously, because
> such a transaction has only touched temp or unlogged tables and so
> it's not important for it to make it to the standby, where that data
> doesn't exist anyway.

In the first place, I think that it's complicated to keep those two parameters
separate. What about merging them into one parameter? What I'm thinking
is to remove synchronous_replication and to extend the valid values of
synchronous_commit from on/off to async/local/remote/both. Each value
works as follows.

async = (synchronous_commit = off && synchronous_replication = off)
"async" makes a transaction do local WAL flush and replication
asynchronously.

local = (synchronous_commit = on && synchronous_replication = off)
"local" makes a transaction wait for only local WAL flush.

remote = (synchronous_commit = off && synchronous_replication = on)
"remote" makes a transaction wait for only replication. Local WAL flush is
performed by walwriter. This is useless in 9.1 because we must always
wait for local WAL flush when we wait for replication. But in the future,
if we become able to send WAL before WAL write (i.e., send WAL from
wal_buffers), this might become useful. For 9.1, it seems reasonable to
remove this value.

both = (synchronous_commit = on && synchronous_replication = on)
"both" makes a transaction wait for local WAL flush and replication.

Thoughts?
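
The mapping above can be sketched in C as follows. This is only an illustration: the enum and the function name are hypothetical and do not come from the PostgreSQL source.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not identifiers from PostgreSQL. */
typedef enum
{
    COMMIT_MODE_ASYNC,   /* wait for neither local WAL flush nor replication */
    COMMIT_MODE_LOCAL,   /* wait for local WAL flush only */
    COMMIT_MODE_REMOTE,  /* wait for replication only (proposed to be dropped in 9.1) */
    COMMIT_MODE_BOTH     /* wait for local WAL flush and replication */
} CommitMode;

/* Map the two existing boolean settings onto the single proposed value. */
CommitMode
commit_mode_from_settings(bool synchronous_commit, bool synchronous_replication)
{
    if (synchronous_commit)
        return synchronous_replication ? COMMIT_MODE_BOTH : COMMIT_MODE_LOCAL;
    return synchronous_replication ? COMMIT_MODE_REMOTE : COMMIT_MODE_ASYNC;
}
```

For example, commit_mode_from_settings(true, false) yields COMMIT_MODE_LOCAL, matching the "local" row above.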

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 07:52:07
Message-ID: 1300434727.18619.10231.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Fri, 2011-03-18 at 14:45 +0900, Fujii Masao wrote:
> On Fri, Mar 18, 2011 at 2:52 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>> - /* Let the master know that we received some data. */
> >>> - XLogWalRcvSendReply();
> >>> - XLogWalRcvSendHSFeedback();
> >>>
> >>> This change completely eliminates the difference between write_location
> >>> and flush_location in pg_stat_replication. If this change is reasonable, we
> >>> should get rid of write_location from pg_stat_replication since it's useless.
> >>> If not, this change should be reverted. I'm not sure whether monitoring
> >>> the difference between write and flush locations is useful. But I guess that
> >>> someone thought so and that code was added.
> >>
> >> I could go either way on this but clearly we need to do one or the other.
> >
> > I'm not really sure why this was part of the synchronous replication
> > patch, but after mulling it over I think it's probably right to rip
> > out write_location completely. There shouldn't ordinarily be much of
> > a gap between write location and flush location, so it's probably not
> > worth the extra network overhead to keep track of it. We might need
> > to re-add some form of this in the future if we have a version of
> > synchronous replication that only waits for confirmation of receipt
> > rather than for confirmation of flush, but we don't have that in 9.1,
> > so why bother?
> >
> > Barring objections, I'll go do that.
>
> I agree with getting rid of write_location.

No, don't remove it.

We seem to be just looking for things to tweak without any purpose.
Removing this adds nothing for us.

We will have the column in the future, it is there now, so leave it.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 09:27:13
Message-ID: 4D832571.5030808@bluegap.ch
Lists: pgsql-committers pgsql-hackers

Hi,

sorry for being late to join that bike-shedding discussion.

On 03/07/2011 05:09 PM, Alvaro Herrera wrote:
> I think these terms are used inconsistently enough across the industry
> that what would make the most sense would be to use the common term and
> document accurately what we mean by it, rather than relying on some
> external entity's definition, which could change (like wikipedia's).

I absolutely agree with Alvaro here.

The Wikipedia definition seems to only speak about one local and one
remote node. Requiring an ack from "at least one" remote node seems to
cover that.

Not even Wikipedia goes further in their definition and tries to explain
what 'synchronous replication' could mean in case we have more than two
nodes. A somewhat common expectation is that all nodes would have to
ack. However, with such a requirement a single node failure brings your
cluster to a full stop. So this isn't a practical option.

Google invented the term "semi-synchronous" for something that's
essentially the same as what we have now, I think. However, I
wholeheartedly hate that term (based on the reasoning that there's no
semi-pregnant, either).

Others (like me) use "synchronous" or (lately rather) "eager" to mean
that only a majority of nodes need to send an ACK. I have to explain
what I mean every time.

In the end, I don't have a strong opinion either way, anymore. I'm
happy to think of the replication between the master and the one standby
that's sending an ACK first as "synchronous". (Even if those may well
be different standbys for different transactions.)

Hope to have brought some light into this discussion.

Regards

Markus Wanner


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 12:16:08
Message-ID: AANLkTikUAOYoStwkwG+DZOzTwT2QVj0H9aDywpzxvhxn@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 3:52 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I agree with getting rid of write_location.
>
> No, don't remove it.
>
> We seem to be just looking for things to tweak without any purpose.
> Removing this adds nothing for us.
>
> We will have the column in the future, it is there now, so leave it.

Well then can we revert the part of your patch that causes it to not
actually work any more?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 12:25:49
Message-ID: AANLkTim_Bk=jP6c6ibfDO8ujTfREMtrpxsdtaWd=nCZ9@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 2:25 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> In the first place, I think that it's complicated to keep those two parameters
> separate. What about merging them into one parameter? What I'm thinking
> is to remove synchronous_replication and to extend the valid values of
> synchronous_commit from on/off to async/local/remote/both. Each value
> works as follows.
>
>    async   = (synchronous_commit = off && synchronous_replication = off)
>    "async" makes a transaction do local WAL flush and replication
> asynchronously.
>
>    local     = (synchronous_commit = on && synchronous_replication = off)
>    "local" makes a transaction wait for only local WAL flush.
>
>    remote = (synchronous_commit = off && synchronous_replication = on)
>    "remote" makes a transaction wait for only replication. Local WAL flush is
>    performed by walwriter. This is useless in 9.1 because we must always
>    wait for local WAL flush when we wait for replication. But in the future,
>    if we become able to send WAL before WAL write (i.e., send WAL from
>    wal_buffers), this might become useful. For 9.1, it seems reasonable to
>    remove this value.
>
>    both     = (synchronous_commit = on && synchronous_replication = on)
>    "both" makes a transaction wait for local WAL flush and replication.
>
> Thoughts?

Well, if we want to make this all use one parameter, the obvious way
to do it that wouldn't break backward compatibility is to remove the
synchronous_replication parameter altogether and let
synchronous_commit take on the values on/local/off, where on means
wait for sync rep if it's enabled (i.e.
synchronous_standby_names!=''&&max_wal_senders>0) or otherwise just
wait for local WAL flush, local means just wait for local WAL flush,
and off means commit asynchronously.
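
To make the on/local/off idea concrete, here is a minimal sketch in C. The type and function names are hypothetical and are not taken from the PostgreSQL source; the only inputs assumed are the three settings mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not identifiers from PostgreSQL. */
typedef enum { SC_OFF, SC_LOCAL, SC_ON } SyncCommitLevel;
typedef enum { WAIT_NOTHING, WAIT_LOCAL_FLUSH, WAIT_SYNC_REP } CommitWait;

/* Sync rep counts as enabled only if a standby is named and WAL senders exist. */
bool
sync_rep_enabled(const char *synchronous_standby_names, int max_wal_senders)
{
    return synchronous_standby_names != NULL &&
           synchronous_standby_names[0] != '\0' &&
           max_wal_senders > 0;
}

/* Decide what a committing transaction waits for under the proposal. */
CommitWait
commit_wait_for(SyncCommitLevel level, const char *synchronous_standby_names,
                int max_wal_senders)
{
    switch (level)
    {
        case SC_OFF:
            return WAIT_NOTHING;        /* asynchronous commit */
        case SC_LOCAL:
            return WAIT_LOCAL_FLUSH;    /* never wait for a standby */
        case SC_ON:
            /* In 9.1, waiting for sync rep also implies the local flush. */
            return sync_rep_enabled(synchronous_standby_names, max_wal_senders)
                ? WAIT_SYNC_REP : WAIT_LOCAL_FLUSH;
    }
    return WAIT_LOCAL_FLUSH;            /* not reached for valid input */
}
```

With this shape, an existing configuration with synchronous_commit = on and no named standby keeps its current behaviour, which is the backward-compatibility point.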

I'm OK with doing it that way if there's consensus on it, but I'm not
eager to break backward compatibility. Simon/Heikki, any opinion on
that approach?

If we don't have consensus on that then I think we should just do what
I proposed above (and Simon agreed to). I am not eager to spend any
longer than necessary hammering this out; I want to get to beta.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 13:16:24
Message-ID: AANLkTi=B+cWE-pQJQM=zoA9=Zr25W7zPPMFvvBe11O4E@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> Google invented the term "semi-synchronous" for something that's
> essentially the same as what we have now, I think.  However, I
> wholeheartedly hate that term (based on the reasoning that there's no
> semi-pregnant, either).

We didn't invent the term, we just implemented something that Heikki
Tuuri briefly described, for example:
http://bugs.mysql.com/bug.php?id=7440

In the Google patch and official MySQL version, the sequence is:
1) commit on master
2) wait for slave to ack
3) return to user

After step 1 another user on the master can observe the commit and the
following is possible:
1) commit on master
2) other user observes that commit on master
3) master blows up and a user observed a commit that never made it to a slave

I do not think this sequence should be possible in a sync replication
system. But it is possible in what has been implemented for MySQL.
Thus it was named semi-sync rather than sync.

--
Mark Callaghan
mdcallag(at)gmail(dot)com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 13:30:56
Message-ID: AANLkTimATgm3Dt-vOPL9CyqvfJ4knkgmJ_srB7iyfaiz@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 9:16 AM, MARK CALLAGHAN <mdcallag(at)gmail(dot)com> wrote:
> On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
>> Google invented the term "semi-synchronous" for something that's
>> essentially the same as what we have now, I think.  However, I
>> wholeheartedly hate that term (based on the reasoning that there's no
>> semi-pregnant, either).
>
> We didn't invent the term, we just implemented something that Heikki
> Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440
>
> In the Google patch and official MySQL version, the sequence is:
> 1) commit on master
> 2) wait for slave to ack
> 3) return to user
>
> After step 1 another user on the master can observe the commit and the
> following is possible:
> 1) commit on master
> 2) other user observes that commit on master
> 3) master blows up and a user observed a commit that never made it to a slave
>
> I do not think this sequence should be possible in a sync replication
> system. But it is possible in what has been implemented for MySQL.
> Thus it was named semi-sync rather than sync.

Thanks for the insight. That can't happen with our implementation, I believe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Markus Wanner" <markus(at)bluegap(dot)ch>, "MARK CALLAGHAN" <mdcallag(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 13:40:22
Message-ID: 4D831A76020000250003BA82@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

MARK CALLAGHAN <mdcallag(at)gmail(dot)com> wrote:
> Markus Wanner <markus(at)bluegap(dot)ch> wrote:

>> Google invented the term "semi-synchronous" for something that's
>> essentially the same as what we have now, I think. However, I
>> wholeheartedly hate that term (based on the reasoning that there's no
>> semi-pregnant, either).

To be fair, what we're considering calling semi-synchronous is
something which tries to stay in synchronous mode but switches out
of it when necessary to meet availability targets. Your analogy
doesn't match up at all well -- at least without getting really
ugly.

> We didn't invent the term, we just implemented something that
> Heikki Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440
>
> In the Google patch and official MySQL version, the sequence is:
> 1) commit on master
> 2) wait for slave to ack
> 3) return to user
>
> After step 1 another user on the master can observe the commit and
> the following is possible:
> 1) commit on master
> 2) other user observes that commit on master
> 3) master blows up and a user observed a commit that never made it
> to a slave
>
> I do not think this sequence should be possible in a sync
> replication system.

Then the only thing you would consider sync replication, as far as I
can see, is two phase commit, which we already have. So your use
case seems to be covered already, and we're trying to address other
people's needs. The guarantee that some people are looking for is
that a successful commit means that the data has been persisted on
two separate servers. Others want to try for that, but are willing
to compromise it for HA; in general I think they want to know when
the guarantee is not there so they can take action to get back to a
safer condition.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 14:10:03
Message-ID: AANLkTinRkoYcfrkMkzFsA3trGfwSxiqLDMw+PA9dicDt@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Mar 7, 2011 at 3:44 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Mar 7, 2011 at 5:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> Efficient transaction-controlled synchronous replication.
>>> If a standby is broadcasting reply messages and we have named
>>> one or more standbys in synchronous_standby_names then allow
>>> users who set synchronous_replication to wait for commit, which
>>> then provides strict data integrity guarantees. Design avoids
>>> sending and receiving transaction state information so minimises
>>> bookkeeping overheads. We synchronize with the highest priority
>>> standby that is connected and ready to synchronize. Other standbys
>>> can be defined to takeover in case of standby failure.
>>>
>>> This version has very strict behaviour; more relaxed options
>>> may be added at a later date.
>>
>> Pretty cool! I very much appreciate your efforts and contributions.
>>
>> And I found one bug ;) You seem to have wrongly removed the check
>> of max_wal_senders in SyncRepWaitForLSN. This can make the
>> backend wait for replication even if max_wal_senders = 0. I could reproduce
>> this problematic situation on my machine. The attached patch fixes this problem.
>
>        if (strlen(SyncRepStandbyNames) > 0 && max_wal_senders == 0)
>                ereport(ERROR,
>                                (errmsg("Synchronous replication requires WAL streaming (max_wal_senders > 0)")));
>
> Shouldn't the above check also be required after pg_ctl reload, since
> synchronous_standby_names can be changed by SIGHUP?
> Or how about just removing that? If the patch I submitted is
> committed, empty synchronous_standby_names and max_wal_senders = 0
> settings are no longer unsafe.

This configuration is now harmless in the sense that it no longer
horribly breaks the entire system, but it's still pretty useless, so
this might be deemed a valuable sanity check. However, I'm reluctant
to leave it in there, because someone could change their config to
this state, pg_ctl reload, see everything working, and then later stop
the cluster and be unable to start it back up again. Since most
people don't shut their database systems down very often, they might
not discover that they have an invalid config until much later. I
think it's probably not a good idea to have configs that are valid on
reload but prevent startup, so I'm inclined to either remove this
check altogether or downgrade it to a warning.
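
As a rough sketch of the "downgrade it to a warning" option: the function below uses the same condition as the quoted check but only reports the problem, so a config that passes at reload cannot block a later startup. The function name, the result enum, and the message wording are hypothetical; real code would go through ereport(WARNING, ...) rather than fprintf.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical names for illustration; not identifiers from PostgreSQL. */
typedef enum { CONFIG_OK, CONFIG_WARNING } ConfigCheckResult;

/*
 * Same condition as the quoted check, but it only warns.  Because it never
 * fails hard, it can be run identically at startup and after a reload.
 */
ConfigCheckResult
check_sync_standby_config(const char *sync_standby_names, int wal_senders)
{
    if (strlen(sync_standby_names) > 0 && wal_senders == 0)
    {
        fprintf(stderr,
                "WARNING: synchronous_standby_names is set but "
                "max_wal_senders = 0, so no standby can connect\n");
        return CONFIG_WARNING;
    }
    return CONFIG_OK;
}
```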

As a side note, it's not very obvious why some parts of PostmasterMain
report problems by doing write_stderr() and exit() while other parts
use ereport(ERROR). This check and the nearby checks on WAL level are
immediately preceded and followed by other checks that use the
opposite technique.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 14:19:18
Message-ID: 1300457958.18619.13780.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Fri, 2011-03-18 at 13:16 +0000, MARK CALLAGHAN wrote:
> On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> > Google invented the term "semi-synchronous" for something that's
> > essentially the same as what we have now, I think. However, I
> > wholeheartedly hate that term (based on the reasoning that there's no
> > semi-pregnant, either).
>
> We didn't invent the term, we just implemented something that Heikki
> Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440
>
> In the Google patch and official MySQL version, the sequence is:
> 1) commit on master
> 2) wait for slave to ack
> 3) return to user
>
> After step 1 another user on the master can observe the commit and the
> following is possible:
> 1) commit on master
> 2) other user observes that commit on master
> 3) master blows up and a user observed a commit that never made it to a slave
>
> I do not think this sequence should be possible in a sync replication
> system. But it is possible in what has been implemented for MySQL.
> Thus it was named semi-sync rather than sync.

Thanks for clearing it up, Mark.

We should definitely not be calling what we have "semi-sync". The
semantics are very different.

In PostgreSQL other users cannot observe the commit until an
acknowledgement has been received.
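
The difference between the two orderings can be shown with a toy model in C. It is only a sketch of the sequences described in this thread, not of how either system is implemented, and every name in it is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical toy model; not code from PostgreSQL or MySQL. */
typedef enum
{
    ORDER_VISIBLE_THEN_ACK,  /* commit becomes visible, then wait for the ack */
    ORDER_ACK_THEN_VISIBLE   /* wait for the ack, then make the commit visible */
} CommitOrdering;

typedef struct
{
    bool committed_locally;  /* commit performed on the master */
    bool acked_by_standby;   /* a standby acknowledged the commit */
    bool visible_to_others;  /* other sessions on the master can see it */
} TxnState;

/*
 * Run one commit.  If ack_arrives is false, the master fails while it is
 * still waiting for the standby, and the state at that moment is returned.
 */
TxnState
run_commit(CommitOrdering ordering, bool ack_arrives)
{
    TxnState t = { false, false, false };

    t.committed_locally = true;
    if (ordering == ORDER_VISIBLE_THEN_ACK)
        t.visible_to_others = true;

    if (!ack_arrives)
        return t;               /* master blew up during the wait */

    t.acked_by_standby = true;
    t.visible_to_others = true;
    return t;
}

/* True if some user could have seen a commit that never reached a standby. */
bool
observed_but_not_replicated(TxnState t)
{
    return t.visible_to_others && !t.acked_by_standby;
}
```

In this model only ORDER_VISIBLE_THEN_ACK with a lost acknowledgement produces the anomaly; with ORDER_ACK_THEN_VISIBLE the commit is never seen by others before it is acknowledged.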

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 14:19:56
Message-ID: 4D836A0C.3000306@bluegap.ch
Lists: pgsql-committers pgsql-hackers

Mark,

On 03/18/2011 02:16 PM, MARK CALLAGHAN wrote:
> We didn't invent the term, we just implemented something that Heikki
> Tuuri briefly described, for example:
> http://bugs.mysql.com/bug.php?id=7440

Oh, okay, good to know who to blame ;-) However, I didn't mean to
offend anybody.

> I do not think this sequence should be possible in a sync replication
> system. But it is possible in what has been implemented for MySQL.
> Thus it was named semi-sync rather than sync.

Sure?

Their documentation [1] isn't entirely clear on that at first: "the master
blocks after the commit is done and waits until at least one
semisynchronous slave acknowledges that it has received all events for
the transaction" and the "slave acknowledges receipt of a transaction's
events only after the events have been written to its relay log and
flushed to disk".

But then continues to say that "[the master is] waiting for
acknowledgment from a slave after having performed a commit", so this
indeed sounds like the transaction is visible to other sessions before
the slave ACKs.

So, semi-sync may show temporary inconsistencies in case of a master
failure. Wow!

Regards

Markus Wanner

[1] MySQL 5.5 reference manual, 17.3.8. Semisynchronous Replication:
http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 14:37:26
Message-ID: 4D836E26.6080503@bluegap.ch
Lists: pgsql-committers pgsql-hackers

Hi,

On 03/18/2011 02:40 PM, Kevin Grittner wrote:
> Then the only thing you would consider sync replication, as far as I
> can see, is two phase commit

I think waiting for the ACK before actually making the changes from the
transaction visible (COMMIT) would suffice for disallowing such an
inconsistency to manifest. But obviously, MySQL decided it's not worth
doing that, as it's such a rare event and a short period of time that
may show inconsistencies...

> people's needs. The guarantee that some people are looking for is
> that a successful commit means that the data has been persisted on
> two separate servers.

Well, MySQL's semi-sync also seems to guarantee that WRT the client
confirmation. And transactions always appear committed *before* the
client receives the COMMIT acknowledgement, due to the time it takes for
the ACK to arrive at the client.

It's just the commit *before* receiving the slave's ACK which might
make a transaction visible that's not durable yet. But I guess that
simplified the implementation for them...

Regards

Markus Wanner


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "MARK CALLAGHAN" <mdcallag(at)gmail(dot)com>
Cc: "Markus Wanner" <markus(at)bluegap(dot)ch>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 14:52:13
Message-ID: 4D832B4D020000250003BA98@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:

> In PostgreSQL other users cannot observe the commit until an
> acknowledgement has been received.

Really? I hadn't picked up on that. That makes for a lot of
complication on crash-and-recovery of a master, but if we can pull
it off, that's really cool. If we do that and MySQL doesn't, we
definitely don't want to use the same terminology they do, which
would imply the same behavior.

Apologies for not picking up on that aspect of the implementation.

-Kevin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 14:55:56
Message-ID: AANLkTimVQe4RNqG6jb4XgxSxe2u_3oh+8Jzssn5jDE65@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> What makes more sense to me after having thought about this more
> carefully is to simply make a blanket rule that when
> synchronous_replication=on, synchronous_commit has no effect.  That is
> easy to understand and document.

For what it's worth "has no effect" doesn't make much sense to me.
It's a boolean, either commits are going to block or they're not.

What happened to the idea of a three-way switch?

synchronous_commit = off
synchronous_commit = disk
synchronous_commit = replica

With "on" being a synonym for "disk" for backwards compatibility.

Then we could add more options later for more complex conditions like
waiting for one server in each data centre or waiting for one of a
certain set of servers ignoring the less reliable mirrors, etc.
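
A sketch of how such a three-way setting might be parsed in C, with "on" accepted as a synonym for "disk". The enum and function names are hypothetical and not taken from the PostgreSQL source.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical names for illustration; not identifiers from PostgreSQL. */
typedef enum
{
    SYNC_COMMIT_OFF,      /* do not wait at commit */
    SYNC_COMMIT_DISK,     /* wait for local WAL flush */
    SYNC_COMMIT_REPLICA   /* wait for the synchronous standby */
} SyncCommitSetting;

/* Parse the setting; returns false and leaves *result alone if unrecognized. */
bool
parse_sync_commit_setting(const char *value, SyncCommitSetting *result)
{
    if (strcmp(value, "off") == 0)
        *result = SYNC_COMMIT_OFF;
    else if (strcmp(value, "disk") == 0 || strcmp(value, "on") == 0)
        *result = SYNC_COMMIT_DISK;     /* "on" kept for backwards compatibility */
    else if (strcmp(value, "replica") == 0)
        *result = SYNC_COMMIT_REPLICA;
    else
        return false;
    return true;
}
```

More values could be added to the enum later without changing how the existing ones parse.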

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 15:07:23
Message-ID: AANLkTikmbNW=phsUiXJMLv7AaptObRL-pCea-Si-T=ms@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 10:55 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> What makes more sense to me after having thought about this more
>> carefully is to simply make a blanket rule that when
>> synchronous_replication=on, synchronous_commit has no effect.  That is
>> easy to understand and document.
>
> For what it's worth "has no effect" doesn't make much sense to me.
> It's a boolean, either commits are going to block or they're not.
>
> What happened to the idea of a three-way switch?
>
> synchronous_commit = off
> synchronous_commit = disk
> synchronous_commit = replica
>
> With "on" being a synonym for "disk" for backwards compatibility.
>
> Then we could add more options later for more complex conditions like
> waiting for one server in each data centre or waiting for one of a
> certain set of servers ignoring the less reliable mirrors, etc.

This is similar to what I suggested upthread, except that I suggested
on/local/off, with the default being on. That way if you set
synchronous_standby_names, you get synchronous replication without
changing another setting, but you can say local instead if for some
reason you want the middle behavior. If we're going to do it all with
one GUC, I think that way makes more sense. If you're running sync
rep, you might still have some transactions that you don't care about,
but that's what async commit is for. It's a funny kind of transaction
that we're OK with losing if we have a failover but we're not OK with
losing if we have a local crash from which we recover without failing
over.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 15:40:57
Message-ID: 1300462857.18619.14696.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Fri, 2011-03-18 at 11:07 -0400, Robert Haas wrote:
> On Fri, Mar 18, 2011 at 10:55 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> What makes more sense to me after having thought about this more
> >> carefully is to simply make a blanket rule that when
> >> synchronous_replication=on, synchronous_commit has no effect. That is
> >> easy to understand and document.
> >
> > For what it's worth "has no effect" doesn't make much sense to me.
> > It's a boolean, either commits are going to block or they're not.
> >
> > What happened to the idea of a three-way switch?
> >
> > synchronous_commit = off
> > synchronous_commit = disk
> > synchronous_commit = replica
> >
> > With "on" being a synonym for "disk" for backwards compatibility.
> >
> > Then we could add more options later for more complex conditions like
> > waiting for one server in each data centre or waiting for one of a
> > certain set of servers ignoring the less reliable mirrors, etc.
>
> This is similar to what I suggested upthread, except that I suggested
> on/local/off, with the default being on. That way if you set
> synchronous_standby_names, you get synchronous replication without
> changing another setting, but you can say local instead if for some
> reason you want the middle behavior. If we're going to do it all with
> one GUC, I think that way makes more sense. If you're running sync
> rep, you might still have some transactions that you don't care about,
> but that's what async commit is for. It's a funny kind of transaction
> that we're OK with losing if we have a failover but we're not OK with
> losing if we have a local crash from which we recover without failing
> over.

I much prefer a single switch, which is what I originally suggested.
Changing the meaning of synchronous_commit seems a problem.

durability = localmemory
durability = localdisk
(durability = remotereceive - has no meaning in current code)
durability = remotedisk
durability = remoteapply

it also allows us to have in the future
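
A minimal sketch of how a single ordered switch like this could be modelled. The level names come from the list above; the numeric ordering, the Python enum and both helper functions are assumptions for illustration only, not PostgreSQL code.

```python
from enum import IntEnum


class Durability(IntEnum):
    """Hypothetical ordering of the proposed durability levels."""
    LOCALMEMORY = 0    # commit returns before the local WAL flush
    LOCALDISK = 1      # commit waits for the local WAL flush
    REMOTERECEIVE = 2  # listed above as having no meaning in current code
    REMOTEDISK = 3     # commit waits for the standby to flush
    REMOTEAPPLY = 4    # commit waits for the standby to apply


def waits_for_local_flush(level: Durability) -> bool:
    # Assumption: every level above localmemory implies a local flush.
    return level >= Durability.LOCALDISK


def waits_for_standby(level: Durability) -> bool:
    # Assumption: the remote* levels are the ones that wait for a standby.
    return level >= Durability.REMOTERECEIVE


print(waits_for_standby(Durability.LOCALDISK))       # False
print(waits_for_standby(Durability.REMOTEDISK))      # True
print(waits_for_local_flush(Durability.REMOTEAPPLY)) # True
```

With an ordered setting, "at least as durable as X" becomes a simple comparison, which is what makes adding further levels later straightforward.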

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 15:43:38
Message-ID: 4D837DAA.4000402@bluegap.ch
Lists: pgsql-committers pgsql-hackers

On 03/18/2011 03:52 PM, Kevin Grittner wrote:
> Really? I hadn't picked up on that. That makes for a lot of
> complication on crash-and-recovery of a master

What complication do you have in mind here?

I think of it the opposite way (at least for Postgres, that is):
committing a transaction that's not acknowledged means having to revert
a (locally only) committed transaction if you want to use the current
data to recover to some cluster-agreed state. (Of course, you can
always simply transfer the whole

If you don't commit the transaction before the ACK in the first place,
you don't have anything special to do upon recovery.

Regards

Markus Wanner


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 15:47:46
Message-ID: 4D837EA2.2070205@enterprisedb.com
Lists: pgsql-committers pgsql-hackers

On 18.03.2011 16:52, Kevin Grittner wrote:
> Simon Riggs<simon(at)2ndQuadrant(dot)com> wrote:
>
>> In PostgreSQL other users cannot observe the commit until an
>> acknowledgement has been received.
>
> Really? I hadn't picked up on that. That makes for a lot of
> complication on crash-and-recovery of a master, but if we can pull
> it off, that's really cool. If we do that and MySQL doesn't, we
> definitely don't want to use the same terminology they do, which
> would imply the same behavior.

To be clear: other users cannot observe the commit until standby
acknowledges it - unless the master crashes while waiting for the
acknowledgment. If that happens, the commit will be visible to everyone
after recovery.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 15:52:18
Message-ID: AANLkTik058OfhS-xEcNSJdGhcxS4sbntPQdKj9cLQY2g@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 2:19 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:

> Their documentation [1] isn't entirely clear on that first: "the master
> blocks after the commit is done and waits until at least one
> semisynchronous slave acknowledges that it has received all events for
> the transaction" and the "slave acknowledges receipt of a transaction's
> events only after the events have been written to its relay log and
> flushed to disk".
>
> But then continues to say that "[the master is] waiting for
> acknowledgment from a slave after having performed a commit", so this
> indeed sounds like the transaction is visible to other sessions before
> the slave ACKs.

Yes, their docs are not clear on this.

--
Mark Callaghan
mdcallag(at)gmail(dot)com


From: MARK CALLAGHAN <mdcallag(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 16:03:03
Message-ID: AANLkTi=v5n4ODwfUU+Df_BKpk49r_U=FMHtOnYUNPFa5@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 2:37 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> Hi,
>
> On 03/18/2011 02:40 PM, Kevin Grittner wrote:
>> Then the only thing you would consider sync replication, as far as I
>> can see, is two phase commit
>
> I think waiting for the ACK before actually making the changes from the
> transaction visible (COMMIT) would suffice for disallowing such an
> inconsistency to manifest.  But obviously, MySQL decided it's not worth
> doing that, as it's such a rare event and a short period of time that
> may show inconsistencies...

There are fewer options for implementing this in MySQL because
replication requires a binlog on the master and that requires the
internal use of XA to keep the binlog and InnoDB in sync as they are
separate resource managers. In theory, this can be changed so that
commit is only forced for the binlog and then on a crash missing
transactions could be copied from the binlog to InnoDB but I don't
think this will ever change.

By "fewer options" I mean that commit in MySQL with InnoDB and the
binlog requires:
1) prepare to InnoDB (force transaction log to disk for changes from
this transaction)
2) write binlog events from this transaction to the binlog
3) write XID event to the binlog (at this point transaction commit is
official, will survive a crash)
4) force binlog to disk
5) release row locks held by transaction in innodb
6) write commit record to innodb transaction log
7) force write of commit record to disk

Group commit is done for the fsyncs from steps 1 and 7. It is not done
for the fsync done in step 4.
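
The sequence above can be written down as data to make the fsync count explicit. This is only a sketch of the seven steps as described in this message; the `Step` type and the helper functions are hypothetical and do not correspond to anything in the MySQL source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    number: int
    action: str
    fsync: bool          # does this step force data to disk?
    group_commit: bool   # is group commit applied to that fsync?


COMMIT_STEPS = [
    Step(1, "prepare to InnoDB (force transaction log to disk)", True, True),
    Step(2, "write binlog events for this transaction", False, False),
    Step(3, "write XID event to the binlog", False, False),
    Step(4, "force binlog to disk", True, False),
    Step(5, "release row locks held in InnoDB", False, False),
    Step(6, "write commit record to InnoDB transaction log", False, False),
    Step(7, "force write of commit record to disk", True, True),
]


def fsync_steps(steps):
    """Step numbers that force something to disk."""
    return [s.number for s in steps if s.fsync]


def ungrouped_fsync_steps(steps):
    """Forced writes that do not benefit from group commit."""
    return [s.number for s in steps if s.fsync and not s.group_commit]


print(fsync_steps(COMMIT_STEPS))            # [1, 4, 7]
print(ungrouped_fsync_steps(COMMIT_STEPS))  # [4]
```

So each commit pays for three forced writes, and the one in step 4 is the one that cannot be shared between concurrent transactions.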

Regardless, the processing above is complicated even without
semi-sync. AFAIK, semi-sync code occurs after step 7 but I have not
looked at the official version of semi-sync code in MySQL and my
memory of the work we did at Google is vague.

It is great if Postgres doesn't have this issue. It wasn't clear to me
from lurking on this list. I hope your docs highlight the behavior as
not having the issue is a big deal.

--
Mark Callaghan
mdcallag(at)gmail(dot)com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 16:19:31
Message-ID: 1300465171.18619.15193.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
> On 18.03.2011 16:52, Kevin Grittner wrote:
> > Simon Riggs<simon(at)2ndQuadrant(dot)com> wrote:
> >
> >> In PostgreSQL other users cannot observe the commit until an
> >> acknowledgement has been received.
> >
> > Really? I hadn't picked up on that. That makes for a lot of
> > complication on crash-and-recovery of a master, but if we can pull
> > it off, that's really cool. If we do that and MySQL doesn't, we
> > definitely don't want to use the same terminology they do, which
> > would imply the same behavior.
>
> To be clear: other users cannot observe the commit until standby
> acknowledges it - unless the master crashes while waiting for the
> acknowledgment. If that happens, the commit will be visible to everyone
> after recovery.

No, only in the case where you choose not to failover to the standby
when you crash, which would be a fairly strange choice after the effort
to set up the standby. In a correctly configured and operated cluster
what I say above is fully correct and needs no addendum.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Markus Wanner" <markus(at)bluegap(dot)ch>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "MARK CALLAGHAN" <mdcallag(at)gmail(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 16:27:57
Message-ID: 4D8341BD020000250003BAB2@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

> On 18.03.2011 16:52, Kevin Grittner wrote:
>> Simon Riggs<simon(at)2ndQuadrant(dot)com> wrote:
>>
>>> In PostgreSQL other users cannot observe the commit until an
>>> acknowledgement has been received.
>>
>> Really? I hadn't picked up on that. That makes for a lot of
>> complication on crash-and-recovery of a master, but if we can
>> pull it off, that's really cool.

Markus Wanner <markus(at)bluegap(dot)ch> wrote:

> What complication do you have in mind here?

Basically, what Heikki addresses. It has to be committed after
crash and recovery, and deal with replicas which may or may not have
been notified and may or may not have applied the transaction.

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> To be clear: other users cannot observe the commit until standby
> acknowledges it - unless the master crashes while waiting for the
> acknowledgment. If that happens, the commit will be visible to
> everyone after recovery.

Right. If other transactions cannot see the transaction before the
COMMIT returns, I was kinda assuming that this was the behavior,
because otherwise one or more replicas could be ahead of the master
after recovery, which would be horribly broken. I agree that the
behavior which you describe is much better than allowing other
transactions to see the work of the pending COMMIT.

In fact, on further reflection, allowing other transactions to see
work before the committing transaction returns could lead to broken
behavior if that viewing transaction took some action based on
that, the master crashed, recovery was done using a standby, and
that standby hadn't persisted the transaction. So this behavior is
necessary for good behavior. Even though that "perfect storm" of
events might be fairly rare, the difference in the level of
confidence in correctness is significant, and certainly something to
brag about.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 16:33:26
Message-ID: AANLkTinSxW_DrJwwfcOoEoJV-UeuVZjrtoSKeP5R69WE@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 12:19 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
>> On 18.03.2011 16:52, Kevin Grittner wrote:
>> > Simon Riggs<simon(at)2ndQuadrant(dot)com>  wrote:
>> >
>> >> In PostgreSQL other users cannot observe the commit until an
>> >> acknowledgement has been received.
>> >
>> > Really?  I hadn't picked up on that.  That makes for a lot of
>> > complication on crash-and-recovery of a master, but if we can pull
>> > it off, that's really cool.  If we do that and MySQL doesn't, we
>> > definitely don't want to use the same terminology they do, which
>> > would imply the same behavior.
>>
>> To be clear: other users cannot observe the commit until standby
>> acknowledges it - unless the master crashes while waiting for the
>> acknowledgment. If that happens, the commit will be visible to everyone
>> after recovery.
>
> No, only in the case where you choose not to failover to the standby
> when you crash, which would be a fairly strange choice after the effort
> to set up the standby. In a correctly configured and operated cluster
> what I say above is fully correct and needs no addendum.

Except it doesn't work that way. If, say, a backend on the master
core dumps, the system will perform a crash and restart cycle, and the
transaction will become visible whether it's yet been replicated or
not. Since we now have a GUC to suppress restart after a backend
crash, it's theoretically possible to set up the system so that this
doesn't occur, but it'd take quite a bit of work to make it robust and
automatic, and it's certainly not the default out of the box.

The fundamental problem here is that once you update CLOG and flush
the corresponding WAL record, there is no going backward. You can
hold the system in some intermediate state where the transaction still
holds locks and is excluded from MVCC snapshots, but there's no way to
back up. So there are bound to be corner cases where the
wait doesn't last as long as you want, and stuff leaks out around the
edges. It's fundamentally impossible to guarantee that you'll remain
in that intermediate state forever - what do you do if a meteor hits
the synchronous standby and at the same time you lose power to the
master? No amount of configuration will save you from coming back on
line with a visible-but-unreplicated transaction. I'm not knocking
the system; I think what we have is impressively good. But pretending
that corner cases can't happen gets us nowhere.
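
A toy model of the corner case, to make it concrete. It assumes only what is described above: the commit record is flushed locally first, the transaction is then held out of snapshots while waiting for the standby, and that wait state lives only in memory. The `ToyMaster` class is purely illustrative and is not how the backend is structured.

```python
class ToyMaster:
    """Illustrative model of a master waiting for a sync rep ACK."""

    def __init__(self):
        self.flushed_commits = set()  # commit records durable in local WAL
        self.waiting_for_ack = set()  # flushed, but still hidden from snapshots

    def commit(self, xid):
        # Once the commit record is flushed there is no going backward.
        self.flushed_commits.add(xid)
        self.waiting_for_ack.add(xid)

    def standby_ack(self, xid):
        self.waiting_for_ack.discard(xid)

    def visible(self):
        return self.flushed_commits - self.waiting_for_ack

    def crash_and_recover(self):
        # The in-memory wait state is lost; recovery replays the WAL,
        # so every flushed commit becomes visible.
        self.waiting_for_ack = set()


m = ToyMaster()
m.commit(100)
print(m.visible())   # set()
m.crash_and_recover()
print(m.visible())   # {100}
```

The transaction was never acknowledged by the standby, yet it is visible after the restart.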

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Markus Wanner" <markus(at)bluegap(dot)ch>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "MARK CALLAGHAN" <mdcallag(at)gmail(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 16:48:57
Message-ID: 4D8346A9020000250003BAC3@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>> No, only in the case where you choose not to failover to the
>> standby when you crash, which would be a fairly strange choice
>> after the effort to set up the standby. In a correctly configured
>> and operated cluster what I say above is fully correct and needs
>> no addendum.

> what do you do if a meteor hits the synchronous standby and at the
> same time you lose power to the master? No amount of
> configuration will save you from coming back on line with a
> visible-but-unreplicated transaction.

You don't even need to postulate an extreme condition like that; we
prefer to have a DBA pull the trigger on a failover, rather than
trust the STONITH call to software. This is particularly true when
the master is local to its primary users and the replica is remote
to them.

-Kevin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 17:35:46
Message-ID: AANLkTikY31HTM=JPY+UyKG1EuhMy-iyaDbLH-j51uLfj@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 4:33 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> The fundamental problem here is that once you update CLOG and flush
> the corresponding WAL record, there is no going backward.  You can
> hold the system in some intermediate state where the transaction still
> holds locks and is excluded from MVCC snapshots, but there's no way to
> back up.  So there are bound to be corner cases where the
> wait doesn't last as long as you want, and stuff leaks out around the
> edges.

I'm finding this whole idea of hiding the committed transaction until
the slave acks it kind of strange. It means there are times when the
slave is actually *ahead* of the master which would actually be kind
of hard to code against if you're trying to use the slave as a
possibly-not-up-to-date mirror.

I think promising that the COMMIT doesn't return until the transaction
and all previous transactions are replicated is enough. We don't have
to promise that nobody else will see it either. Those same
transactions eventually have to commit as well and if they want that
level of protection they can block waiting until they're replicated as
well which will imply that anything they depended on will be
replicated.

This is akin to the synchronous_commit=off case where other
transactions can see your data as soon as you commit even before the
xlog is fsynced. If you have synchronous_commit mode enabled then
you'll block until your xlog is fsynced and that will implicitly mean
the other transactions you saw were also fsynced.
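
A small sketch of why waiting on your own commit covers everything you depended on. It assumes WAL positions can be treated as increasing integers and that the standby's acknowledgement is cumulative (everything up to a position); the function names and numbers are made up for illustration.

```python
def commit_returns(commit_lsn: int, standby_ack_lsn: int) -> bool:
    """COMMIT returns once the standby has acknowledged WAL up to our record."""
    return standby_ack_lsn >= commit_lsn


def replicated_when_returning(own_commit_lsn: int, seen_commit_lsns):
    """Commits we saw that must be replicated by the time our COMMIT returns."""
    return [lsn for lsn in seen_commit_lsns if lsn <= own_commit_lsn]


# Commits we could see were already in the WAL, so they precede ours.
seen = [1000, 1200]
ours = 1500

print(commit_returns(ours, standby_ack_lsn=1499))  # False
print(commit_returns(ours, standby_ack_lsn=1500))  # True
print(replicated_when_returning(ours, seen))       # [1000, 1200]
```

Because the stream is ordered, an ACK for position 1500 necessarily covers 1000 and 1200 as well.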

--
greg


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 19:17:56
Message-ID: 4D83AFE4.8020207@bluegap.ch
Lists: pgsql-committers pgsql-hackers

On 03/18/2011 06:35 PM, Greg Stark wrote:
> I think promising that the COMMIT doesn't return until the transaction
> and all previous transactions are replicated is enough. We don't have
> to promise that nobody else will see it either. Those same
> transactions eventually have to commit as well

No, they don't have to. They can ROLLBACK, get aborted, lose connection
to the master, etc. The issue here is that, given the MySQL scheme,
these transactions see a snapshot that's not durable, because at that
point in time, no standby guarantees to have stored the transaction to
be committed, yet. So in case of a failover, you'd suddenly see a
different snapshot (and lose changes of that transaction).
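
To illustrate the difference, here is a sketch comparing the two visibility rules. It assumes failover goes to the standby that sent the ACKs; the function names and the transaction numbers are hypothetical.

```python
def visible_on_master(committed, acked, hide_until_ack):
    """Transactions that other sessions on the master can see."""
    return committed & acked if hide_until_ack else set(committed)


def lost_after_failover(seen_by_reader, acked):
    """Transactions a reader saw that the promoted standby never stored."""
    return seen_by_reader - acked


committed = {1, 2, 3}  # commit records flushed on the master
acked = {1, 2}         # the standby has acknowledged only these

visible_before_ack = visible_on_master(committed, acked, hide_until_ack=False)
hidden_until_ack = visible_on_master(committed, acked, hide_until_ack=True)

print(lost_after_failover(visible_before_ack, acked))  # {3}
print(lost_after_failover(hidden_until_ack, acked))    # set()
```

A reader that never commits anything itself still observed transaction 3 under the first rule, and that observation is contradicted after failover.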

> This is akin to the synchronous_commit=off case where other
> transactions can see your data as soon as you commit even before the
> xlog is fsynced. If you have synchronous_commit mode enabled then
> you'll block until your xlog is fsynced and that will implicitly mean
> the other transactions you saw were also fsynced.

Somewhat, yes. And for exactly that reason, most users run with
synchronous_commit enabled. They don't want to lose committed transactions.

Regards

Markus Wanner


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 19:19:06
Message-ID: 4D83B02A.5090105@bluegap.ch
Lists: pgsql-committers pgsql-hackers

Simon,

On 03/18/2011 05:19 PM, Simon Riggs wrote:
>>> Simon Riggs<simon(at)2ndQuadrant(dot)com> wrote:
>>>> In PostgreSQL other users cannot observe the commit until an
>>>> acknowledgement has been received.

On other nodes as well? To me that means the standby needs to hold back
COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
that it committed the transaction there. How else could the slave know
when to commit its ACKed transactions?

> No, only in the case where you choose not to failover to the standby
> when you crash, which would be a fairly strange choice after the effort
> to set up the standby. In a correctly configured and operated cluster
> what I say above is fully correct and needs no addendum.

If you don't failover, how can the standby be ahead of the master, given
it takes measures not to be during normal operation?

Eager to understand... ;-)

Regards

Markus


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 19:22:20
Message-ID: 4D83B0EC.1010105@bluegap.ch
Lists: pgsql-committers pgsql-hackers

On 03/18/2011 05:27 PM, Kevin Grittner wrote:
> Basically, what Heikki addresses. It has to be committed after
> crash and recovery, and deal with replicas which may or may not have
> been notified and may or may not have applied the transaction.

Huh? I'm not quite following here. Committing additional transactions
isn't a problem, reverting committed transactions is.

And yes, given that we only wait for ACK from a single standby, you'd
have to failover to exactly *that* standby to guarantee consistency.

> In fact, on further reflection, allowing other transactions to see
> work before the committing transaction returns could lead to broken
> behavior if that viewing transaction took some action based on
> that, the master crashed, recovery was done using a standby, and
> that standby hadn't persisted the transaction. So this behavior is
> necessary for good behavior.

I fully agree to that.

Regards

Markus


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 19:29:50
Message-ID: 1300476590.18619.18115.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
> Simon,
>
> On 03/18/2011 05:19 PM, Simon Riggs wrote:
> >>> Simon Riggs<simon(at)2ndQuadrant(dot)com> wrote:
> >>>> In PostgreSQL other users cannot observe the commit until an
> >>>> acknowledgement has been received.
>
> On other nodes as well? To me that means the standby needs to hold back
> COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
> that it committed the transaction there. How else could the slave know
> when to commit its ACKed transactions?

We could do that easily enough, actually, if we wished.

Do we wish?

> > No, only in the case where you choose not to failover to the standby
> > when you crash, which would be a fairly strange choice after the effort
> > to set up the standby. In a correctly configured and operated cluster
> > what I say above is fully correct and needs no addendum.
>
> If you don't failover, how can the standby be ahead of the master, given
> it takes measures not to be during normal operation?
>
> Eager to understand... ;-)
>
> Regards
>
> Markus

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "MARK CALLAGHAN" <mdcallag(at)gmail(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 19:34:18
Message-ID: 4D836D6A020000250003BAE4@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:

>> >>> Simon Riggs<simon(at)2ndQuadrant(dot)com> wrote:
>> >>>> In PostgreSQL other users cannot observe the commit until an
>> >>>> acknowledgement has been received.
>>
>> On other nodes as well? To me that means the standby needs to
>> hold back COMMIT of an ACKed transaction, until it receives a re-ACK
>> from the master, that it committed the transaction there. How
>> else could the slave know when to commit its ACKed transactions?
>
> We could do that easily enough, actually, if we wished.
>
> Do we wish?

+1

If we're going out of our way to suppress it on the master until the
COMMIT returns, it shouldn't be showing on the replicas before that.

-Kevin


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 19:41:00
Message-ID: 4D83B54C.2030708@bluegap.ch
Lists: pgsql-committers pgsql-hackers

On 03/18/2011 08:29 PM, Simon Riggs wrote:
> We could do that easily enough, actually, if we wished.
>
> Do we wish?

I personally don't see any problem letting a standby show a snapshot
before the master. I'd consider it unneeded network traffic. But then
again, I'm completely biased.

Regards

Markus Wanner


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 21:08:11
Message-ID: AANLkTimdn_yVBwsNXn9bnOvDmhxHKyLNBqJYab_Oji4z@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> On 03/18/2011 08:29 PM, Simon Riggs wrote:
>> We could do that easily enough, actually, if we wished.
>>
>> Do we wish?
>
> I personally don't see any problem letting a standby show a snapshot
> before the master.  I'd consider it unneeded network traffic.  But then
> again, I'm completely biased.

In fact, we *need* to have standbys show a snapshot before the master.

By the time the master acks the commit to the client, the snapshot
must be visible to all clients connected to both the master and the
synchronous slave.

Even with just a single-server PostgreSQL cluster, other
clients (backends) can see the commit before the committing client
receives the ACK. It's just that on a single server, the time period
for that is small.

Sync rep increases that time period by the length of time from when
the slave reaches the commit point in the WAL stream to when its ACK
of that point gets back to the WAL sender. Ideally, that ACK time is
small.

Adding another round trip in there just for a "go almost to $COMMIT,
ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for
performance, and still not improve the *guarantees* it can make.

It can only slightly reduce, but not eliminate, that window where the
master has WAL that the slave doesn't, and without a complete
elimination (where you just switch the problem to be that the slave has
data that the master doesn't), you haven't changed any of the
guarantees sync rep can make (or not).

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 21:18:28
Message-ID: AANLkTinaE8pKZ5RyVShF8ZwDF2+GGQ1bZ_Y-jfO2rLAK@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 3:29 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
>> Simon,
>>
>> On 03/18/2011 05:19 PM, Simon Riggs wrote:
>> >>> Simon Riggs<simon(at)2ndQuadrant(dot)com>  wrote:
>> >>>> In PostgreSQL other users cannot observe the commit until an
>> >>>> acknowledgement has been received.
>>
>> On other nodes as well?  To me that means the standby needs to hold back
>> COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
>> that it committed the transaction there.  How else could the slave know
>> when to commit its ACKed transactions?
>
> We could do that easily enough, actually, if we wished.
>
> Do we wish?

Seems like it would be nice, but isn't it dreadfully expensive?
Wouldn't you need to prevent the slave from applying the WAL until the
master has released the sync rep waiters? You'd need a whole new
series of messages back and forth.

Since the current solution is intended to support data-loss-free
failover, but NOT to guarantee a consistent view of the world from a
SQL level, I doubt it's worth paying any price for this. Certainly in
the hot_standby=off case it's a nonissue. We might need to think
harder about it when and if someone implements an 'apply' level though,
because this would seem more of a concern in that case (though I
haven't thought through all the details).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Markus Wanner" <markus(at)bluegap(dot)ch>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "MARK CALLAGHAN" <mdcallag(at)gmail(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 21:24:03
Message-ID: 4D838723020000250003BAFB@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Since the current solution is intended to support data-loss-free
> failover, but NOT to guarantee a consistent view of the world from
> a SQL level, I doubt it's worth paying any price for this.

Well, that brings us back to the question of why we would want to
suppress the view of the data on the master until the replica
acknowledges the commit. It *is* committed on the master, we're
just holding off on telling the committer about it until we can
honor the guarantee of replication. If it can be seen on the
replica before the committer gets such acknowledgment, why not on the
master?

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 21:26:00
Message-ID: 1300483560.18619.19276.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Fri, 2011-03-18 at 17:08 -0400, Aidan Van Dyk wrote:
> On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> > On 03/18/2011 08:29 PM, Simon Riggs wrote:
> >> We could do that easily enough, actually, if we wished.
> >>
> >> Do we wish?
> >
> > I personally don't see any problem letting a standby show a snapshot
> > before the master. I'd consider it unneeded network traffic. But then
> > again, I'm completely biased.
>
> In fact, we *need* to have standbys show a snapshot before the master.
>
> By the time the master acks the commit to the client, the snapshot
> must be visible to all clients connected to both the master and the
> synchronous slave.
>
> Even with just a single-server PostgreSQL cluster, other
> clients (backends) can see the commit before the committing client
> receives the ACK. It's just that on a single server, the time period for
> that is small.
>
> Sync rep increases that time period by the length of time from when
> the slave reaches the commit point in the WAL stream to when its ack
> of that point gets back to the WAL sender. Ideally, that ACK time is
> small.
>
> Adding another round trip in there just for a "go almost to $COMMIT,
> ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for
> performance, and still not improve the *guarantees* it can make.
>
> It can only slightly reduce, but not eliminate, the window where the
> master has WAL that the slave doesn't, and without a complete
> elimination (where you just switch the problem to be that the slave
> has data that the master doesn't), you haven't changed any of the
> guarantees sync rep can make (or not).

Well explained observation. Agreed.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 21:30:04
Message-ID: 1300483804.18619.19321.camel@ebony
Lists: pgsql-committers pgsql-hackers

On Fri, 2011-03-18 at 16:24 -0500, Kevin Grittner wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> > Since the current solution is intended to support data-loss-free
> > failover, but NOT to guarantee a consistent view of the world from
> > a SQL level, I doubt it's worth paying any price for this.
>
> Well, that brings us back to the question of why we would want to
> suppress the view of the data on the master until the replica
> acknowledges the commit. It *is* committed on the master, we're
> just holding off on telling the committer about it until we can
> honor the guarantee of replication. If it can be seen on the
> > replica before the committer gets such acknowledgment, why not on the
> master?

I think the issue is explicit acknowledgement, not visibility.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 21:43:32
Message-ID: AANLkTin3y0VzPADjNmXYWF5t_ou7ywqHARJhaxbQ=sTR@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 5:24 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> Since the current solution is intended to support data-loss-free
>> failover, but NOT to guarantee a consistent view of the world from
>> a SQL level, I doubt it's worth paying any price for this.
>
> Well, that brings us back to the question of why we would want to
> suppress the view of the data on the master until the replica
> acknowledges the commit.  It *is* committed on the master, we're
> just holding off on telling the committer about it until we can
> honor the guarantee of replication.  If it can be seen on the
> replica before the committer gets such acknowledgment, why not on the
> master?

Well, the idea is that we don't want to let people depend on the value
until it's guaranteed to be durably committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Markus Wanner" <markus(at)bluegap(dot)ch>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "MARK CALLAGHAN" <mdcallag(at)gmail(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 21:48:33
Message-ID: 4D838CE1020000250003BB06@gw.wicourts.gov
Lists: pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Well, the idea is that we don't want to let people depend on the
> value until it's guaranteed to be durably committed.

OK, so if you see it on the replica, you know it is in at least two
places. I guess that makes sense. It kinda "feels" wrong to see a
view of the replica which is ahead of the master, but I guess it's
the least of the evils. I guess we should document it, though, so
nobody has a false expectation that seeing something on the replica
means that a connection looking at the master will see something
that current.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-18 22:47:54
Message-ID: AANLkTik6tVXXsY79SepR-rP1M9sN=_-RPwo7W+cfMbAj@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 5:48 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Well, the idea is that we don't want to let people depend on the
>> value until it's guaranteed to be durably committed.
>
> OK, so if you see it on the replica, you know it is in at least two
> places.  I guess that makes sense.  It kinda "feels" wrong to see a
> view of the replica which is ahead of the master, but I guess it's
> the least of the evils.  I guess we should document it, though, so
> nobody has a false expectation that seeing something on the replica
> means that a connection looking at the master will see something
> that current.

Yeah, it can go both ways: a snapshot taken on the standby can be
either earlier or later in the commit ordering than the master.
That's counterintuitive, but I see no reason to stress about it. It's
perfectly reasonable to set up a server with synchronous replication
for enhanced durability and also enable hot standby just for
convenience, but without actually relying on it all that heavily, or
only for non-critical reporting purposes. Synchronous replication,
like asynchronous replication, is basically a high-availability tool.
As long as it does that well, I'm not going to get worked up about the
fact that it doesn't address every other use case someone might want.
We can always add more frammishes in future releases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-19 01:12:52
Message-ID: 27265.1300497172@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> As a side note, it's not very obvious why some parts of PostmasterMain
> report problems by doing write_stderr() and exit() while other parts
> use ereport(ERROR). This check and the nearby checks on WAL level are
> immediately preceded and followed by other checks that use the
> opposite technique.

This question is answered in postmaster.c's header comment:

* Error Reporting:
* Use write_stderr() only for reporting "interactive" errors
* (essentially, bogus arguments on the command line). Once the
* postmaster is launched, use ereport(). In particular, don't use
* write_stderr() for anything that occurs after pmdaemonize.

Code that is involved in GUC variable processing is in a gray area, though,
since it can be invoked both before and after pmdaemonize. It might be
a good idea to convert all the calls into ereports and maintain a state
flag in elog.c to determine what to do.
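
To make the state-flag idea concrete, here is a minimal sketch. It is not
taken from elog.c; the names daemonized, mark_daemonized() and
report_problem() are hypothetical, and the "LOG:" line merely stands in for
an ereport() call. The only point is that callers use one entry point and
the flag picks the destination.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch; none of these names exist in PostgreSQL's elog.c. */
typedef enum ReportDest
{
    REPORT_TO_STDERR,   /* interactive phase: behave like write_stderr() */
    REPORT_TO_LOG       /* after daemonizing: behave like ereport() */
} ReportDest;

/* State flag: flipped once, at the point where pmdaemonize would run. */
static bool daemonized = false;

static void
mark_daemonized(void)
{
    daemonized = true;
}

/*
 * Single entry point for code that can run both before and after
 * daemonizing, such as GUC processing. Returns the destination used
 * so that callers and tests can observe the decision.
 */
static ReportDest
report_problem(const char *msg)
{
    assert(msg != NULL);

    if (!daemonized)
    {
        fprintf(stderr, "%s\n", msg);
        return REPORT_TO_STDERR;
    }

    /* Stand-in for the server log; real code would call ereport() here. */
    fprintf(stdout, "LOG: %s\n", msg);
    return REPORT_TO_LOG;
}
```

Under this scheme GUC check code would never need to know which phase it is
running in.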

regards, tom lane


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-19 19:27:04
Message-ID: 4D850388.5000401@bluegap.ch
Lists: pgsql-committers pgsql-hackers

On 03/18/2011 10:48 PM, Kevin Grittner wrote:
> the least of the evils. I guess we should document it, though, so
> nobody has a false expectation that seeing something on the replica
> means that a connection looking at the master will see something
> that current.

Agreed. Note, however, that even if there's no such guarantee, it's
highly unlikely for a user (or application) to ever notice this during
normal operation.

Regards

Markus Wanner


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-22 20:33:11
Message-ID: AANLkTi=bD6iWZNggors9ooG-ZjXwxy6b=DPMWf160WNO@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 5:08 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
>> On 03/18/2011 08:29 PM, Simon Riggs wrote:
>>> We could do that easily enough, actually, if we wished.
>>>
>>> Do we wish?
>>
>> I personally don't see any problem letting a standby show a snapshot
>> before the master.  I'd consider it unneeded network traffic.  But then
>> again, I'm completely biased.
>
> In fact, we *need* to have standbys show a snapshot before the master.
>
> By the time the master acks the commit to the client, the snapshot
> must be visible to all clients connected to both the master and the
> synchronous slave.

We might have a version of synchronous replication that works this way
some day, but it's not the version we're shipping with 9.1. The slave
acknowledges the WAL records when they hit the disk (i.e. fsync) not
when they are applied; WAL apply can lag arbitrarily. The point is to
guarantee clients that the WAL is on disk somewhere and that it will
be replayed in the event of a failover. Despite the fact that this
doesn't work as you're describing, it's a useful feature in its own
right.
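
A small model of the distinction, as a sketch only: the type and function
names below are hypothetical and are not PostgreSQL source. It assumes the
standby tracks two WAL positions, one for what has been fsynced and one for
what has been replayed, and that only the first one feeds the ACK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model for illustration; these names are not PostgreSQL source. */
typedef uint64_t Lsn;               /* a position in the WAL stream */

typedef struct StandbyState
{
    Lsn flushed;    /* WAL fsynced on the standby; this is what the ACK reports */
    Lsn applied;    /* WAL replayed on the standby; hot standby queries see this */
} StandbyState;

/* Replay can only cover WAL that is already on the standby's disk. */
static void
standby_replay_to(StandbyState *s, Lsn target)
{
    assert(target <= s->flushed);
    if (target > s->applied)
        s->applied = target;
}

/* The master may release a waiting committer once this returns true. */
static bool
commit_is_acked(const StandbyState *s, Lsn commit_lsn)
{
    return s->flushed >= commit_lsn;
}

/* A query on the standby sees the commit only once replay has passed it. */
static bool
commit_is_visible_on_standby(const StandbyState *s, Lsn commit_lsn)
{
    return s->applied >= commit_lsn;
}
```

With flushed = 200 and applied = 100, a commit at position 150 is
acknowledged but not yet visible on the standby, which is exactly the state
described above.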

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-23 07:27:13
Message-ID: 4D89A0D1.4030601@bluegap.ch
Lists: pgsql-committers pgsql-hackers

On 03/22/2011 09:33 PM, Robert Haas wrote:
> We might have a version of synchronous replication that works this way
> some day, but it's not the version we're shipping with 9.1. The slave
> acknowledges the WAL records when they hit the disk (i.e. fsync) not
> when they are applied; WAL apply can lag arbitrarily. The point is to
> guarantee clients that the WAL is on disk somewhere and that it will
> be replayed in the event of a failover. Despite the fact that this
> doesn't work as you're describing, it's a useful feature in its own
> right.

In that sense, our approach may be more synchronous than most others,
because after the ACK is sent from the slave, the slave still needs to
apply the transaction data from WAL before it gets visible, while the
master needs to wait for the ACK to arrive at its side, before making it
visible there.

Ideally, these two latencies (disk seek and network induced) are just
about equal. But of course, there's no such guarantee. So whenever one
of the two is off by an order of magnitude or two (by use case or due to
a temporary overload), either the master or the slave may lag behind the
other machine.

What pleases me is that the guarantee from the slave is somewhat similar
to Postgres-R's: with its ACK, the receiving node doesn't guarantee the
transaction *is* applied locally, it just guarantees that it *will* be
able to do so sometime in the future. Kind of a mind twister, though...

Regards

Markus


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-23 11:52:22
Message-ID: AANLkTik0WoT+qwh09EUetgFVZy8zpm1Sc3hspa4ddJjs@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Wed, Mar 23, 2011 at 3:27 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> On 03/22/2011 09:33 PM, Robert Haas wrote:
>> We might have a version of synchronous replication that works this way
>> some day, but it's not the version we're shipping with 9.1.  The slave
>> acknowledges the WAL records when they hit the disk (i.e. fsync) not
>> when they are applied; WAL apply can lag arbitrarily.  The point is to
>> guarantee clients that the WAL is on disk somewhere and that it will
>> be replayed in the event of a failover.  Despite the fact that this
>> doesn't work as you're describing, it's a useful feature in its own
>> right.
>
> In that sense, our approach may be more synchronous than most others,
> because after the ACK is sent from the slave, the slave still needs to
> apply the transaction data from WAL before it gets visible, while the
> master needs to wait for the ACK to arrive at its side, before making it
> visible there.
>
> Ideally, these two latencies (disk seek and network induced) are just
> about equal.  But of course, there's no such guarantee.  So whenever one
> of the two is off by an order of magnitude or two (by use case or due to
> a temporary overload), either the master or the slave may lag behind the
> other machine.
>
> What pleases me is that the guarantee from the slave is somewhat similar
> to Postgres-R's: with its ACK, the receiving node doesn't guarantee the
> transaction *is* applied locally, it just guarantees that it *will* be
> able to do so sometime in the future.  Kind of a mind twister, though...

Yes. What this won't do is let you build a big load-balancing network
(at least not without great caution about what you assume). What it
will do is make it really, really hard to lose committed transactions.
Both good things, but different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-23 12:16:39
Message-ID: 4D89E4A7.4090900@bluegap.ch
Lists: pgsql-committers pgsql-hackers

On 03/23/2011 12:52 PM, Robert Haas wrote:
> Yes. What this won't do is let you build a big load-balancing network
> (at least not without great caution about what you assume).

This sounds too strong to me. Session-aware load balancing is pretty
common these days. It's the default mode of PgBouncer, for example.
Not much caution required there, IMO. Or what pitfalls did you have in
mind?

> What it
> will do is make it really, really hard to lose committed transactions.
> Both good things, but different.

..you can still get both at the same time. At least as long as you are
happy with session-aware load balancing. And who really needs finer
grained balancing?

(Note that no matter how fine-grained you balance, you are still bound
to a (single core of a) single node. That changes with distributed
querying, and things really start to get interesting there... but we are
far from that, yet).

Regards

Markus


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-23 15:24:30
Message-ID: AANLkTiki6M8=CmQc-5aPF0HTN9CRY2vuTLuPJusXJjK5@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Wed, Mar 23, 2011 at 8:16 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> On 03/23/2011 12:52 PM, Robert Haas wrote:
>> Yes.  What this won't do is let you build a big load-balancing network
>> (at least not without great caution about what you assume).
>
> This sounds too strong to me.  Session-aware load balancing is pretty
> common these days.  It's the default mode of PgBouncer, for example.
> Not much caution required there, IMO.  Or what pitfalls did you have in
> mind?

Well, just the one we were talking about: a COMMIT on one node doesn't
guarantee that the transaction is visible on the other node, just
that it will become visible there eventually, even if a crash happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-23 15:53:30
Message-ID: AANLkTinvc5-B16cVtw2E5of5n7JY_UOiUJYx4U8GSPuB@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Fri, Mar 18, 2011 at 10:10 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Mar 7, 2011 at 3:44 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Mon, Mar 7, 2011 at 5:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>> Efficient transaction-controlled synchronous replication.
>>>> If a standby is broadcasting reply messages and we have named
>>>> one or more standbys in synchronous_standby_names then allow
>>>> users who set synchronous_replication to wait for commit, which
>>>> then provides strict data integrity guarantees. Design avoids
>>>> sending and receiving transaction state information so minimises
>>>> bookkeeping overheads. We synchronize with the highest priority
>>>> standby that is connected and ready to synchronize. Other standbys
>>>> can be defined to takeover in case of standby failure.
>>>>
>>>> This version has very strict behaviour; more relaxed options
>>>> may be added at a later date.
>>>
>>> Pretty cool! I'd appreciate very much your efforts and contributions.
>>>
>>> And, I found one bug ;) You seem to have wrongly removed the check
>>> of max_wal_senders in SyncRepWaitForLSN. This can make the
>>> backend wait for replication even if max_wal_senders = 0. I could produce
>>> this problematic situation in my machine. The attached patch fixes this problem.
>>
>>        if (strlen(SyncRepStandbyNames) > 0 && max_wal_senders == 0)
>>                ereport(ERROR,
>>                                (errmsg("Synchronous replication requires WAL streaming (max_wal_senders > 0)")));
>>
>> The above check should be required also after pg_ctl reload since
>> synchronous_standby_names can be changed by SIGHUP?
>> Or how about just removing that? If the patch I submitted is
>> committed, empty synchronous_standby_names and max_wal_senders = 0
>> settings is no longer unsafe.
>
> This configuration is now harmless in the sense that it no longer
> horribly breaks the entire system, but it's still pretty useless, so
> this might be deemed a valuable sanity check.  However, I'm reluctant
> to leave it in there, because someone could change their config to
> this state, pg_ctl reload, see everything working, and then later stop
> the cluster and be unable to start it back up again.  Since most
> people don't shut their database systems down very often, they might
> not discover that they have an invalid config until much later.  I
> think it's probably not a good idea to have configs that are valid on
> reload but prevent startup, so I'm inclined to either remove this
> check altogether or downgrade it to a warning.

Done.
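
For illustration, the warning variant of the two options could look roughly
like the sketch below. This message does not say which option was actually
committed, so this is not a description of that change. ConfigCheck and
check_sync_rep_config() are hypothetical names; the condition is the one
from the quoted check.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch; not the committed PostgreSQL code. */
typedef enum ConfigCheck
{
    CONFIG_OK,
    CONFIG_WARNING      /* report it, but keep starting up or reloading */
} ConfigCheck;

/*
 * Same condition as the quoted check. There is deliberately no error
 * result, so a configuration accepted on reload can never block a
 * later startup.
 */
static ConfigCheck
check_sync_rep_config(const char *standby_names, int max_wal_senders)
{
    assert(max_wal_senders >= 0);

    if (standby_names != NULL &&
        strlen(standby_names) > 0 &&
        max_wal_senders == 0)
        return CONFIG_WARNING;

    return CONFIG_OK;
}
```

Because the same function would run at startup and on SIGHUP, both paths
give the same verdict for the same settings.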

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-25 11:53:15
Message-ID: AANLkTi=29RMsJpnkTgG8p=LzaS25d0KrRcGWSThctLUk@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Sat, Mar 19, 2011 at 12:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 18, 2011 at 10:55 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> What makes more sense to me after having thought about this more
>>> carefully is to simply make a blanket rule that when
>>> synchronous_replication=on, synchronous_commit has no effect.  That is
>>> easy to understand and document.
>>
>> For what it's worth "has no effect" doesn't make much sense to me.
>> It's a boolean, either commits are going to block or they're not.
>>
>> What happened to the idea of a three-way switch?
>>
>> synchronous_commit = off
>> synchronous_commit = disk
>> synchronous_commit = replica
>>
>> With "on" being a synonym for "disk" for backwards compatibility.
>>
>> Then we could add more options later for more complex conditions like
>> waiting for one server in each data centre or waiting for one of a
>> certain set of servers ignoring the less reliable mirrors, etc.
>
> This is similar to what I suggested upthread, except that I suggested
> on/local/off, with the default being on.  That way if you set
> synchronous_standby_names, you get synchronous replication without
> changing another setting, but you can say local instead if for some
> reason you want the middle behavior.  If we're going to do it all with
> one GUC, I think that way makes more sense.  If you're running sync
> rep, you might still have some transactions that you don't care about,
> but that's what async commit is for.  It's a funny kind of transaction
> that we're OK with losing if we have a failover but we're not OK with
> losing if we have a local crash from which we recover without failing
> over.

I'm OK with this.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
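
For readers following the thread, the on/local/off proposal quoted above can be summarised as a small decision table. The following is only a sketch in Python of the semantics as described in the message, not PostgreSQL source code; the names SyncCommit and commit_waits are hypothetical and were chosen for illustration.

```python
from enum import Enum


class SyncCommit(Enum):
    # Values proposed in the thread; "on" would be the default.
    OFF = "off"      # asynchronous commit: do not wait for the local WAL flush
    LOCAL = "local"  # wait for the local WAL flush only
    ON = "on"        # wait for the local flush, plus a standby if one is named


def commit_waits(sync_commit, synchronous_standby_names):
    """Return (wait_for_local_flush, wait_for_standby) for one commit.

    Hypothetical helper: models the proposal as worded in the thread,
    not the behaviour of any released PostgreSQL version.
    """
    wait_local = sync_commit is not SyncCommit.OFF
    # A standby wait only applies when the setting is "on" AND at least
    # one standby has been named in synchronous_standby_names.
    has_standby = bool(synchronous_standby_names.strip())
    wait_standby = sync_commit is SyncCommit.ON and has_standby
    return wait_local, wait_standby
```

Under this reading, leaving the setting at "on" and naming a standby is enough to get synchronous replication, while "local" gives the middle behaviour for sessions that only want local durability.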


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, MARK CALLAGHAN <mdcallag(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Date: 2011-03-25 12:12:05
Message-ID: AANLkTinC2+OuVQdHWVRfDLOTA7Z1n2JZdbUsrgCmvmJN@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Sat, Mar 19, 2011 at 4:29 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
>> Simon,
>>
>> On 03/18/2011 05:19 PM, Simon Riggs wrote:
>> >>> Simon Riggs<simon(at)2ndQuadrant(dot)com>  wrote:
>> >>>> In PostgreSQL other users cannot observe the commit until an
>> >>>> acknowledgement has been received.
>>
>> On other nodes as well?  To me that means the standby needs to hold back
>> COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
>> that it committed the transaction there.  How else could the slave know
>> when to commit its ACKed transactions?
>
> We could do that easily enough, actually, if we wished.
>
> Do we wish?

No.

I'm not sure what the problem is with the standby seeing data that is not yet
visible on the master. And I'm really not sure whether that problem can be
solved by making the data visible on the master before the standby. If we
really want to see consistent data from each node, we should implement and
use a cluster-wide snapshot, as Postgres-XC does.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
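
The ordering under discussion can be made concrete with a toy timeline. This is a simplified sketch in Python, not a model of the actual walsender/walreceiver code: it assumes a single standby that makes the commit visible to its own queries before the master has processed the acknowledgement, and the function name simulate_sync_commit is hypothetical.

```python
def simulate_sync_commit():
    """Toy timeline of one synchronous commit.

    Returns a list of (event, visible_on_master, visible_on_standby).
    Assumption for illustration: the standby replays the commit record
    before the master has received the acknowledgement.
    """
    master_visible = False
    standby_visible = False
    timeline = []

    # 1. Master flushes the commit record locally and ships the WAL.
    #    The committing backend now waits; nobody can see the commit yet.
    timeline.append(("master flushed, WAL sent",
                     master_visible, standby_visible))

    # 2. Standby receives the record, replays it and sends an ACK.
    #    Queries on the standby can now see the transaction.
    standby_visible = True
    timeline.append(("standby replayed, ACK sent",
                     master_visible, standby_visible))

    # 3. Master receives the ACK and releases the waiting backend;
    #    only now do other sessions on the master see the commit.
    master_visible = True
    timeline.append(("master received ACK",
                     master_visible, standby_visible))
    return timeline
```

Between steps 2 and 3 the standby shows data that the master does not. Holding back the standby until a re-ACK arrives would only move that window to the other node, which is why a cluster-wide snapshot is suggested above as the way to get a consistent view from every node.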