Re: Synchronous replication

Lists:	pgsql-hackers

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Synchronous replication
Date:	2010-07-14 06:50:13
Message-ID:	AANLkTilgyL3Y1jkDVHX02433COq7JLmqicsqmOsbuyA1@mail.gmail.com

Hi,

The attached patch provides the core of a synchronous replication
feature based on streaming replication. I have added this patch to
CF 2010-07.

The code is also available in my git repository:
git://git.postgresql.org/git/users/fujii/postgres.git
branch: synchrep

Synchronization levels
----------------------
The patch provides a replication_mode parameter in recovery.conf. It
specifies the replication mode, which controls how long transaction
commit on the master server waits for replication before the command
returns a "success" indication to the client. Valid modes are:

1. async
doesn't make transaction commit wait for replication, i.e.,
asynchronous replication. This mode is already supported in
9.0.

2. recv
makes transaction commit wait until the standby has received WAL
records.

3. fsync
makes transaction commit wait until the standby has received WAL
records and flushed them to disk.

4. replay
makes transaction commit wait until the standby has received WAL
records, flushed them to disk, and replayed them.

You can choose the synchronization level per standby.

Quorum commit
-------------
In previous discussions about synchronous replication, some people
wanted a quorum commit feature. This feature is also included in
Zoltan's synchronous replication patch, so I decided to implement it.

The patch provides a quorum parameter in postgresql.conf, which
specifies the number of standby servers to which WAL records must be
replicated before transaction commit returns a "success" indication to
the client. The default value is zero, which means transaction commit
never waits for replication, regardless of replication_mode. Likewise,
transaction commit never waits for replication to an asynchronous
standby (i.e., one whose replication_mode is set to async), regardless
of this parameter. If quorum is greater than the number of synchronous
standbys, transaction commit returns "success" once the ACK has
arrived from all of the synchronous standbys.

Currently the quorum parameter is defined as PGC_USERSET, so you can
have some transactions replicate synchronously and others
asynchronously.
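
As an illustration only (this is not code from the patch), the quorum
rule described above could be sketched in Python as follows. The
function name and the list-of-modes representation are hypothetical,
and quorum is assumed to be non-negative.

```python
def standbys_to_wait_for(quorum, standby_modes):
    """Return how many standby ACKs a commit waits for.

    quorum        -- value of the quorum parameter (assumed >= 0 here)
    standby_modes -- replication_mode of each connected standby
    """
    # Asynchronous standbys never count, regardless of quorum.
    sync_standbys = sum(1 for mode in standby_modes if mode != "async")
    if quorum == 0:
        # Default: never wait, regardless of replication_mode.
        return 0
    # If quorum exceeds the number of synchronous standbys,
    # waiting for all of them is enough.
    return min(quorum, sync_standbys)
```

For example, with standbys in modes recv, fsync and async and quorum set
to 3, this sketch waits for 2 ACKs.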

Protocol
--------
I extended the handshake message "START_REPLICATION" so that it
includes the replication_mode read from recovery.conf. If 'async' is
passed, the master knows that it doesn't need to wait for the ACK
from the standby.

I added an XLogRecPtr message, which walreceiver uses to send walsender
the ACK meaning that replication has completed. If
replication_mode = 'async', this message is never sent. The XLogRecPtr
message always includes the current receive location if the mode is
'recv', the current flush location if the mode is 'fsync', and the
current replay location if the mode is 'replay'.

Then, if the location in the ACK is greater than or equal to the
location of the COMMIT record, the transaction breaks out of the wait
loop and returns "success" to the client.
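
A minimal Python sketch of the ACK rule above, for illustration only.
The class and function names are hypothetical, and a WAL location is
modelled as an ordered (xlogid, xrecoff) pair, which is an assumption
about how locations compare rather than code from the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class WalLocation:
    """A WAL position; ordering compares xlogid first, then xrecoff."""
    xlogid: int
    xrecoff: int


@dataclass
class StandbyProgress:
    received: WalLocation
    flushed: WalLocation
    replayed: WalLocation


def ack_location(mode, progress):
    """Location the standby reports for its mode; None means no ACK is sent."""
    if mode == "async":
        return None
    if mode == "recv":
        return progress.received
    if mode == "fsync":
        return progress.flushed
    if mode == "replay":
        return progress.replayed
    raise ValueError("unknown replication_mode: %r" % (mode,))


def commit_is_acknowledged(ack, commit_location):
    """True once the ACKed location has reached the COMMIT record."""
    return ack is not None and ack >= commit_location
```

With a standby that has received more WAL than it has flushed, the same
COMMIT record can already be acknowledged in 'recv' mode while a backend
waiting in 'fsync' mode keeps waiting.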

TODO
----
The patch has no features for improving the performance of synchronous
replication. I admit that currently the performance overhead on the
master is terrible. We need to address the following TODO items in a
subsequent CF.

* Change the poll loop in the walsender
* Change the poll loop in the backend
* Change the poll loop in the startup process
* Change the poll loop in the walreceiver
* Perform the WAL write and replication concurrently
* Send WAL from not only disk but also WAL buffers

For the case where a network outage happens or the standby fails, we
should expose the maximum time to wait for replication as a parameter.
Furthermore, you might want to specify the reaction to the timeout. These
are also not in the patch, so we need to address them in a subsequent
CF, too.

In synchronous replication, it's important to check whether the standby
is in sync with the master. But such a monitoring feature is also not
in the patch. That's a TODO.

It would be difficult to commit the whole synchronous replication
feature at one time. I'm planning to develop it in stages.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment	Content-Type	Size
synch_rep_0714.patch	application/octet-stream	50.8 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-14 15:16:01
Message-ID:	AANLkTikBAP6_Ky7PWP0iRu9DIKkJCK9LuGHrDIHxihvz@mail.gmail.com

On Wed, Jul 14, 2010 at 2:50 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> The patch have no features for performance improvement of synchronous
> replication. I admit that currently the performance overhead in the
> master is terrible. We need to address the following TODO items in the
> subsequent CF.
>
> * Change the poll loop in the walsender
> * Change the poll loop in the backend
> * Change the poll loop in the startup process
> * Change the poll loop in the walreceiver
> * Perform the WAL write and replication concurrently
> * Send WAL from not only disk but also WAL buffers

I have a feeling that if we don't have a design for these last two
before we start committing things, we're possibly going to regret it
later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-16 07:40:23
Message-ID:	AANLkTikCdC2IJeh5fGHYvhmAcLOfTF31GfQRjSMBlaVl@mail.gmail.com

On Thu, Jul 15, 2010 at 12:16 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jul 14, 2010 at 2:50 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> The patch have no features for performance improvement of synchronous
>> replication. I admit that currently the performance overhead in the
>> master is terrible. We need to address the following TODO items in the
>> subsequent CF.
>>
>> * Change the poll loop in the walsender
>> * Change the poll loop in the backend
>> * Change the poll loop in the startup process
>> * Change the poll loop in the walreceiver
>> * Perform the WAL write and replication concurrently
>> * Send WAL from not only disk but also WAL buffers
>
> I have a feeling that if we don't have a design for these last two
> before we start committing things, we're possibly going to regret it
> later.

Yeah, I'll give it a try.

The problem is that the standby could apply WAL that has not yet been
fsync'd on the master. So if we allow walsender to send non-fsync'd WAL,
we should make walsender also send the current fsync location, and
prevent the standby from applying WAL newer than that location.

A new message type for sending the fsync location would be required in
the Streaming Replication Protocol, though sometimes the location might
be sent along with an XLogData message.
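
To make the proposed rule concrete, here is a small Python sketch. It is
only an illustration of the idea, not an implementation: the function
name is hypothetical and WAL locations are simplified to plain integers.

```python
def replay_limit(received_upto, master_fsync_upto):
    """Highest WAL location the standby may apply.

    received_upto     -- end of the WAL the standby has received
    master_fsync_upto -- fsync location most recently reported by the master
    """
    # The standby never applies WAL beyond what the master has fsync'd,
    # even if more WAL has already arrived.
    return min(received_upto, master_fsync_upto)
```

If the standby has received WAL up to 5000 but the master has reported
an fsync location of 4200, only WAL up to 4200 is eligible for replay.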

After the master crashes and walreceiver is terminated, the standby
currently attempts to replay the WAL in pg_xlog and the archive.
Since WAL in the archive is guaranteed to have already been fsync'd by
the master, it's not a problem for the standby to apply that WAL. OTOH,
WAL records in the standby's pg_xlog directory might not exist on the
crashed master. So we should always prevent the standby from applying
any WAL in pg_xlog unless walreceiver is running. That is, if there is
no WAL available in the archive, the standby ignores pg_xlog and starts
the walreceiver process to request WAL streaming.

This idea is a little inefficient because already-sent WAL might
be sent again when the master is restarted. But since it ensures
that the standby will not apply WAL that was never fsync'd on the
master, it's quite safe.

What about this idea?

This idea doesn't conflict with the patch I submitted for CF 2010-07.
So please feel free to review the patch :) But if you think that the
patch is not reviewable until that idea has been implemented, I'll
try to implement that ASAP.

PS. I probably won't be able to reply to mail until July 21. Sorry.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-16 10:43:50
Message-ID:	4C4037E6.1000300@enterprisedb.com

On 16/07/10 10:40, Fujii Masao wrote:
> So we should always prevent the standby from applying any WAL in pg_xlog
> unless walreceiver is in progress. That is, if there is no WAL available
> in the archive, the standby ignores pg_xlog and starts walreceiver
> process to request for WAL streaming.

That completely defeats the purpose of storing streamed WAL in pg_xlog
in the first place. The reason it's written and fsync'd to pg_xlog is
that if the standby subsequently crashes, you can use the WAL from
pg_xlog to reapply the WAL up to minRecoveryPoint. Otherwise you can't
start up the standby anymore.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-16 17:26:27
Message-ID:	E3662D1B-98F0-4CA4-A23B-5DE4AF2221E7@hi-media.com

On 16 Jul 2010, at 12:43, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 16/07/10 10:40, Fujii Masao wrote:
>> So we should always prevent the standby from applying any WAL in pg_xlog
>> unless walreceiver is in progress. That is, if there is no WAL available
>> in the archive, the standby ignores pg_xlog and starts walreceiver
>> process to request for WAL streaming.
>
> That completely defeats the purpose of storing streamed WAL in pg_xlog in the first place. The reason it's written and fsync'd to pg_xlog is that if the standby subsequently crashes, you can use the WAL from pg_xlog to reapply the WAL up to minRecoveryPoint. Otherwise you can't start up the standby anymore.

I guess we know for sure that this point has been fsync()ed on the Master, or that we could arrange it so that we know that?

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-16 18:22:22
Message-ID:	4C40A35E.1040502@enterprisedb.com

On 16/07/10 20:26, Dimitri Fontaine wrote:
> On 16 Jul 2010, at 12:43, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> On 16/07/10 10:40, Fujii Masao wrote:
>>> So we should always prevent the standby from applying any WAL in pg_xlog
>>> unless walreceiver is in progress. That is, if there is no WAL available
>>> in the archive, the standby ignores pg_xlog and starts walreceiver
>>> process to request for WAL streaming.
>>
>> That completely defeats the purpose of storing streamed WAL in pg_xlog in the first place. The reason it's written and fsync'd to pg_xlog is that if the standby subsequently crashes, you can use the WAL from pg_xlog to reapply the WAL up to minRecoveryPoint. Otherwise you can't start up the standby anymore.
>
> I guess we know for sure that this point has been fsync()ed on the Master, or that we could arrange it so that we know that?

At the moment we only stream WAL that's already been fsync()ed on the
master, so we don't have this problem, but Fujii is proposing to change
that.

I think that's a premature optimization, and we should not try to change
that. There is no evidence from the field (granted, streaming replication
is a new feature) or from performance tests that it is a problem in
practice, or that sending WAL earlier would help. Let's concentrate on
the bare minimum required to make synchronous replication work.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-16 18:25:21
Message-ID:	4C40A411.7030303@enterprisedb.com

On 14/07/10 09:50, Fujii Masao wrote:
> TODO
> ----
> The patch have no features for performance improvement of synchronous
> replication. I admit that currently the performance overhead in the
> master is terrible. We need to address the following TODO items in the
> subsequent CF.
>
> * Change the poll loop in the walsender
> * Change the poll loop in the backend
> * Change the poll loop in the startup process
> * Change the poll loop in the walreceiver

I was actually hoping to see a patch for these things first, before any
of the synchronous replication stuff. Eliminating the polling loops is
important: latency will be laughable otherwise, and it will help the
asynchronous case too.

> * Perform the WAL write and replication concurrently
> * Send WAL from not only disk but also WAL buffers

IMHO these are premature optimizations that we should not spend any
effort on now. Maybe later, if ever.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-17 18:14:56
Message-ID:	4C41F320.7020200@enterprisedb.com

On 14/07/10 09:50, Fujii Masao wrote:
> Quorum commit
> -------------
> In previous discussion about synchronous replication, some people
> wanted the quorum commit feature. This feature is included in also
> Zontan's synchronous replication patch, so I decided to create it.
>
> The patch provides quorum parameter in postgresql.conf, which
> specifies how many standby servers transaction commit will wait for
> WAL records to be replicated to, before the command returns a
> "success" indication to the client. The default value is zero, which
> always doesn't make transaction commit wait for replication without
> regard to replication_mode. Also transaction commit always doesn't
> wait for replication to asynchronous standby (i.e., replication_mode
> is set to async) without regard to this parameter. If quorum is more
> than the number of synchronous standbys, transaction commit returns
> a "success" when the ACK has arrived from all of synchronous standbys.

There should be a way to specify "wait for *all* connected standby
servers to acknowledge".

> Protocol
> --------
> I extended the handshake message "START_REPLICATION" so that it
> includes replication_mode read from recovery.conf. If 'async' is
> passed, the master thinks that it doesn't need to wait for the ACK
> from the standby.

Please use self-explanatory names for the modes in START_REPLICATION
command, instead of just an integer.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-21 06:52:58
Message-ID:	AANLkTim_ZXAorqwHqyvOuRH5ZP=NPNn02zU2Tw=SFyfr@mail.gmail.com

On Fri, Jul 16, 2010 at 7:43 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 16/07/10 10:40, Fujii Masao wrote:
>>
>> So we should always prevent the standby from applying any WAL in pg_xlog
>> unless walreceiver is in progress. That is, if there is no WAL available
>> in the archive, the standby ignores pg_xlog and starts walreceiver
>> process to request for WAL streaming.
>
> That completely defeats the purpose of storing streamed WAL in pg_xlog in
> the first place. The reason it's written and fsync'd to pg_xlog is that if
> the standby subsequently crashes, you can use the WAL from pg_xlog to
> reapply the WAL up to minRecoveryPoint. Otherwise you can't start up the
> standby anymore.

But the standby can start up by reading the missing WAL files from the
master, no?

On second thought, minRecoveryPoint can be guaranteed to be older
than the fsync location on the master if we prevent the standby from
applying WAL beyond the fsync location. So we can safely apply the WAL
files in pg_xlog up to minRecoveryPoint.

Consequently, we should always prevent the standby from applying any
WAL in pg_xlog newer than minRecoveryPoint unless walreceiver is
running. Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-21 07:36:11
Message-ID:	AANLkTim0KunmNj_zbNg1R8sbBv=GppgF=9fBrMn4hxET@mail.gmail.com

On Sat, Jul 17, 2010 at 3:25 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 14/07/10 09:50, Fujii Masao wrote:
>>
>> TODO
>> ----
>> The patch have no features for performance improvement of synchronous
>> replication. I admit that currently the performance overhead in the
>> master is terrible. We need to address the following TODO items in the
>> subsequent CF.
>>
>> * Change the poll loop in the walsender
>> * Change the poll loop in the backend
>> * Change the poll loop in the startup process
>> * Change the poll loop in the walreceiver
>
> I was actually hoping to see a patch for these things first, before any of
> the synchronous replication stuff. Eliminating the polling loops is
> important, latency will be laughable otherwise, and it will help the
> synchronous case too.

First, note that the poll loops in the backend and walreceiver don't
exist without the synchronous replication patch.

Yeah, I'll start with changing the poll loop in the walsender. I'm
thinking that we should make the backend signal the walsender to send the
outstanding WAL immediately, as the previous synchronous replication patch
I submitted in the past year did. I use a signal here because, in
synchronous replication, walsender needs to wait for the request from the
backend and the ACK message from the standby *concurrently*. If we used a
semaphore instead of a signal, the walsender would not be able to
respond to the ACK immediately, which would also degrade performance.

The problem with this idea is that a signal can be sent on every
transaction commit. I'm not sure whether such frequent signaling really
harms replication performance. BTW, when I benchmarked the previous
synchronous replication patch based on this idea, AFAIR the results showed
no impact from the signaling. But... Thoughts? Do you have a better idea?
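
The requirement of waiting for two event sources concurrently can be
sketched in Python as below. Note that this sketch swaps in a different
technique from the one described above: it uses a socket pair plus
select() as the wakeup channel instead of a Unix signal, purely to show
the shape of the wait. All names are hypothetical and nothing here is
taken from the patch.

```python
import select
import socket


def wait_for_work(wakeup_sock, standby_sock, timeout):
    """Block until a backend wakes us, the standby sends data, or timeout.

    Returns the pending events in a fixed order: "send_wal" first, then "ack".
    """
    readable, _, _ = select.select([wakeup_sock, standby_sock], [], [], timeout)
    events = []
    if wakeup_sock in readable:
        # Drain the wakeup bytes; wakeups from several commits may coalesce.
        wakeup_sock.recv(4096)
        events.append("send_wal")
    if standby_sock in readable:
        # The caller is expected to read and parse the ACK message itself.
        events.append("ack")
    return events
```

A backend would write one byte to the other end of the wakeup socket
pair at commit time. Because both descriptors are watched in a single
call, an ACK arriving from the standby is noticed without waiting for
the next backend wakeup, which is the property a plain semaphore wait
would lack.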

>> * Perform the WAL write and replication concurrently
>> * Send WAL from not only disk but also WAL buffers
>
> IMHO these are premature optimizations that we should not spend any effort
> on now. Maybe later, if ever.

Yep!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-21 07:48:57
Message-ID:	AANLkTikuU2=e1ZXkA=9AuA44WvaHCGuVsHPbS_XRtSJT@mail.gmail.com

On Sun, Jul 18, 2010 at 3:14 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 14/07/10 09:50, Fujii Masao wrote:
>>
>> Quorum commit
>> -------------
>> In previous discussion about synchronous replication, some people
>> wanted the quorum commit feature. This feature is included in also
>> Zontan's synchronous replication patch, so I decided to create it.
>>
>> The patch provides quorum parameter in postgresql.conf, which
>> specifies how many standby servers transaction commit will wait for
>> WAL records to be replicated to, before the command returns a
>> "success" indication to the client. The default value is zero, which
>> always doesn't make transaction commit wait for replication without
>> regard to replication_mode. Also transaction commit always doesn't
>> wait for replication to asynchronous standby (i.e., replication_mode
>> is set to async) without regard to this parameter. If quorum is more
>> than the number of synchronous standbys, transaction commit returns
>> a "success" when the ACK has arrived from all of synchronous standbys.
>
> There should be a way to specify "wait for *all* connected standby servers
> to acknowledge"

Agreed. I'll allow -1 as a valid value of the quorum parameter, which
means that transaction commit waits for all connected standbys.

>> Protocol
>> --------
>> I extended the handshake message "START_REPLICATION" so that it
>> includes replication_mode read from recovery.conf. If 'async' is
>> passed, the master thinks that it doesn't need to wait for the ACK
>> from the standby.
>
> Please use self-explanatory names for the modes in START_REPLICATION
> command, instead of just an integer.

Agreed. What about changing the START_REPLICATION message to the following?

START_REPLICATION XXX/XXX SYNC_LEVEL { async | recv | fsync | replay }
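
For illustration, a receiver of the proposed message could parse it
along these lines. This is a Python sketch, not the patch's parser: the
function name is hypothetical, and treating XXX/XXX as two hexadecimal
numbers is an assumption.

```python
import re

# Proposed syntax: START_REPLICATION XXX/XXX SYNC_LEVEL { async | recv | fsync | replay }
_START_REPLICATION = re.compile(
    r"^START_REPLICATION\s+([0-9A-Fa-f]+)/([0-9A-Fa-f]+)"
    r"\s+SYNC_LEVEL\s+(async|recv|fsync|replay)$"
)


def parse_start_replication(command):
    """Return (xlogid, xrecoff, sync_level) or raise ValueError."""
    match = _START_REPLICATION.match(command.strip())
    if match is None:
        raise ValueError("malformed START_REPLICATION command: %r" % (command,))
    xlogid = int(match.group(1), 16)
    xrecoff = int(match.group(2), 16)
    return xlogid, xrecoff, match.group(3)
```

Spelling the level out as a keyword means that an unknown or numeric
level is rejected up front instead of being silently misread.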

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Aidan Van Dyk <aidan(at)highrise(dot)ca>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-21 12:52:40
Message-ID:	20100721125240.GI6886@oak.highrise.ca

* Fujii Masao <masao(dot)fujii(at)gmail(dot)com> [100721 03:49]:

> >> The patch provides quorum parameter in postgresql.conf, which
> >> specifies how many standby servers transaction commit will wait for
> >> WAL records to be replicated to, before the command returns a
> >> "success" indication to the client. The default value is zero, which
> >> always doesn't make transaction commit wait for replication without
> >> regard to replication_mode. Also transaction commit always doesn't
> >> wait for replication to asynchronous standby (i.e., replication_mode
> >> is set to async) without regard to this parameter. If quorum is more
> >> than the number of synchronous standbys, transaction commit returns
> >> a "success" when the ACK has arrived from all of synchronous standbys.
> >
> > There should be a way to specify "wait for *all* connected standby servers
> > to acknowledge"
>
> Agreed. I'll allow -1 as the valid value of the quorum parameter, which
> means that transaction commit waits for all connected standbys.

Hm... so if my 1 synchronous standby is operating normally, and quorum
is set to 1, I'll get what I want (commit waits until it's safely on both
servers). But what happens if my standby goes bad? Suddenly the quorum
setting is ignored (because it's > number of connected standby
servers?) Is there a way for me to not allow any commits if the quorum
setting's number of standbys is *not* available? Yes, I want my db to
"halt" in that situation, and yes, alarm bells will be ringing...

In reality, I'm likely to run 2 synchronous slaves, with a quorum of 1.
So 1 slave can fail and I can still have 2 going. But if that 2nd slave
ever failed while the other was down, I definitely don't want the master
to forge on ahead!

Of course, this won't be for everyone, just as the current "just
connected standbys" isn't for everything either...

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-22 02:51:10
Message-ID:	AANLkTimwA0_RphV-_mNsr3+FN==sb68dtYtAcBLS0+bT@mail.gmail.com

On Wed, Jul 21, 2010 at 9:52 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> * Fujii Masao <masao(dot)fujii(at)gmail(dot)com> [100721 03:49]:
>
>> >> The patch provides quorum parameter in postgresql.conf, which
>> >> specifies how many standby servers transaction commit will wait for
>> >> WAL records to be replicated to, before the command returns a
>> >> "success" indication to the client. The default value is zero, which
>> >> always doesn't make transaction commit wait for replication without
>> >> regard to replication_mode. Also transaction commit always doesn't
>> >> wait for replication to asynchronous standby (i.e., replication_mode
>> >> is set to async) without regard to this parameter. If quorum is more
>> >> than the number of synchronous standbys, transaction commit returns
>> >> a "success" when the ACK has arrived from all of synchronous standbys.
>> >
>> > There should be a way to specify "wait for *all* connected standby servers
>> > to acknowledge"
>>
>> Agreed. I'll allow -1 as the valid value of the quorum parameter, which
>> means that transaction commit waits for all connected standbys.
>
> Hm... so if my 1 synchronous standby is operating normally, and quorum
> is set to 1, I'll get what I want (commit waits until it's safely on both
> servers). But what happens if my standby goes bad? Suddenly the quorum
> setting is ignored (because it's > number of connected standby
> servers?) Is there a way for me to not allow any commits if the quorum
> setting's number of standbys is *not* available? Yes, I want my db to
> "halt" in that situation, and yes, alarm bells will be ringing...
>
> In reality, I'm likely to run 2 synchronous slaves, with a quorum of 1.
> So 1 slave can fail and I can still have 2 going. But if that 2nd slave
> ever failed while the other was down, I definitely don't want the master
> to forge on ahead!
>
> Of course, this won't be for everyone, just as the current "just
> connected standbys" isn't for everything either...

Yeah, we need to clear up the detailed design of the quorum commit
feature, and reach consensus on that.

How should synchronous replication behave when the number of connected
standby servers is less than quorum?

1. Ignore quorum. The current patch adopts this. If the ACKs from all
connected standbys have arrived, transaction commit succeeds
even if the number of standbys is less than quorum. If there is no
connected standby, transaction commit always succeeds, regardless
of quorum.

2. Observe quorum. Aidan wants this. Transaction commit waits until
the number of connected standbys has become greater than or equal to
quorum.

Which is the right behavior for quorum commit? Or should we add a new
parameter specifying the behavior of quorum commit?
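
The two behaviors can be compared side by side with a small Python
sketch. This is an illustration under stated assumptions, not code from
the patch: the function name and the "ignore"/"observe" policy labels
are hypothetical, and quorum = -1 is taken to mean "all connected
standbys" as agreed earlier in the thread.

```python
def commit_may_return(quorum, connected_sync, acked, policy):
    """Decide whether a waiting commit may report success.

    quorum         -- quorum setting (0 = never wait, -1 = all connected)
    connected_sync -- number of connected synchronous standbys
    acked          -- how many of them have ACKed the COMMIT record
    policy         -- "ignore" (option 1) or "observe" (option 2)
    """
    if quorum == 0:
        return True
    if quorum == -1:
        return acked >= connected_sync
    if policy == "ignore":
        # Option 1: fall back to however many standbys are connected.
        return acked >= min(quorum, connected_sync)
    if policy == "observe":
        # Option 2: keep waiting until quorum standbys have ACKed.
        return acked >= quorum
    raise ValueError("unknown policy: %r" % (policy,))
```

With quorum = 1 and no connected standby, "ignore" lets the commit
succeed immediately, while "observe" keeps it waiting until a standby
reconnects and ACKs.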

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-22 02:54:25
Message-ID:	AANLkTikvdZZy2+_=hGcYBRTx3d1Ty=6FTqh4mESTtQRY@mail.gmail.com

On Wed, Jul 21, 2010 at 4:48 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> There should be a way to specify "wait for *all* connected standby servers
>> to acknowledge"
>
> Agreed. I'll allow -1 as the valid value of the quorum parameter, which
> means that transaction commit waits for all connected standbys.

Done.

>> Please use self-explanatory names for the modes in START_REPLICATION
>> command, instead of just an integer.
>
> Agreed. What about changing the START_REPLICATION message to?:
>
> START_REPLICATION XXX/XXX SYNC_LEVEL { async | recv | fsync | replay }

Done.

I attached the updated version of the patch.
The code is also available in my git repository:
git://git.postgresql.org/git/users/fujii/postgres.git
branch: synchrep

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment	Content-Type	Size
synch_rep_0722.patch	application/octet-stream	52.5 KB

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-22 08:37:12
Message-ID:	4C480338.9050801@gmail.com

Fujii Masao wrote:
> How should the synchronous replication behave when the number of connected
> standby servers is less than quorum?
>
> 1. Ignore quorum. The current patch adopts this. If the ACKs from all
> connected standbys have arrived, transaction commit is successful
> even if the number of standbys is less than quorum. If there is no
> connected standby, transaction commit always is successful without
> regard to quorum.
>
> 2. Observe quorum. Aidan wants this. Until the number of connected
> standbys has become more than or equal to quorum, transaction commit
> waits.
>
> Which is the right behavior of quorum commit? Or we should add new
> parameter specifying the behavior of quorum commit?
>
Initially I also expected the quorum to behave as described by
Aidan/option 2. Also, IMHO the name "quorum" is a bit short, like having
"maximum" but not saying a max_something.

quorum_min_sync_standbys
quorum_max_sync_standbys

The question remains: what are the sync standbys? Does it mean not-async?
Intuitively, looking at the enumeration of replication_mode, I'd think
that the sync standbys are all standbys that operate in a non-async
mode. That would be clearer with a boolean sync (or not) plus, for sync
standbys, the replication_mode.

regards,
Yeb Havinga

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-26 06:56:40
Message-ID:	AANLkTikUxoA+OTq1TAq-MaGc53iZc4jxTbPOzdSK2DO1@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Jul 22, 2010 at 5:37 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> Fujii Masao wrote:
>>
>> How should the synchronous replication behave when the number of connected
>> standby servers is less than quorum?
>>
>> 1. Ignore quorum. The current patch adopts this. If the ACKs from all
>> connected standbys have arrived, transaction commit is successful
>> even if the number of standbys is less than quorum. If there is no
>> connected standby, transaction commit always is successful without
>> regard to quorum.
>>
>> 2. Observe quorum. Aidan wants this. Until the number of connected
>> standbys has become more than or equal to quorum, transaction commit
>> waits.
>>
>> Which is the right behavior of quorum commit? Or we should add new
>> parameter specifying the behavior of quorum commit?
>>
>
> Initially I also expected the quorum to behave like described by
> Aidan/option 2.

OK. But some people (including me) would like to prevent the master
from halting when the standby fails, so I think that option 1 should
also be supported. So I'm inclined to add a new parameter specifying
the behavior of quorum commit when the number of synchronous standbys
becomes less than quorum.
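
As an illustration only (this is not code from the patch), the two behaviors under discussion could be modelled roughly as below. The function name and the observe_quorum switch are hypothetical stand-ins for whatever new parameter ends up being added.

```python
def commit_can_return(acks_received, connected_sync_standbys, quorum, observe_quorum):
    """Return True if a commit may report success to the client.

    Hypothetical model of the two options in this thread:
      option 1 (observe_quorum=False): ignore quorum when too few standbys are connected
      option 2 (observe_quorum=True):  always wait until `quorum` ACKs have arrived
    """
    if observe_quorum:
        # Option 2: the commit waits even if fewer than `quorum` standbys are connected.
        return acks_received >= quorum
    # Option 1: never wait for more ACKs than there are connected standbys,
    # so with no connected standby the commit succeeds immediately.
    needed = min(quorum, connected_sync_standbys)
    return acks_received >= needed
```

With quorum = 2 and a single connected standby that has acknowledged, option 1 lets the commit return while option 2 keeps it waiting.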

> Also, IMHO the name "quorum" is a bit short, like having
> "maximum" but not saying a max_something.
>
> quorum_min_sync_standbys
> quorum_max_sync_standbys

What about quorum_standbys?

> The question remains what are the sync standbys? Does it mean not-async?

A sync standby is one that sets replication_mode to "recv", "fsync", or "replay".

> Intuitively by looking at the enumeration of replication_mode I'd think that
> the sync standbys are all standby's that operate in a not async mode. That
> would be clearer with a boolean sync (or not) and for sync standbys the
> replication_mode specified.

You mean that something like synchronous_replication should be added as a
recovery.conf parameter in addition to replication_mode? Since increasing
the number of similar parameters would confuse users, I'd rather not do that.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-26 08:27:39
Message-ID:	4C4D46FB.7040609@gmail.com
Lists:	pgsql-hackers

Fujii Masao wrote:
>> Intuitively by looking at the enumeration of replication_mode I'd think that
>> the sync standbys are all standby's that operate in a not async mode. That
>> would be clearer with a boolean sync (or not) and for sync standbys the
>> replication_mode specified.
>>
>
> You mean that something like synchronous_replication as the recovery.conf
> parameter should be added in addition to replication_mode? Since increasing
> the number of similar parameters would confuse users, I don't like do that.
>
I think what would be confusing is a mismatch between
implemented concepts and parameters.

1 does the master wait for standby servers on commit?
2 how many acknowledgements must the master receive before it can continue?
3 is a standby server a synchronous one, i.e. does it acknowledge a commit?
4 when do standby servers acknowledge a commit?
5 does it only wait when the standbys are connected, or also when they
are not connected?
6..?

When trying to match parameter names for the concepts above:
1 - does not exist, but can be answered with quorum_standbys = 0
2 - quorum_standbys
3 - yes, if replication_mode != async (here is where I thought I had to
think too much)
4 - replication modes recv, fsync and replay but not async
5 - Zoltan's strict_sync_replication parameter

Just an idea, what about
for 4: acknowledge_commit = {no|recv|fsync|replay}
then 3 = yes, if acknowledge_commit != no

regards,
Yeb Havinga

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-26 08:41:49
Message-ID:	AANLkTinrSO0r7ZiNRaNV4jtKpAyjSB5nmZZwZsdEPK6x@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jul 26, 2010 at 5:27 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> Fujii Masao wrote:
>>>
>>> Intuitively by looking at the enumeration of replication_mode I'd think
>>> that
>>> the sync standbys are all standby's that operate in a not async mode.
>>> That
>>> would be clearer with a boolean sync (or not) and for sync standbys the
>>> replication_mode specified.
>>>
>>
>> You mean that something like synchronous_replication as the recovery.conf
>> parameter should be added in addition to replication_mode? Since
>> increasing
>> the number of similar parameters would confuse users, I don't like do
>> that.
>>
>
> I think what would be confusing if there is a mismatch between implemented
> concepts and parameters.
>
> 1 does the master wait for standby servers on commit?
> 2 how many acknowledgements must the master receive before it can continue?
> 3 is a standby server a synchronous one, i.e. does it acknowledge a commit?
> 4 when do standby servers acknowledge a commit?
> 5 does it only wait when the standby's are connected, or also when they are
> not connected?
> 6..?
>
> When trying to match parameter names for the concepts above:
> 1 - does not exist, but can be answered with quorum_standbys = 0
> 2 - quorum_standbys
> 3 - yes, if replication_mode != async (here is were I thought I had to think
> to much)
> 4 - replication modes recv, fsync and replay bot not async
> 5 - Zoltan's strict_sync_replication parameter
>
> Just an idea, what about
> for 4: acknowledge_commit = {no|recv|fsync|replay}
> then 3 = yes, if acknowledge_commit != no

Thanks for the clarification.

I still like

replication_mode = {async|recv|fsync|replay}

rather than

synchronous_replication = {on|off}
acknowledge_commit = {no|recv|fsync|replay}

because the former is more intuitive for me and I don't want
to increase the number of parameters.

We need to hear from some users in this respect. If most want
the latter, of course, I'd love to adopt it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-26 09:36:19
Message-ID:	4C4D5713.4050603@gmail.com
Lists:	pgsql-hackers

Fujii Masao wrote:
> I still like
>
> replication_mode = {async|recv|fsync|replay}
>
> rather than
>
> synchronous_replication = {on|off}
> acknowledge_commit = {no|recv|fsync|replay}
>
Hello Fujii,

I wasn't entirely clear. My suggestion was to have only

acknowledge_commit = {no|recv|fsync|replay}

instead of

replication_mode = {async|recv|fsync|replay}

regards,
Yeb Havinga

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-26 10:44:46
Message-ID:	AANLkTimEEgmH-zc7DwBH5jmqW64qDsoZR70MByFQn7JF@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> Fujii Masao wrote:
>>
>> I still like
>>
>> replication_mode = {async|recv|fsync|replay}
>>
>> rather than
>>
>> synchronous_replication = {on|off}
>> acknowledge_commit = {no|recv|fsync|replay}
>>
>
> Hello Fujii,
>
> I wasn't entirely clear. My suggestion was to have only
>
> acknowledge_commit = {no|recv|fsync|replay}
>
> instead of
>
> replication_mode = {async|recv|fsync|replay}

Okay, I'll change the patch accordingly.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Marko Tiikkaja <marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-26 10:48:16
Message-ID:	4C4D67F0.5010301@cs.helsinki.fi
Lists:	pgsql-hackers

On 7/26/10 1:44 PM +0300, Fujii Masao wrote:
> On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga<yebhavinga(at)gmail(dot)com> wrote:
>> I wasn't entirely clear. My suggestion was to have only
>>
>> acknowledge_commit = {no|recv|fsync|replay}
>>
>> instead of
>>
>> replication_mode = {async|recv|fsync|replay}
>
> Okay, I'll change the patch accordingly.

For what it's worth, I think replication_mode is a lot clearer.
Acknowledge_commit sounds like it would do something similar to
asynchronous_commit.

Regards,
Marko Tiikkaja

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Marko Tiikkaja <marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-26 11:25:41
Message-ID:	AANLkTimuqWh6-3D-H9ecoC0FFqMC51DLLfh4v6xgKXbm@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jul 26, 2010 at 6:48 AM, Marko Tiikkaja
<marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi> wrote:
> On 7/26/10 1:44 PM +0300, Fujii Masao wrote:
>>
>> On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga<yebhavinga(at)gmail(dot)com> wrote:
>>>
>>> I wasn't entirely clear. My suggestion was to have only
>>>
>>> acknowledge_commit = {no|recv|fsync|replay}
>>>
>>> instead of
>>>
>>> replication_mode = {async|recv|fsync|replay}
>>
>> Okay, I'll change the patch accordingly.
>
> For what it's worth, I think replication_mode is a lot clearer.
> Acknowledge_commit sounds like it would do something similar to
> asynchronous_commit.

I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Joshua Tolley <eggyknap(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 03:36:32
Message-ID:	4c4e544a.0d87970a.7a9b.ffff8e20@mx.google.com
Lists:	pgsql-hackers

On Thu, Jul 22, 2010 at 10:37:12AM +0200, Yeb Havinga wrote:
> Fujii Masao wrote:
> Initially I also expected the quorum to behave like described by
> Aidan/option 2. Also, IMHO the name "quorum" is a bit short, like having
> "maximum" but not saying a max_something.
>
> quorum_min_sync_standbys
> quorum_max_sync_standbys

Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum
idea is really the best thing for us. I've been thinking about Oracle's way of
doing things[1]. In short, there are three different modes: availability,
performance, and protection. "Protection" appears to mean that at least one
standby has applied the log; "availability" means at least one standby has
received the log info (it doesn't specify whether that info has been fsynced
or applied, but presumably does not mean "applied", since it's distinct from
"protection" mode); "performance" means replication is asynchronous. I'm not
sure this method is perfect, but it might be simpler than the quorum behavior
that has been considered, and adequate for actual use cases.

[1]
http://download.oracle.com/docs/cd/B28359_01/server.111/b28294/protection.htm#SBYDB02000
alternatively, http://is.gd/dLkq4

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 04:41:10
Message-ID:	AANLkTiknQdTsM8rCiQ88=LNj-toTssxrLN0D0Rcus6T3@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jul 27, 2010 at 12:36 PM, Joshua Tolley <eggyknap(at)gmail(dot)com> wrote:
> Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum
> idea is really the best thing for us. I've been thinking about Oracle's way of
> doing things[1]. In short, there are three different modes: availability,
> performance, and protection. "Protection" appears to mean that at least one
> standby has applied the log; "availability" means at least one standby has
> received the log info (it doesn't specify whether that info has been fsynced
> or applied, but presumably does not mean "applied", since it's distinct from
> "protection" mode); "performance" means replication is asynchronous. I'm not
> sure this method is perfect, but it might be simpler than the quorum behavior
> that has been considered, and adequate for actual use cases.

In my case, I'd like to set up one synchronous standby on the near rack for
high-availability, and one asynchronous standby on the remote site for disaster
recovery. Can Oracle's way cover the case?

"availability" mode with two standbys might create a sort of similar situation.
That is, since the ACK from the near standby usually arrives first, the near
standby acts as synchronous and the remote one as asynchronous. But the ACK from
the remote standby can arrive first, so it's not guaranteed that the near standby
has received the log info before transaction commit returns a "success" to the
client. In that case, we would have to fail over to the remote standby even though
it's not under the control of clusterware. This is a problem for me.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Marko Tiikkaja <marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-27 04:43:35
Message-ID:	AANLkTikqUvW6GD825uWQRqkJ9rCLnJW4iXgk0-oy-nz2@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jul 26, 2010 at 8:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Jul 26, 2010 at 6:48 AM, Marko Tiikkaja
> <marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi> wrote:
>> On 7/26/10 1:44 PM +0300, Fujii Masao wrote:
>>>
>>> On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga<yebhavinga(at)gmail(dot)com> wrote:
>>>>
>>>> I wasn't entirely clear. My suggestion was to have only
>>>>
>>>> acknowledge_commit = {no|recv|fsync|replay}
>>>>
>>>> instead of
>>>>
>>>> replication_mode = {async|recv|fsync|replay}
>>>
>>> Okay, I'll change the patch accordingly.
>>
>> For what it's worth, I think replication_mode is a lot clearer.
>> Acknowledge_commit sounds like it would do something similar to
>> asynchronous_commit.
>
> I agree.

As a result of the vote, I'll leave the parameter "replication_mode"
as it is.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-27 05:28:29
Message-ID:	AANLkTimfxY2QJPbrkoobrMLvGP8tGmT-_7ObPKbbWaPC@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jul 21, 2010 at 4:36 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> I was actually hoping to see a patch for these things first, before any of
>> the synchronous replication stuff. Eliminating the polling loops is
>> important, latency will be laughable otherwise, and it will help the
>> synchronous case too.
>
> At first, note that the poll loop in the backend and walreceiver doesn't
> exist without synchronous replication stuff.
>
> Yeah, I'll start with the change of the poll loop in the walsender. I'm
> thinking that we should make the backend signal the walsender to send the
> outstanding WAL immediately as the previous synchronous replication patch
> I submitted in the past year did. I use the signal here because walsender
> needs to wait for the request from the backend and the ack message from
> the standby *concurrently* in synchronous replication. If we use the
> semaphore instead of the signal, the walsender would not be able to
> respond to the ack immediately, which also degrades the performance.
>
> The problem with this idea is that a signal can be sent per transaction commit.
> I'm not sure if this frequent signaling really harms the performance of
> replication. BTW, when I benchmarked the previous synchronous replication
> patch based on the idea, AFAIR the result showed no impact of the
> signaling. But... Thoughts? Do you have another better idea?

The attached patch changes the backend so that it signals walsender to
wake up from the sleep and send WAL immediately. It doesn't include any
other synchronous replication stuff.

The signal is sent right after a COMMIT, PREPARE TRANSACTION,
COMMIT PREPARED or ABORT PREPARED record has been fsync'd.

To suppress redundant signaling, I added a flag which indicates whether
walsender is ready for sending WAL up to the currently-fsync'd location.
Only when the flag is false does the backend set it to true and send the
signal to walsender. When the flag is true, the signal doesn't need to be
sent. The flag is set to false right before walsender sends WAL.
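
The flag protocol above can be sketched as a toy, single-threaded model. This is not the patch's C code: the class and attribute names are hypothetical, a counter stands in for the real signal, and the real check-and-set on shared memory would need whatever locking the patch uses.

```python
class WalSenderModel:
    """Toy model of the 'send requested' flag that suppresses redundant signals."""

    def __init__(self):
        self.send_requested = False  # hypothetical name for the flag
        self.signals_received = 0    # stands in for signals delivered to walsender

    def backend_after_fsync(self):
        """Called by a backend right after its commit record has been fsync'd."""
        if not self.send_requested:
            self.send_requested = True
            self.signals_received += 1  # "send the signal"
            return True
        return False  # walsender already knows it has work to do; skip the signal

    def walsender_send_wal(self):
        """Walsender clears the flag right before it sends WAL."""
        self.send_requested = False


ws = WalSenderModel()
first_burst = [ws.backend_after_fsync() for _ in range(3)]  # three commits in a row
ws.walsender_send_wal()
after_send = ws.backend_after_fsync()  # next commit after walsender has sent
```

In this model three back-to-back commits produce only one signal, and the first commit after walsender has sent produces another.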

The code is also available in my git repository:
git://git.postgresql.org/git/users/fujii/postgres.git
branch: wakeup-walsnd

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment	Content-Type	Size
change_poll_loop_in_walsender_0727.patch	application/octet-stream	8.1 KB

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Marko Tiikkaja <marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-27 08:42:02
Message-ID:	4C4E9BDA.5070202@gmail.com
Lists:	pgsql-hackers

Fujii Masao wrote:
> On Mon, Jul 26, 2010 at 8:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> On Mon, Jul 26, 2010 at 6:48 AM, Marko Tiikkaja
>> <marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi> wrote:
>>
>>> On 7/26/10 1:44 PM +0300, Fujii Masao wrote:
>>>
>>>> On Mon, Jul 26, 2010 at 6:36 PM, Yeb Havinga<yebhavinga(at)gmail(dot)com> wrote:
>>>>
>>>>> I wasn't entirely clear. My suggestion was to have only
>>>>>
>>>>> acknowledge_commit = {no|recv|fsync|replay}
>>>>>
>>>>> instead of
>>>>>
>>>>> replication_mode = {async|recv|fsync|replay}
>>>>>
>>>> Okay, I'll change the patch accordingly.
>>>>
>>> For what it's worth, I think replication_mode is a lot clearer.
>>> Acknowledge_commit sounds like it would do something similar to
>>> asynchronous_commit.
>>>
>> I agree.
>>
>
> As the result of the vote, I'll leave the parameter "replication_mode"
> as it is.
>
I'd like to bring forward another suggestion (please tell me when it is
becoming spam). My feeling about replication_mode as is, is that it says
in the same parameter something about async or sync, as well as, if
sync, which method of feedback to the master. OTOH having two parameters
would need documentation that the feedback method may only be set if the
replication_mode was sync, as well as checks. So it is actually good to
have it all in one parameter.

But somehow the shoe pinches, because async feels different from the
other three values. There is a way to move async out of the enumeration:

synchronous_replication_mode = off | recv | fsync | replay

This also looks a bit like the "synchronous_replication = N # similar in
name to synchronous_commit" Simon Riggs proposed in
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01418.php

regards,
Yeb Havinga

PS: Please bear with me, I thought a bit about a way to make clear what
deduction users must make when figuring out if the replication mode is
synchronous. That question might be important when counting 'which
servers are the synchronous standbys' to debug quorum settings.

replication_mode

from the assumption !async -> sync
and !async -> recv|fsync|replay
to infer recv|fsync|replay -> synchronous_replication.

synchronous_replication_mode

from the assumption !off -> on
and !off -> recv|fsync|replay
to infer recv|fsync|replay -> synchronous_replication.

I think the last inference is easier for humans to make, since everybody will
make the !off -> on assumption, but not the !async -> sync one without having
that verified in the documentation.

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 09:11:44
Message-ID:	4C4EA2D0.2040309@gmail.com
Lists:	pgsql-hackers

Joshua Tolley wrote:
> Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum
> idea is really the best thing for us.
For reference: it appeared in a long thread a while ago
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01226.php.
> In short, there are three different modes: availability,
> performance, and protection. "Protection" appears to mean that at least one
> standby has applied the log; "availability" means at least one standby has
> received the log info
>
Maybe we could do both, by describing use cases along the availability,
performance and protection setups in the documentation and how they
would be reflected with the standby related parameters.

regards,
Yeb Havinga

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 10:39:15
Message-ID:	4C4EB753.1010706@gmail.com
Lists:	pgsql-hackers

Fujii Masao wrote:
> The attached patch changes the backend so that it signals walsender to
> wake up from the sleep and send WAL immediately. It doesn't include any
> other synchronous replication stuff.
>
Hello Fujii,

I noted the changes in XLogSend where instead of *caughtup = true/false
it now returns !MyWalSnd->sndrqst. That value is initialized to false in
that procedure and it cannot be changed to true during execution of that
procedure, or can it?

regards,
Yeb Havinga

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 11:29:29
Message-ID:	AANLkTinEtBRVkD671vCnHjpBSBVBC2niKBoCV7hYphcX@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jul 27, 2010 at 7:39 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> Fujii Masao wrote:
>>
>> The attached patch changes the backend so that it signals walsender to
>> wake up from the sleep and send WAL immediately. It doesn't include any
>> other synchronous replication stuff.
>>
>
> Hello Fujii,

Thanks for the review!

> I noted the changes in XLogSend where instead of *caughtup = true/false it
> now returns !MyWalSnd->sndrqst. That value is initialized to false in that
> procedure and it cannot be changed to true during execution of that
> procedure, or can it?

That value is set to true in WalSndWakeup(). If WalSndWakeup() is called
after initialization of that value in XLogSend(), *caughtup is set to false.
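
A rough model of that interleaving, for illustration only: sndrqst and WalSndWakeup() are the names used in this thread, while the Python dict and the send_wal callback are hypothetical stand-ins for shared memory and the actual send step.

```python
def wal_snd_wakeup(walsnd):
    """Model of WalSndWakeup(): a backend requests that WAL be sent."""
    walsnd["sndrqst"] = True

def xlog_send(walsnd, send_wal):
    """Model of one XLogSend() cycle; returns the 'caughtup' result."""
    walsnd["sndrqst"] = False     # initialized to false at the start of the cycle
    send_wal()                    # a backend may call wal_snd_wakeup() meanwhile
    return not walsnd["sndrqst"]  # caught up only if no new request arrived

walsnd = {"sndrqst": False}
quiet = xlog_send(walsnd, lambda: None)                    # nobody wakes us up
raced = xlog_send(walsnd, lambda: wal_snd_wakeup(walsnd))  # wakeup during the send
```

Here quiet comes out true and raced comes out false, which is the case where walsender should go around again instead of sleeping.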

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Marko Tiikkaja <marko(dot)tiikkaja(at)cs(dot)helsinki(dot)fi>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Boszormenyi Zoltan <zb(at)cybertec(dot)at>
Subject:	Re: Synchronous replication
Date:	2010-07-27 11:42:44
Message-ID:	AANLkTi=wOf6SjdOEscBCMqkMr7zhXcCk+pfQZj-433Ax@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jul 27, 2010 at 5:42 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> I'd like to bring forward another suggestion (please tell me when it is
> becoming spam). My feeling about replication_mode as is, is that is says in
> the same parameter something about async or sync, as well as, if sync, which
> method of feedback to the master. OTOH having two parameters would need
> documentation that the feedback method may only be set if the
> replication_mode was sync, as well as checks. So it is actually good to have
> it all in one parameter
>
> But somehow the shoe pinches, because async feels different from the other
> three parameters. There is a way to move async out of the enumeration:
>
> synchronous_replication_mode = off | recv | fsync | replay

ISTM that we need to get more feedback from users to determine which
is the best. So, how about leaving the parameter as it is and revisiting
this topic later? Since it's not difficult to change the parameter later,
we will not regret delaying that decision.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 11:48:52
Message-ID:	4C4EC7A4.1060009@gmail.com
Lists:	pgsql-hackers

Fujii Masao wrote:
>> I noted the changes in XLogSend where instead of *caughtup = true/false it
>> now returns !MyWalSnd->sndrqst. That value is initialized to false in that
>> procedure and it cannot be changed to true during execution of that
>> procedure, or can it?
>>
>
> That value is set to true in WalSndWakeup(). If WalSndWakeup() is called
> after initialization of that value in XLogSend(), *caughtup is set to false.
>
Ah, so it can be changed by another backend process.

Another question:

Is there a reason not to send the signal in XLogFlush itself, so that it
would be sent from every caller, i.e.

CreateCheckPoint(), EndPrepare(), FlushBuffer(),
RecordTransactionAbortPrepared(), RecordTransactionCommit(),
RecordTransactionCommitPrepared(), RelationTruncate(),
SlruPhysicalWritePage(), write_relmap_file(), WriteTruncateXlogRec(),
and xact_redo_commit()?

regards,
Yeb Havinga

From:	Joshua Tolley <eggyknap(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 13:12:33
Message-ID:	4c4edb4b.08958e0a.7821.1e0c@mx.google.com
Lists:	pgsql-hackers

On Tue, Jul 27, 2010 at 01:41:10PM +0900, Fujii Masao wrote:
> On Tue, Jul 27, 2010 at 12:36 PM, Joshua Tolley <eggyknap(at)gmail(dot)com> wrote:
> > Perhaps I'm hijacking the wrong thread for this, but I wonder if the quorum
> > idea is really the best thing for us. I've been thinking about Oracle's way of
> > doing things[1]. In short, there are three different modes: availability,
> > performance, and protection. "Protection" appears to mean that at least one
> > standby has applied the log; "availability" means at least one standby has
> > received the log info (it doesn't specify whether that info has been fsynced
> > or applied, but presumably does not mean "applied", since it's distinct from
> > "protection" mode); "performance" means replication is asynchronous. I'm not
> > sure this method is perfect, but it might be simpler than the quorum behavior
> > that has been considered, and adequate for actual use cases.
>
> In my case, I'd like to set up one synchronous standby on the near rack for
> high-availability, and one asynchronous standby on the remote site for disaster
> recovery. Can Oracle's way cover the case?

I don't think it can support the case you're interested in, though I'm not
terribly expert on it. I'm definitely not arguing for the syntax Oracle uses,
or something similar; I much prefer the flexibility we're proposing, and agree
with Yeb Havinga in another email who suggests we spell out in documentation
some recipes for achieving various possible scenarios given whatever GUCs we
settle on.

> "availability" mode with two standbys might create a sort of similar situation.
> That is, since the ACK from the near standby normally arrives first, the near
> standby acts as the synchronous one and the remote one as the asynchronous one.
> But the ACK from the remote standby can arrive first, so it's not guaranteed that
> the near standby has received the log info before transaction commit returns
> "success" to the client. In that case, we have to fail over to the remote standby
> even if it's not under the control of clusterware. This is a problem for me.

My concern is that in a quorum system, if the quorum number is less than the
total number of replicas, there's no way to know *which* replicas composed the
quorum for any given transaction, so we can't know which servers to fail over to
if the master dies. This isn't different from Oracle, where it looks like
essentially the "quorum" value is always 1. Your scenario shows that all
replicas are not created equal, and that sometimes we'll be interested in WAL
getting committed on a specific subset of the available servers. If I had two
nearby replicas called X and Y, and one at a remote site called Z, for
instance, I'd set quorum to 2, but really I'd want to say "wait for server X
and Y before committing, but don't worry about Z".

I have no idea how to set up our GUCs to encode a situation like that :)

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 13:17:15
Message-ID:	AANLkTinxi4NCR0xa5tQ=v2uddearnwhk+k1f9GZqQb2r@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jul 27, 2010 at 8:48 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> Is there a reason not to send the signal in XLogFlush itself, so it would be
> called at
>
> CreateCheckPoint(), EndPrepare(), FlushBuffer(),
> RecordTransactionAbortPrepared(), RecordTransactionCommit(),
> RecordTransactionCommitPrepared(), RelationTruncate(),
> SlruPhysicalWritePage(), write_relmap_file(), WriteTruncateXlogRec(), and
> xact_redo_commit().

Yes, it's because there is no need to send WAL immediately from any
function other than the following:

* EndPrepare()
* RecordTransactionAbortPrepared()
* RecordTransactionCommit()
* RecordTransactionCommitPrepared()

Some functions call XLogFlush() to follow the basic WAL rule. In the
standby, WAL records are always flushed to disk prior to any corresponding
data-file change. So, we don't need to replicate the result of XLogFlush()
immediately for the WAL rule.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 13:53:45
Message-ID:	AANLkTik-YzQ7pZ+zT0_3MPtC3sjocPZZS9srbuNxmYOY@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jul 27, 2010 at 10:12 PM, Joshua Tolley <eggyknap(at)gmail(dot)com> wrote:
> I don't think it can support the case you're interested in, though I'm not
> terribly expert on it. I'm definitely not arguing for the syntax Oracle uses,
> or something similar; I much prefer the flexibility we're proposing, and agree
> with Yeb Havinga in another email who suggests we spell out in documentation
> some recipes for achieving various possible scenarios given whatever GUCs we
> settle on.

Agreed. I'll add it to my TODO list.

> My concern is that in a quorum system, if the quorum number is less than the
> total number of replicas, there's no way to know *which* replicas composed the
> quorum for any given transaction, so we can't know which servers to fail over to
> if the master dies.

What about checking the current WAL receive location of each standby by
using pg_last_xlog_receive_location()? The standby which has the newest
location should be failed over to.
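
A minimal sketch of that comparison, assuming the locations have already been
collected from each standby as the text that pg_last_xlog_receive_location()
returns (two hexadecimal numbers separated by a slash). The standby names and
the helper functions are hypothetical. The values are parsed before comparing
because a plain string comparison would order "0/A000000" after "0/10000000".

```python
def parse_lsn(text):
    """Convert an 'X/Y' hexadecimal WAL location into a comparable tuple."""
    hi, lo = text.split("/")
    return (int(hi, 16), int(lo, 16))


def pick_failover_target(receive_locations):
    """Return the name of the standby whose received WAL location is newest.

    receive_locations maps a (hypothetical) standby name to the text that
    pg_last_xlog_receive_location() returned on that standby.
    """
    return max(receive_locations,
               key=lambda name: parse_lsn(receive_locations[name]))


locations = {"near": "0/3000158", "far": "0/2FFFF00"}
print(pick_failover_target(locations))  # prints: near
```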

> This isn't different from Oracle, where it looks like
> essentially the "quorum" value is always 1. Your scenario shows that all
> replicas are not created equal, and that sometimes we'll be interested in WAL
> getting committed on a specific subset of the available servers. If I had two
> nearby replicas called X and Y, and one at a remote site called Z, for
> instance, I'd set quorum to 2, but really I'd want to say "wait for server X
> and Y before committing, but don't worry about Z".
>
> I have no idea how to set up our GUCs to encode a situation like that :)

Yeah, quorum commit alone cannot cover that situation. I think that the
current approach (i.e., quorum commit plus replication mode per standby)
would cover that. In your example, you can choose "recv", "fsync" or
"replay" as replication_mode on X and Y, and choose "async" on Z.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Joshua Tolley <eggyknap(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 14:57:03
Message-ID:	4c4ef3d3.0541730a.3195.352d@mx.google.com
Lists:	pgsql-hackers

On Tue, Jul 27, 2010 at 10:53:45PM +0900, Fujii Masao wrote:
> On Tue, Jul 27, 2010 at 10:12 PM, Joshua Tolley <eggyknap(at)gmail(dot)com> wrote:
> > My concern is that in a quorum system, if the quorum number is less than the
> > total number of replicas, there's no way to know *which* replicas composed the
> > quorum for any given transaction, so we can't know which servers to fail over to
> > if the master dies.
>
> What about checking the current WAL receive location of each standby by
> using pg_last_xlog_receive_location()? The standby which has the newest
> location should be failed over to.

That makes sense. Thanks.

> > This isn't different from Oracle, where it looks like
> > essentially the "quorum" value is always 1. Your scenario shows that all
> > replicas are not created equal, and that sometimes we'll be interested in WAL
> > getting committed on a specific subset of the available servers. If I had two
> > nearby replicas called X and Y, and one at a remote site called Z, for
> > instance, I'd set quorum to 2, but really I'd want to say "wait for server X
> > and Y before committing, but don't worry about Z".
> >
> > I have no idea how to set up our GUCs to encode a situation like that :)
>
> Yeah, quorum commit alone cannot cover that situation. I think that the
> current approach (i.e., quorum commit plus replication mode per standby)
> would cover that. In your example, you can choose "recv", "fsync" or
> "replay" as replication_mode on X and Y, and choose "async" on Z.

Clearly I need to read through the GUCs and docs better. I'll try to keep
quiet until that's finished :)

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-07-27 16:58:03
Message-ID:	7063F342-9066-43C2-9400-53911B1033B2@hi-media.com
Lists:	pgsql-hackers

On 27 Jul 2010, at 15:12, Joshua Tolley <eggyknap(at)gmail(dot)com> wrote:
> My concern is that in a quorum system, if the quorum number is less than the
> total number of replicas, there's no way to know *which* replicas composed the
> quorum for any given transaction, so we can't know which servers to fail over to
> if the master dies. This isn't different from Oracle, where it looks like
> essentially the "quorum" value is always 1. Your scenario shows that all
> replicas are not created equal, and that sometimes we'll be interested in WAL
> getting committed on a specific subset of the available servers. If I had two
> nearby replicas called X and Y, and one at a remote site called Z, for
> instance, I'd set quorum to 2, but really I'd want to say "wait for server X
> and Y before committing, but don't worry about Z".
>
> I have no idea how to set up our GUCs to encode a situation like that :)

You make it so that Z does not take a vote, by setting it async.
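
A minimal sketch of that vote counting, assuming the semantics proposed in this
thread: a standby whose replication_mode is async never counts toward the
quorum. The function and variable names are made up for illustration and are
not taken from the patch.

```python
def commit_may_return(quorum, standbys, acked):
    """Decide whether the master may report "success" to the client.

    standbys maps a standby name to its replication_mode;
    acked is the set of standby names that have acknowledged the WAL.
    Only non-async standbys take a vote.
    """
    votes = sum(1 for name, mode in standbys.items()
                if mode != "async" and name in acked)
    return votes >= quorum


standbys = {"X": "fsync", "Y": "fsync", "Z": "async"}
print(commit_may_return(2, standbys, {"X", "Z"}))  # prints: False
print(commit_may_return(2, standbys, {"X", "Y"}))  # prints: True
```

With quorum set to 2, an acknowledgement from Z never substitutes for one from
X or Y, which is the "wait for X and Y, but don't worry about Z" behaviour.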

Regards,
--
dim

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-01 06:11:25
Message-ID:	4C55100D.5040902@enterprisedb.com
Lists:	pgsql-hackers

On 27/07/10 16:12, Joshua Tolley wrote:
> My concern is that in a quorum system, if the quorum number is less than the
> total number of replicas, there's no way to know *which* replicas composed the
> quorum for any given transaction, so we can't know which servers to fail over to
> if the master dies.

In fact, it's possible for one standby to sync up to X, then disconnect
and reconnect, and have the master count it a second time in the quorum,
especially if the master doesn't notice that the standby disconnected,
e.g. because of a network problem.

I don't think any of this quorum stuff makes much sense without
explicitly registering standbys in the master.

That would also solve the fuzziness with wal_keep_segments - if the
master knew what standbys exist, it could keep track of how far each
standby has received WAL, and keep just enough WAL for each standby to
catch up.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Joshua Tolley <eggyknap(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-01 12:30:08
Message-ID:	AANLkTineufXi05mZSq9t2LvxzdPns8sOHdGw7p1b9QCw@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Aug 1, 2010 at 7:11 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> In fact, it's possible for one standby to sync up to X, then disconnect and
> reconnect, and have the master count it a second time in the quorum,
> especially if the master doesn't notice that the standby disconnected, e.g.
> because of a network problem.
>
> I don't think any of this quorum stuff makes much sense without explicitly
> registering standbys in the master.

This doesn't have to be done manually. The streaming protocol could
include the standby sending its system id to the master. The master
could just keep a list of system ids with the last record they've been
sent and the last they've confirmed receipt, fsync, application,
whatever the protocol covers. If the same system reconnects it just
overwrites the existing data for that system id.
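
A minimal sketch of such a list, assuming each standby presents some identifier
that is unique to it and that WAL positions can be compared as plain integers.
The class, its methods and the identifiers are hypothetical and do not describe
the actual streaming protocol.

```python
class StandbyRegistry:
    """Per-standby progress as seen by the master, keyed by standby id."""

    def __init__(self):
        self._progress = {}

    def report(self, standby_id, sent, received):
        """Record progress; a reconnecting standby overwrites its old entry."""
        self._progress[standby_id] = {"sent": sent, "received": received}

    def count_received(self, target):
        """Count distinct standbys that have received WAL up to `target`."""
        return sum(1 for p in self._progress.values()
                   if p["received"] >= target)


registry = StandbyRegistry()
registry.report("standby-1", sent=100, received=100)
registry.report("standby-1", sent=100, received=100)  # same standby reconnects
registry.report("standby-2", sent=100, received=60)
print(registry.count_received(100))  # prints: 1
```

Because entries are keyed by identifier rather than by connection, a standby
that reconnects is counted once, not twice.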

--
greg

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-01 12:51:57
Message-ID:	AANLkTikhSNKFO+EcgiDs+4N7SSuU1RKpANm1EfuW4uzS@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Aug 1, 2010 at 8:30 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Sun, Aug 1, 2010 at 7:11 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> In fact, it's possible for one standby to sync up to X, then disconnect and
>> reconnect, and have the master count it a second time in the quorum,
>> especially if the master doesn't notice that the standby disconnected, e.g.
>> because of a network problem.
>>
>> I don't think any of this quorum stuff makes much sense without explicitly
>> registering standbys in the master.
>
> This doesn't have to be done manually. The streaming protocol could
> include the standby sending its system id to the master. The master
> could just keep a list of system ids with the last record they've been
> sent and the last they've confirmed receipt, fsync, application,
> whatever the protocol covers. If the same system reconnects it just
> overwrites the existing data for that system id.

That seems entirely too clever. Where are you going to store this
data? What if you want to clean out the list?

I've felt from the beginning that the idea of doing synchronous
replication without having an explicit notion of what standbys are out
there was not on very sound footing, and I think the difficulties of
making quorum commit work properly are only further evidence of that.
Much has been made of the notion of "wait for N votes, but allow
standbys to explicitly give up their vote", but that's still not fully
general - for example, you can't implement A && (B || C).

Perhaps someone will claim that nobody wants to do that anyway (which
I don't believe, BTW), but even in simpler cases it would be nicer to
have an explicit policy rather than - in effect - inferring a policy
from a soup of GUC settings. For example, suppose you want one synchronous
standby (A) and two asynchronous standbys (B and C). You can say
quorum=1 on the master and then configure vote=1 on A and vote=0 on B
and C, but now you have to look at four machines to figure out what
the policy is, and a change on any one of those machines can break it.
ISTM that if you can just write synchronous_standbys=A on the master,
that's a whole lot more clear and less error-prone.
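
The A && (B || C) limitation mentioned above can be checked by brute force. The
sketch below assumes the simplest reading of the scheme: each standby's vote is
either 0 or 1, and a commit succeeds once the votes of the standbys that have
acknowledged reach the quorum. All names are illustrative only.

```python
from itertools import product


def wanted(a, b, c):
    """The policy from the text: A must confirm, plus at least one of B or C."""
    return a and (b or c)


def find_vote_configs():
    """Search every 0/1 vote assignment and quorum for one matching `wanted`."""
    matches = []
    for votes in product((0, 1), repeat=3):
        for quorum in range(4):
            same_everywhere = all(
                (sum(v for v, acked in zip(votes, acks) if acked) >= quorum)
                == wanted(*acks)
                for acks in product((False, True), repeat=3)
            )
            if same_everywhere:
                matches.append((votes, quorum))
    return matches


print(find_vote_configs())  # prints: []
```

The empty result means that no combination of 0/1 votes and a quorum number
reproduces the policy under these assumptions.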

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 02:08:25
Message-ID:	AANLkTi=NzTxgjrEfL7so99eHknxrgb9FQQNEVoyd7x+V@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Aug 1, 2010 at 9:30 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> This doesn't have to be done manually.

Agreed, if we register standbys in the master.

> The streaming protocol could
> include the standby sending its system id to the master. The master
> could just keep a list of system ids with the last record they've been
> sent and the last they've confirmed receipt, fsync, application,
> whatever the protocol covers. If the same system reconnects it just
> overwrites the existing data for that system id.

Since every standby has the same system id, we cannot distinguish
them by that id. ISTM that the master should assign a unique id
to each standby, and each standby should save it in pg_control.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 02:26:10
Message-ID:	AANLkTimC5vRyaRUMRgAxX6qi5oZ2MsVx4nWO=f0_P55E@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Aug 1, 2010 at 10:08 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sun, Aug 1, 2010 at 9:30 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> This doesn't have to be done manually.
>
> Agreed, if we register standbys in the master.
>
>> The streaming protocol could
>> include the standby sending its system id to the master. The master
>> could just keep a list of system ids with the last record they've been
>> sent and the last they've confirmed receipt, fsync, application,
>> whatever the protocol covers. If the same system reconnects it just
>> overwrites the existing data for that system id.
>
> Since every standby has the same system id, we cannot distinguish
> them by that id. ISTM that the master should assign a unique id
> to each standby, and each standby should save it in pg_control.

Another option might be to let the user name them.

standby_name='near'
standby_name='far1'
standby_name='far2'

...or whatever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 08:45:40
Message-ID:	AANLkTi=JwopeetrtUP4czTHuzLfzz7sJrSY2c6HpVCDh@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Aug 1, 2010 at 3:11 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I don't think any of this quorum stuff makes much sense without explicitly
> registering standbys in the master.

I'm not sure this is a good idea. It requires users to do more manual
work than before when setting up replication: assign a unique name (or
ID) to each standby, register them in the master, specify the names in
each recovery.conf (or elsewhere), and remove the registration from the
master when getting rid of a standby.

But this is similar to how MySQL replication is set up, so some
people (excluding me) may be familiar with it.

> That would also solve the fuzziness with wal_keep_segments - if the master
> knew what standbys exist, it could keep track of how far each standby has
> received WAL, and keep just enough WAL for each standby to catch up.

What if the registered standby stays down for a long time?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 09:02:34
Message-ID:	AANLkTimNM65jc7TPCag=hUGrD=CHH60jV1XZeXrnNSsW@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Aug 1, 2010 at 9:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Perhaps someone will claim that nobody wants to do that anyway (which
> I don't believe, BTW), but even in simpler cases it would be nicer to
> have an explicit policy rather than - in effect - inferring a policy
> from a soup of GUC settings. For example, suppose you want one synchronous
> standby (A) and two asynchronous standbys (B and C). You can say
> quorum=1 on the master and then configure vote=1 on A and vote=0 on B
> and C, but now you have to look at four machines to figure out what
> the policy is, and a change on any one of those machines can break it.
> ISTM that if you can just write synchronous_standbys=A on the master,
> that's a whole lot more clear and less error-prone.

Some standbys may become the master later through failover. So we would
need to write something like synchronous_standbys=A not only on the
current master but also on those standbys. Changing
synchronous_standbys would require a change on all those servers.
Or should the master replicate even that change to the standbys?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 10:53:58
Message-ID:	AANLkTimqo5CUum0L1JRiRmaJPd3LQ-wUgJ+o5n=HWXSm@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Aug 2, 2010 at 5:02 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sun, Aug 1, 2010 at 9:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Perhaps someone will claim that nobody wants to do that anyway (which
>> I don't believe, BTW), but even in simpler cases it would be nicer to
>> have an explicit policy rather than - in effect - inferring a policy
>> from a soup of GUC settings. For example, suppose you want one synchronous
>> standby (A) and two asynchronous standbys (B and C). You can say
>> quorum=1 on the master and then configure vote=1 on A and vote=0 on B
>> and C, but now you have to look at four machines to figure out what
>> the policy is, and a change on any one of those machines can break it.
>> ISTM that if you can just write synchronous_standbys=A on the master,
>> that's a whole lot more clear and less error-prone.
>
> Some standbys may become the master later through failover. So we would
> need to write something like synchronous_standbys=A not only on the
> current master but also on those standbys. Changing
> synchronous_standbys would require a change on all those servers.
> Or should the master replicate even that change to the standbys?

Let's not get *the manner of specifying the policy* confused with *the
need to update the policy when the master changes*. It doesn't seem
likely you would want the same value for synchronous_standbys on all
your machines. In the most common configuration, you'd probably have:

on A: synchronous_standbys=B
on B: synchronous_standbys=A

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 11:06:28
Message-ID:	AANLkTimVz-849Fn1xCc=OCmg9YMXsnLxS-2Qy3ykzZPG@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Let's not get *the manner of specifying the policy* confused with *the
> need to update the policy when the master changes*. It doesn't seem
> likely you would want the same value for synchronous_standbys on all
> your machines. In the most common configuration, you'd probably have:
>
> on A: synchronous_standbys=B
> on B: synchronous_standbys=A

Oh, true. But what if we have another synchronous standby called C?
Would we specify the policy as follows?

on A: synchronous_standbys=B,C
on B: synchronous_standbys=A,C
on C: synchronous_standbys=A,B

We would need to change the setting on both A and B when we want to
change the name of the third standby from C to D, for example. No?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 11:32:05
Message-ID:	AANLkTi=jdEniuS9fpJahW_FnLjD_u_98R3wDe0ZJ=z9d@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Aug 2, 2010 at 7:06 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Let's not get *the manner of specifying the policy* confused with *the
>> need to update the policy when the master changes*. It doesn't seem
>> likely you would want the same value for synchronous_standbys on all
>> your machines. In the most common configuration, you'd probably have:
>>
>> on A: synchronous_standbys=B
>> on B: synchronous_standbys=A
>
> Oh, true. But what if we have another synchronous standby called C?
> Would we specify the policy as follows?
>
> on A: synchronous_standbys=B,C
> on B: synchronous_standbys=A,C
> on C: synchronous_standbys=A,B
>
> We would need to change the setting on both A and B when we want to
> change the name of the third standby from C to D, for example. No?

Sure. If you give the standbys names, then if people change the
names, they'll have to update their configuration. But I can't see
that as an argument against doing it. You can remove the possibility
that someone will have a hassle if they rename a server by not
allowing them to give it a name in the first place, but that doesn't
seem like a win from a usability perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 12:03:40
Message-ID:	AANLkTi=LgM5XTg+Mr3FcR26duGFeFE=AGdOKCWEiW=Nz@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Aug 2, 2010 at 8:32 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Sure. If you give the standbys names, then if people change the
> names, they'll have to update their configuration. But I can't see
> that as an argument against doing it. You can remove the possibility
> that someone will have a hassle if they rename a server by not
> allowing them to give it a name in the first place, but that doesn't
> seem like a win from a usability perspective.

I'm just comparing your idea (i.e., set synchronous_standbys on
each possible master) with my idea (i.e., set replication_mode on
each standby). Though your idea has the advantage described in the
following post, it seems to make the setup of the standbys more
complicated, as I described. So I'm trying to come up with a better idea.
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00007.php

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 12:57:47
Message-ID:	4C56C0CB.2020407@gmail.com
Lists:	pgsql-hackers

Fujii Masao wrote:
> On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> Let's not get *the manner of specifying the policy* confused with *the
>> need to update the policy when the master changes*. It doesn't seem
>> likely you would want the same value for synchronous_standbys on all
>> your machines. In the most common configuration, you'd probably have:
>>
>> on A: synchronous_standbys=B
>> on B: synchronous_standbys=A
>>
>
> Oh, true. But what if we have another synchronous standby called C?
> Would we specify the policy as follows?
>
> on A: synchronous_standbys=B,C
> on B: synchronous_standbys=A,C
> on C: synchronous_standbys=A,B
>
> We would need to change the setting on both A and B when we want to
> change the name of the third standby from C to D, for example. No?
>
What if the master is named as well in the 'pool of servers that are in
sync'? In the scenario above this pool would be A,B,C. The benefit of this
concept is that the setting can be copied to all other servers as well,
and is invariant under any number of failures or switchovers. The same
could also hold for quorum expressions like A && (B || C), if A, B and C
are each either master or standby.
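
A minimal sketch of the pool idea, assuming each server knows its own name and
simply leaves itself out of the shared list when it acts as master. The
function name is hypothetical, and synchronous_standbys is the setting name
used earlier in this thread, not an existing parameter.

```python
def synchronous_peers(pool, local_name):
    """Derive the standbys a server should wait for from the shared pool."""
    return [name for name in pool if name != local_name]


pool = ["A", "B", "C"]  # the same value would be copied to every server
for server in pool:
    peers = ",".join(synchronous_peers(pool, server))
    print(f"on {server}: synchronous_standbys={peers}")
```

This reproduces the per-server values quoted above from a single shared
setting, so a failover or switchover would not require editing it.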

I initially thought that once the definitions could be the same on all
servers, having them in a system catalog would be a good thing. However,
that would probably be hard to set up, and if a failure occurred while
the parameters were being changed it could become very messy.

regards,
Yeb Havinga

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-02 13:02:34
Message-ID:	AANLkTimJoPc8FQskQwvZbgfBfx2Op4UWbgCqjCwRjrVN@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Aug 2, 2010 at 8:57 AM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> Fujii Masao wrote:
>>
>> On Mon, Aug 2, 2010 at 7:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>>>
>>> Let's not get *the manner of specifying the policy* confused with *the
>>> need to update the policy when the master changes*. It doesn't seem
>>> likely you would want the same value for synchronous_standbys on all
>>> your machines. In the most common configuration, you'd probably have:
>>>
>>> on A: synchronous_standbys=B
>>> on B: synchronous_standbys=A
>>>
>>
>> Oh, true. But what if we have another synchronous standby called C?
>> Would we specify the policy as follows?
>>
>> on A: synchronous_standbys=B,C
>> on B: synchronous_standbys=A,C
>> on C: synchronous_standbys=A,B
>>
>> We would need to change the setting on both A and B when we want to
>> change the name of the third standby from C to D, for example. No?
>>
>
> What if the master is named as well in the 'pool of servers that are in
> sync'? In the scenario above this pool would be A,B,C. Working with this
> concept has the benefit that the setting can be copied to all other servers
> as well, and is invariant under any number of failures or switchovers. The
> same could also hold for quorum expressions like A && (B || C), if A,B,C are
> either master or standby.
>
> I initially thought that once the definitions could be the same on all
> servers, having them in a system catalog would be a good thing. However,
> that'd probably be hard to set up, and in the case of failures during
> a change of the parameters it could become very messy.

Yeah, I think this information has to be stored either in GUCs or in a
flat-file somewhere. Putting it in a system catalog will cause major
problems when trying to get a down system back up, I think.

I suspect that for complex setups, people will need to use some kind
of cluster-ware to update the settings as nodes go up and down. But I
think it will still be simpler if the nodes are named.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-03 15:35:47
Message-ID:	4C583753.70709@enterprisedb.com
Lists:	pgsql-hackers

On 27/07/10 13:29, Fujii Masao wrote:
> On Tue, Jul 27, 2010 at 7:39 PM, Yeb Havinga<yebhavinga(at)gmail(dot)com> wrote:
>> Fujii Masao wrote:
>> I noted the changes in XLogSend where instead of *caughtup = true/false it
>> now returns !MyWalSnd->sndrqst. That value is initialized to false in that
>> procedure and it cannot be changed to true during execution of that
>> procedure, or can it?
>
> That value is set to true in WalSndWakeup(). If WalSndWakeup() is called
> after initialization of that value in XLogSend(), *caughtup is set to false.

There are some race conditions with the signaling. If another process
finishes XLOG flush and sends the signal when a walsender has just
finished one iteration of its main loop, walsender will reset
xlogsend_requested and go to sleep. It should not sleep but send the
pending WAL immediately.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-04 13:38:16
Message-ID:	4C596D48.5030400@enterprisedb.com
Lists:	pgsql-hackers

On 02/08/10 11:45, Fujii Masao wrote:
> On Sun, Aug 1, 2010 at 3:11 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I don't think any of this quorum stuff makes much sense without explicitly
>> registering standbys in the master.
>
> I'm not sure if this is a good idea. This requires users to do more
> manual operations than ever when setting up the replication: assign a
> unique name (or ID) to each standby, register them in the master,
> specify the names in each recovery.conf (or elsewhere), and remove
> the registration from the master when getting rid of the standby.
>
> But this is similar to how MySQL replication is set up, so some
> people (excluding me) may be familiar with it.
>
>> That would also solve the fuzziness with wal_keep_segments - if the master
>> knew what standbys exist, it could keep track of how far each standby has
>> received WAL, and keep just enough WAL for each standby to catch up.
>
> What if the registered standby stays down for a long time?

Then you risk running out of disk space. Similar to having an archive
command that fails for some reason.

That's one reason the registration should not be too automatic - there
are serious repercussions if the standby just disappears. If the standby
is a synchronous one, the master will stop committing or delay
acknowledging commits, depending on the configuration, and the master
needs to keep extra WAL around.

Of course, we can still support unregistered standbys, with the current
semantics.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-05 10:40:32
Message-ID:	AANLkTi=-x0C0=1a9gnwzLT=nAtftTnQ+FXuNuP8CUbms@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Aug 4, 2010 at 12:35 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> There are some race conditions with the signaling. If another process finishes
> XLOG flush and sends the signal when a walsender has just finished one
> iteration of its main loop, walsender will reset xlogsend_requested and go
> to sleep. It should not sleep but send the pending WAL immediately.

Yep. To avoid that race condition, xlogsend_requested should be reset to
false after sleep and before calling XLogSend(). I attached the updated
version of the patch.

Of course, the code is also available in my git repository:
git://git.postgresql.org/git/users/fujii/postgres.git
branch: wakeup-walsnd

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment	Content-Type	Size
change_poll_loop_in_walsender_0805.patch	application/octet-stream	8.1 KB

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-05 14:14:06
Message-ID:	AANLkTi=T-NGoVQ3Ugp=KsJZSmJBLuwjKvTHz+mrV++TC@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Aug 4, 2010 at 10:38 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Then you risk running out of disk space. Similar to having an archive
> command that fails for some reason.
>
> That's one reason the registration should not be too automatic - there are
> serious repercussions if the standby just disappears. If the standby is a
> synchronous one, the master will stop committing or delay acknowledging
> commits, depending on the configuration, and the master needs to keep extra
> WAL around.

Umm... in addition to registration of each standby, I think we should allow
users to set an upper limit on the number of WAL files kept in pg_xlog to
avoid running out of disk space. If the number exceeds that limit, the master
disconnects the standbys that have fallen too far behind and removes all the
WAL files not required by the currently connected standbys. If you don't want
any standby to disappear unexpectedly because of the upper limit, you can set
it to 0 (= no limit).
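
A rough sketch of that retention rule, assuming WAL segments can be
represented as increasing integers and that the master knows the oldest
segment each registered standby still needs. The function and parameter
names are hypothetical and do not correspond to any existing setting.

```python
# Hypothetical illustration; segment numbers are plain integers here.

def plan_retention(standby_positions, current_segment, max_kept_segments):
    """Decide which WAL segments to keep and which standbys to disconnect.

    standby_positions maps a standby name to the oldest segment it needs.
    max_kept_segments == 0 means "no limit".
    Returns (oldest_segment_to_keep, names_of_standbys_to_disconnect).
    """
    kept = dict(standby_positions)
    dropped = []
    if max_kept_segments > 0:
        # Oldest segment that still fits under the cap.
        floor = current_segment - max_kept_segments + 1
        for name, segment in standby_positions.items():
            if segment < floor:
                dropped.append(name)
                del kept[name]
    # With no standbys left, only the current segment has to stay.
    oldest = min(kept.values(), default=current_segment)
    return oldest, sorted(dropped)


if __name__ == "__main__":
    positions = {"foo": 90, "bar": 40}
    print(plan_retention(positions, current_segment=100, max_kept_segments=20))
    print(plan_retention(positions, current_segment=100, max_kept_segments=0))
```

With a limit of 0 the lagging standby is never disconnected, so the master
keeps every segment back to the slowest standby, which is the disk-space
risk raised earlier in the thread.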

I'm thinking of having users register and unregister each standby via SQL
functions like register_standby() and unregister_standby():

void register_standby(standby_name text, streaming_start_lsn text)
void unregister_standby(standby_name text)

Note that standby_name should be specified in recovery.conf of each
standby.

By using them, we can easily specify which WAL files must be kept for a
new standby when taking its base backup, as follows:

SELECT register_standby('foo', pg_start_backup())

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-09 20:58:27
Message-ID:	4C606BF3.5020308@enterprisedb.com
Lists:	pgsql-hackers

On 05/08/10 17:14, Fujii Masao wrote:
> I'm thinking of having users register and unregister each standby via SQL
> functions like register_standby() and unregister_standby():

The register/unregister facility should be accessible from the streaming
replication connection, so that you don't need to connect to any
particular database in addition to the streaming connection.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Joshua Tolley <eggyknap(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-09 21:00:47
Message-ID:	4C606C7F.3060308@enterprisedb.com
Lists:	pgsql-hackers

On 01/08/10 15:30, Greg Stark wrote:
> On Sun, Aug 1, 2010 at 7:11 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I don't think any of this quorum stuff makes much sense without explicitly
>> registering standbys in the master.
>
> This doesn't have to be done manually. The streaming protocol could
> include the standby sending its system id to the master. The master
> could just keep a list of system ids with the last record they've been
> sent and the last they've confirmed receipt, fsync, application,
> whatever the protocol covers. If the same system reconnects it just
> overwrites the existing data for that system id.

Systemid doesn't work for that. Systemid is assigned at initdb time, so
all the standbys have the same systemid as the master.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-09 21:22:14
Message-ID:	4C607186.7050006@enterprisedb.com
Lists:	pgsql-hackers

I wonder if we can continue to rely on the pg_usleep() loop for sleeping
in walsender. On those platforms where signals don't interrupt sleep,
sending the signal is not going to promptly wake up walsender. That was
fine before, but any delay is going to be poison to synchronous
replication performance.

Thoughts?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-09 21:34:55
Message-ID:	201008092134.o79LYtx15333@momjian.us
Lists:	pgsql-hackers

Fujii Masao wrote:
> On Wed, Aug 4, 2010 at 10:38 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> > Then you risk running out of disk space. Similar to having an archive
> > command that fails for some reason.
> >
> > That's one reason the registration should not be too automatic - there are
> > serious repercussions if the standby just disappears. If the standby is a
> > synchronous one, the master will stop committing or delay acknowledging
> > commits, depending on the configuration, and the master needs to keep extra
> > WAL around.
>
> Umm... in addition to registration of each standby, I think we should allow
> users to set an upper limit on the number of WAL files kept in pg_xlog to
> avoid running out of disk space. If the number exceeds that limit, the master
> disconnects the standbys that have fallen too far behind and removes all the
> WAL files not required by the currently connected standbys. If you don't want
> any standby to disappear unexpectedly because of the upper limit, you can set
> it to 0 (= no limit).
>
> I'm thinking of having users register and unregister each standby via SQL
> functions like register_standby() and unregister_standby():
>
> void register_standby(standby_name text, streaming_start_lsn text)
> void unregister_standby(standby_name text)
>
> Note that standby_name should be specified in recovery.conf of each
> standby.
>
> By using them, we can easily specify which WAL files must be kept for a
> new standby when taking its base backup, as follows:
>
> SELECT register_standby('foo', pg_start_backup())

I know there has been discussion about how to identify the standby
servers --- how about using the connection application_name in
recovery.conf:

primary_conninfo = 'host=localhost port=5432 application_name=slave1'

The good part is that once recovery.conf goes away because the server isn't
a standby anymore, the application_name is gone.

An even more interesting approach would be to specify the replication
mode in the application_name:

primary_conninfo = 'host=localhost port=5432 application_name=replay'

and imagine being able to view the status of standby servers from
pg_stat_activity. (Right now standby servers do not appear in
pg_stat_activity.)
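
As a sketch of how a receiver might read that identifier, the snippet below
pulls application_name out of a conninfo string and checks whether it names
one of the four replication modes listed at the start of the thread. The
parser is deliberately simplified (it does not handle spaces around "=" or
backslash escapes the way libpq does), and the function names are
hypothetical.

```python
import shlex

# Modes proposed in the first message of the thread.
REPLICATION_MODES = {"async", "recv", "fsync", "replay"}


def parse_conninfo(conninfo):
    """Split a 'keyword=value ...' string into a dict (simplified parser)."""
    params = {}
    for token in shlex.split(conninfo):
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected keyword=value, got {token!r}")
        params[key] = value
    return params


def standby_label(conninfo):
    """Return (application_name, True if it is a replication mode name)."""
    name = parse_conninfo(conninfo).get("application_name")
    return name, name in REPLICATION_MODES


if __name__ == "__main__":
    print(standby_label("host=localhost port=5432 application_name=slave1"))
    print(standby_label("host=localhost port=5432 application_name=replay"))
```

Note that the two uses compete for the same field: a value such as "replay"
selects a mode but no longer identifies which standby is connected.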

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Synchronous replication
Date:	2010-08-16 12:50:24
Message-ID:	4C693410.8010502@enterprisedb.com
Lists:	pgsql-hackers

On 05/08/10 13:40, Fujii Masao wrote:
> On Wed, Aug 4, 2010 at 12:35 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> There are some race conditions with the signaling. If another process finishes
>> XLOG flush and sends the signal when a walsender has just finished one
>> iteration of its main loop, walsender will reset xlogsend_requested and go
>> to sleep. It should not sleep but send the pending WAL immediately.
>
> Yep. To avoid that race condition, xlogsend_requested should be reset to
> false after sleep and before calling XLogSend(). I attached the updated
> version of the patch.

There's still a small race condition: if you receive the signal just
before entering pg_usleep(), it will not be interrupted.

Of course, on platforms where signals don't interrupt sleep, the problem
is even bigger. Magnus reminded me that we can use select() instead of
pg_usleep() on such platforms, but that's still vulnerable to the race
condition.

ppoll() or pselect() could be used, but I don't think they're fully
portable. I think we'll have to resort to the self-pipe trick mentioned
in the Linux select(3) man page:

> On systems that lack pselect(), reliable (and
> more portable) signal trapping can be achieved using the self-pipe
> trick (where a signal handler writes a byte to a pipe whose other end
> is monitored by select() in the main program.)
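
A minimal Python sketch of the self-pipe trick, for illustration rather than
as a proposal for the walsender code. It assumes a Unix platform, uses
SIGUSR1 as a stand-in for the wakeup signal, and must run in the main
thread; the helper names are hypothetical.

```python
import os
import select
import signal


def make_wakeup_pipe(signum=signal.SIGUSR1):
    """Install a handler that writes one byte to a pipe on each signal."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    os.set_blocking(write_fd, False)

    def handler(signo, frame):
        try:
            os.write(write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full, so a wakeup is already pending

    signal.signal(signum, handler)
    return read_fd, write_fd


def wait_for_wakeup(read_fd, timeout):
    """Sleep until a signal has arrived or `timeout` seconds have passed."""
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if not ready:
        return False
    try:
        while os.read(read_fd, 4096):
            pass  # drain every queued wakeup byte
    except BlockingIOError:
        pass  # pipe is empty again
    return True


if __name__ == "__main__":
    rfd, _ = make_wakeup_pipe()
    # The signal arrives *before* the wait starts, the case a plain sleep misses.
    os.kill(os.getpid(), signal.SIGUSR1)
    print("woken by signal:", wait_for_wakeup(rfd, 5.0))
```

Because the byte stays in the pipe until it is read, a signal delivered just
before the wait still makes select() return at once, which closes the window
that remains with a bare sleep.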

Another idea is to use something different than Unix signals, like
ProcSendSignal/ProcWaitForSignal which are implemented using semaphores.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com