Synchronous replication, network protocol

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Synchronous replication, network protocol
Date: 2008-12-23 16:23:38
Message-ID: 4951108A.5040608@enterprisedb.com

The protocol between primary and standby hasn't been discussed or
documented in detail.

I don't think it's enough to just stream WAL as it's generated, so
here's my proposal. Messages marked with "(later)" are for features that
have been discussed, but no one is implementing for 8.4. The messages
are sent like in the frontend/backend protocol. The handshake can work
like in the current patch, although I don't think we need or should
allow running regular queries before entering "replication mode". The
backend should become a walsender process directly after authentication.

Standby -> primary

RequestWAL <begin> <end>
Primary should respond with a WALRange message containing the given
range of WAL data.

StartReplication <begin>
Primary should send all already-generated WAL beginning from <begin>,
and then keep sending as it's generated.

ReplicatedUpTo <end>
Acknowledge that all WAL up to <end> has been successfully received and
written to disk and/or fsync'd (depending on the replication mode in
use). The primary can use this information to acknowledge a transaction
as committed to the client in case of synchronous replication.

(later) OldestXmin <xid>
When a hot standby server is running read-only queries, indicates the
current OldestXmin on the standby. The primary can refrain from
vacuuming tuples still required by the slave using this value, if so
configured. That will ensure that the standby doesn't need to stall WAL
application because of read-only queries.

(later) RequestBaseBackup
Request a new base backup to be sent. This can be used to initialize a
new slave.

Primary -> standby

WALRange <begin> <end> <data>
Response to RequestWAL or StartReplication message. After receiving a
StartReplication message, the primary can send these messages when it
feels like it. In synchronous mode, that would be at least at each
commit. The standby should respond with a ReplicatedUpTo message to each
WALRange message.

(later) BaseBackup <data>
A base backup, in response to RequestBaseBackup message. For example,
in .tar.gz format.
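
A minimal sketch of how the WALRange / ReplicatedUpTo exchange could be
framed, assuming the frontend/backend convention of a one-byte type code
followed by an int32 length that counts itself. The type codes, the 64-bit
WAL positions and every function name below are hypothetical; the proposal
does not fix any of them.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

# Hypothetical one-byte type codes; the proposal does not assign any.
MSG_WAL_RANGE = b"w"
MSG_REPLICATED_UP_TO = b"a"


def frame(msg_type: bytes, payload: bytes) -> bytes:
    """Frame a message as: type byte, int32 length (counting itself), payload."""
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def unframe(buf: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split one framed message off the front of buf: (type, payload, rest)."""
    msg_type = buf[:1]
    (length,) = struct.unpack("!I", buf[1:5])
    end = 1 + length
    return msg_type, buf[5:end], buf[end:]


@dataclass
class WALRange:
    begin: int   # WAL position as a plain 64-bit byte offset (an assumption)
    end: int
    data: bytes

    def encode(self) -> bytes:
        header = struct.pack("!QQ", self.begin, self.end)
        return frame(MSG_WAL_RANGE, header + self.data)

    @staticmethod
    def decode(payload: bytes) -> "WALRange":
        begin, end = struct.unpack("!QQ", payload[:16])
        return WALRange(begin, end, payload[16:])


def replicated_up_to(end: int) -> bytes:
    """Build the standby's acknowledgement for WAL received up to <end>."""
    return frame(MSG_REPLICATED_UP_TO, struct.pack("!Q", end))


def standby_handle(buf: bytes) -> bytes:
    """Standby side: consume one WALRange and build the ReplicatedUpTo reply."""
    msg_type, payload, _rest = unframe(buf)
    if msg_type != MSG_WAL_RANGE:
        raise ValueError("expected a WALRange message")
    wal = WALRange.decode(payload)
    if wal.end - wal.begin != len(wal.data):
        raise ValueError("WALRange length does not match its data")
    # A real standby would write (and possibly fsync) wal.data here,
    # depending on the replication mode, before acknowledging.
    return replicated_up_to(wal.end)
```

In this sketch the primary would treat a commit as replicated once it sees
a ReplicatedUpTo whose <end> is at or past the commit record.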

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-23 17:15:08
Message-ID: 1230052508.4793.913.camel@ebony.2ndQuadrant


On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:
> (later) OldestXmin <xid>
> When a hot standby server is running read-only queries,
> indicates the
> current OldestXmin on the standby. The primary can refrain from
> vacuuming tuples still required by the slave using this value, if so
> configured.

This is all reading like you are relaying someone else's thoughts, or
that of a committee.

The above is the exact opposite of your position on 11 Sep, where you
said having a matching xmin between primary and standby "makes an awful
solution for high availability" which Richard, Greg, Robert at least
agreed explicitly with.

I *am* happy to rediscuss this aspect, because I think you may now see
the problems with what people had earlier ruled out. But it would be
good to understand the reason for the 180 degree manoeuvre before we
start coding up protocol changes.

> That will ensure that the standby doesn't need to stall WAL
> application because of read-only queries.

It doesn't need to. That is already optional.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-23 17:53:55
Message-ID: 1230054835.4793.939.camel@ebony.2ndQuadrant


On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:

> I don't think we need or should
> allow running regular queries before entering "replication mode". The
> backend should become a walsender process directly after authentication.

+1

> Standby -> primary
>
> RequestWAL <begin> <end>
> Primary should respond with a WALRange message containing the given
> range of WAL data.
>
> StartReplication <begin>
> Primary should send all already-generated WAL beginning from <begin>,
> and then keep sending as it's generated.

Can you give a quick example of how these would be used?

Fujii-san and others considered that having replication start early was
an important requirement. If we do these operations serially on the same
connection
* copy all bulk data
* start streaming
then there is a considerable delay before replication can begin. In the
case of some large sites, perhaps as long as 18-24 hrs.

> ReplicatedUpTo <end>
> Acknowledge that all WAL up to <end> has been successfully received and
> written to disk and/or fsync'd (depending on the replication mode in
> use). The primary can use this information to acknowledge a transaction
> as committed to the client in case of synchronous replication.

+1

> Primary -> standby
>
> WALRange <begin> <end> <data>
> Response to RequestWAL or StartReplication message. After receiving a
> StartReplication message, the primary can send these messages when it
> feels like it. In synchronous mode, that would be at least at each
> commit. The standby should respond with a ReplicatedUpTo message to each
> WALRange message.

+1

> (later) RequestBaseBackup
> Request a new base backup to be sent. This can be used to initialize a
> new slave.

> (later) BaseBackup <data>
> A base backup, in response to RequestBaseBackup message. For example,
> in .tar.gz format.

Experience from Slony shows that single-threading the initial data send
is not a great idea for large databases, since it limits the bandwidth
even if you have more available. (Slony has no choice because of the
current single-transaction => single-thread requirement). Being able to
take a base backup in parallel is an important feature with large
databases. I think we need to offer an option here rather than force use
of a single thread, though I agree that a single thread may be the more
convenient option for many people.

Rumour has it that Slony might move towards a synchronisation method that
uses a base backup and PITR as its starting point.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-23 18:42:48
Message-ID: 3f0b79eb0812231042n14b57a66p7b8d7cb726c5d3c0@mail.gmail.com

Hi,

Thanks for clarifying!

On Wed, Dec 24, 2008 at 2:53 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:
>
>> I don't think we need or should
>> allow running regular queries before entering "replication mode". The
>> backend should become a walsender process directly after authentication.
>
> +1

OK, I will re-examine it. But, at the least, we need to send a
ReadyForQuery message after authentication and before sending WAL,
because walreceiver uses libpq (PQsetdbLogin), which doesn't return until
it receives ReadyForQuery.

>
>> Standby -> primary
>>
>> RequestWAL <begin> <end>
>> Primary should respond with a WALRange message containing the given
>> range of WAL data.
>>
>> StartReplication <begin>
>> Primary should send all already-generated WAL beginning from <begin>,
>> and then keep sending as it's generated.
>
> Can you give a quick example of how these would be used?
>
> Fujii-san and others considered that having replication start early was
> an important requirement. If we do these operations serially on the same
> connection
> * copy all bulk data
> * start streaming
> then there is a considerable delay before replication can begin. In the
> case of some large sites, perhaps as long as 18-24 hrs.

Agreed. In a very busy system, if those operations are performed serially,
we might never be able to start streaming. I mean that WAL might be
generated faster than it can be copied.
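
The catch-up condition can be sketched as simple arithmetic, assuming
constant rates (a simplification) and a hypothetical helper name: the
backlog only shrinks when WAL is sent faster than it is generated.

```python
from typing import Optional


def catch_up_seconds(backlog_bytes: float,
                     gen_rate: float,
                     send_rate: float) -> Optional[float]:
    """Seconds until the standby has drained the WAL backlog.

    Rates are in bytes per second. Returns None when the primary
    generates WAL at least as fast as it can be sent, in which case
    the standby never catches up and streaming never starts.
    """
    if send_rate <= gen_rate:
        return None
    return backlog_bytes / (send_rate - gen_rate)
```

For example, a 1000-byte backlog with generation at 10 bytes/s and sending
at 20 bytes/s drains in 100 seconds; at equal rates it never drains.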

>
>> ReplicatedUpTo <end>
>> Acknowledge that all WAL up to <end> has been successfully received and
>> written to disk and/or fsync'd (depending on the replication mode in
>> use). The primary can use this information to acknowledge a transaction
>> as committed to the client in case of synchronous replication.
>
> +1

Yes.

>
>> Primary -> standby
>>
>> WALRange <begin> <end> <data>
>> Response to RequestWAL or StartReplication message. After receiving a
>> StartReplication message, the primary can send these messages when it
>> feels like it. In synchronous mode, that would be at least at each
>> commit. The standby should respond with a ReplicatedUpTo message to each
>> WALRange message.
>
> +1

Currently, <begin> is not sent because it can be calculated from <end> and
the data length. This reduces network traffic to some degree.
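
As a rough illustration, assuming WAL positions are 8-byte offsets and a
payload layout of <end> followed by the raw data (both are assumptions, and
the function names are hypothetical, not taken from the patch), the receiver
can recover <begin> like this:

```python
import struct
from typing import Tuple


def encode_wal_range(end: int, data: bytes) -> bytes:
    """Payload without <begin>: 8-byte end position, then the raw WAL bytes."""
    return struct.pack("!Q", end) + data


def decode_wal_range(payload: bytes) -> Tuple[int, int, bytes]:
    """Return (begin, end, data), deriving begin from end and the data length."""
    (end,) = struct.unpack("!Q", payload[:8])
    data = payload[8:]
    begin = end - len(data)
    return begin, end, data
```

Under these assumptions each WALRange message is 8 bytes shorter than one
that carries <begin> explicitly.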

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-29 11:02:04
Message-ID: 4958AE2C.7010601@enterprisedb.com

Simon Riggs wrote:
> On Tue, 2008-12-23 at 18:23 +0200, Heikki Linnakangas wrote:
>> (later) OldestXmin <xid>
>> When a hot standby server is running read-only queries,
>> indicates the
>> current OldestXmin on the standby. The primary can refrain from
>> vacuuming tuples still required by the slave using this value, if so
>> configured.
>
> This is all reading like you are relaying someone else's thoughts, or
> that of a committee.

No, I can assure you all the confusing words are from my head only :-).

> The above is the exact opposite of your position on 11 Sep, where you
> said having a matching xmin between primary and standby "makes an awful
> solution for high availability" which Richard, Greg, Robert at least
> agreed explicitly with.

It does, for high availability. There are other use cases where it might
be desired (spreading load of read-only queries across servers). And a
softer version where the master only respects the slave's OldestXmin up
to a point is a good compromise for high availability setups too.
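
One possible reading of "up to a point", sketched under loud assumptions:
the limit is expressed as a maximum xid age (it could equally be a time
limit), xids are compared as plain integers with no wraparound handling,
and the function and parameter names are hypothetical.

```python
def effective_oldest_xmin(primary_xmin: int,
                          standby_xmin: int,
                          max_lag_xids: int) -> int:
    """Oldest xmin the primary honours when vacuuming.

    The standby's xmin may hold the primary back, but never by more
    than max_lag_xids transactions, and never forward past the
    primary's own xmin.
    """
    floor = primary_xmin - max_lag_xids
    return min(primary_xmin, max(standby_xmin, floor))
```

With primary_xmin=1000 and max_lag_xids=50, a standby xmin of 990 is
honoured as is, while a standby xmin of 900 is clamped to 950.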

I haven't seen any one-size-fits-all solution to this issue, so we have
to cater for many. Note that I proposed this exact scheme, where the
slave sends its OldestXmin to the master, at the bottom of that same email.

>> That will ensure that the standby doesn't need to stall WAL
>> application because of read-only queries.
>
> It doesn't need to. That is already optional.

Oh right. I should've added, "without having to kill queries".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-29 22:48:54
Message-ID: 1230590935.7284.2.camel@huvostro

On Mon, 2008-12-29 at 13:02 +0200, Heikki Linnakangas wrote:

> >> That will ensure that the standby doesn't need to stall WAL
> >> application because of read-only queries.
> >
> > It doesn't need to. That is already optional.
>
> Oh right. I should've added, "without having to kill queries".

Even killing queries is optional, though it will need help from an external
filesystem-level snapshot feature.

--------------
Hannu


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-30 10:42:45
Message-ID: 1230633765.4793.1239.camel@ebony.2ndQuadrant


On Mon, 2008-12-29 at 13:02 +0200, Heikki Linnakangas wrote:

> I haven't seen any one-size-fits-all solution to this issue, so we
> have to cater for many.

Very much agree. I've had the chance to speak to many people about the
way they would like this to work and there is definitely no consensus
from those users. So a variety of approaches is appropriate.

> Note that I proposed this exact scheme, where the
> slave sends its OldestXmin to the master, at the bottom of that same
> email.

Anyway, as long as it is optional, I see no problem in including it,
since we have other mechanisms to choose from and nobody is forced to
use this.

The design/implementation for this is fairly easy, I think.

The difficulty is arriving at an easy-to-use control mechanism that is
also secure.

The options for handling a conflict are these:
1. Ignore the conflict (and allow silent wrong answers)
2. Allow the conflicting query to progress until it sees changed data
3. Cancel the query
4. Prevent applying WAL
5. Feed OldestXmin back to primary to prevent conflicting WAL

The current mechanism is (4) for up to max_standby_delay, then (3).
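
The current behaviour can be sketched as a small decision function. This
is only an illustration: how the wait is measured and the names used here
are assumptions, not the patch's code.

```python
from enum import Enum


class Action(Enum):
    DELAY_WAL_APPLY = 4   # option (4): hold back WAL application
    CANCEL_QUERY = 3      # option (3): cancel the conflicting query


def resolve_conflict(waited_seconds: float,
                     max_standby_delay: float) -> Action:
    """Delay WAL apply until max_standby_delay is used up, then cancel."""
    if waited_seconds < max_standby_delay:
        return Action.DELAY_WAL_APPLY
    return Action.CANCEL_QUERY
```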

(4) and (5) are both system wide effects: (4) system wide effect on the
standby and (5) is a system wide effect on primary. In both of those
cases that option should be super-user only controlled. I would be
unhappy to think that a normal standby user could create
difficult-to-diagnose problems on primary.

So I see a problem in making (5) optional and super-user controlled.

One way around this is to have the option turn on|off via a function,
which can then be granted to other users.

That for me is beginning to sound fairly ugly: difficult to understand
and difficult to use. But I see some people might want that in certain
circumstances. So I guess we should build it. Any good ideas for the
control mechanism?

I now think we should provide (2) as well, in addition to this.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-30 12:40:46
Message-ID: 495A16CE.2020902@enterprisedb.com

Simon Riggs wrote:
> The difficulty is arriving at an easy-to-use control mechanism that is
> also secure.
>
> The options for handling a conflict are these:
> 1. Ignore the conflict (and allow silent wrong answers)
> 2. Allow the conflicting query to progress until it sees changed data
> 3. Cancel the query
> 4. Prevent applying WAL
> 5. Feed OldestXmin back to primary to prevent conflicting WAL
>
> The current mechanism is (4) for up to max_standby_delay, then (3).
>
> (4) and (5) are both system wide effects: (4) system wide effect on the
> standby and (5) is a system wide effect on primary. In both of those
> cases that option should be super-user only controlled. I would be
> unhappy to think that a normal standby user could create
> difficult-to-diagnose problems on primary.
>
> So I see a problem in making (5) optional and super-user controlled.
>
> One way around this is to have the option turn on|off via a function,
> which can then be granted to other users.
>
> That for me is beginning to sound fairly ugly: difficult to understand
> and difficult to use. But I see some people might want that in certain
> circumstances. So I guess we should build it. Any good ideas for the
> control mechanism?

Using functions seems overly complicated. Since xids are system-wide, I
don't see much value in specifying them at any finer level, or in
allowing them for non-superusers. GUC seems like the natural choice.

I think the options you have in the patch now, with max_standby_delay to
control them, are enough for this release.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>
Subject: Re: Synchronous replication, network protocol
Date: 2008-12-30 13:54:43
Message-ID: 1230645283.4793.1284.camel@ebony.2ndQuadrant


On Tue, 2008-12-30 at 14:40 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> >
> > That for me is beginning to sound fairly ugly: difficult to understand
> > and difficult to use. But I see some people might want that in certain
> > circumstances. So I guess we should build it. Any good ideas for the
> > control mechanism?
>
> Using functions seems overly complicated.

We agree on that.

> Since xids are system-wide, I
> don't see much value in specifying them at any finer level, or in
> allowing them for non-superusers. GUC seems like the natural choice.

Well, GUCs have security implications that I'm not happy about. I will
relent if you will vouch for that decision.

"standby_xmin_on_primary"

(boolean) - a USERSET GUC that only has meaning during standby query
execution. <name> specifies whether the current standby session's xmin
is included in the calculation of OldestXmin on the *primary* node. If
this parameter is true then the standby query will never be cancelled
because of conflicts between the activity of the primary and standby
(see discussion in chapter XXXX). The downside of using this parameter
is that standby queries can cause table bloat on the primary (see
chapter Data Maintenance for more detail).

"standby_xmin_on_primary" - new name sought. I think it should begin
with "standby_" to remind us that it only affects standby query
processing.

Implementation:

WALReceiver will send a message back to WALSender. WALSender will update a
single 4-byte value, RemoteXmin, that is read during GetSnapshotData().
Updating the value will not take a lock, just as xid is not locked when
setting a new value.

We add a boolean to each proc: SendRemoteXmin. When we run
GetSnapshotData() if our own proc has SendRemoteXmin set then we
calculate RemoteXmin from the minimum of any proc with SendRemoteXmin
set. When we release our snapshot we re-calculate RemoteXmin so that the
primary node suffers as little delay as possible in receiving updates to
xmin.
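
A rough sketch of the calculation described above, with loud assumptions:
procs are modelled as plain objects, xids are compared as ordinary integers
(no wraparound handling), and all names other than RemoteXmin and
SendRemoteXmin are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proc:
    xmin: Optional[int]        # None when the backend holds no snapshot
    send_remote_xmin: bool     # the proposed per-proc SendRemoteXmin flag


def compute_remote_xmin(procs: List[Proc]) -> Optional[int]:
    """Standby side: minimum xmin over procs that opted in, or None."""
    xmins = [p.xmin for p in procs
             if p.send_remote_xmin and p.xmin is not None]
    return min(xmins) if xmins else None


def primary_oldest_xmin(local_xmins: List[int],
                        remote_xmin: Optional[int]) -> int:
    """Primary side: fold RemoteXmin into the OldestXmin calculation.

    local_xmins must be non-empty in this sketch.
    """
    candidates = list(local_xmins)
    if remote_xmin is not None:
        candidates.append(remote_xmin)
    return min(candidates)
```

Recomputing on snapshot release, as described, would mean calling
compute_remote_xmin again after clearing the releasing proc's xmin.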

I'll begin work on this once sync rep is committed. It's about 3-5 days'
work, but there's no point in writing it yet because the sand will shift
underneath it too much in the next few weeks.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support