Re: write ahead logging in standby (streaming replication)

Lists: pgsql-hackers
From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: write ahead logging in standby (streaming replication)
Date: 2009-11-12 02:31:41
Message-ID: 3f0b79eb0911111831i2e053eeaif2d00d4d52d313e1@mail.gmail.com

Hi,

Should the standby also have to follow the WAL rule during recovery?
The current patch doesn't care about the write order of the data page
and WAL in the standby. So, after both servers fail, restarting the
ex-standby by itself might corrupt the data.

If the standby follows the WAL rule, walreceiver might be delayed in
writing WAL records until the startup process's or bgwriter's fsync
has finished. I'm a bit concerned that such a delay might
increase the performance overhead on the primary.

Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 03:03:36
Message-ID: 14163.1257995016@sss.pgh.pa.us

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> Should the standby also have to follow the WAL rule during recovery?
> The current patch doesn't care about the write order of the data page
> and WAL in the standby. So, after both servers fail, restarting the
> ex-standby by itself might corrupt the data.

Surely the receiver should fsync the WAL itself to disk before
acknowledging it. Assuming you've done that, I don't see any
corruption risk.

regards, tom lane


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 04:31:23
Message-ID: 3f0b79eb0911112031w6bf54d80r89957de9f90c6487@mail.gmail.com

On Thu, Nov 12, 2009 at 12:03 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
>> Should the standby also have to follow the WAL rule during recovery?
>> The current patch doesn't care about the write order of the data page
>> and WAL in the standby. So, after both servers fail, restarting the
>> ex-standby by itself might corrupt the data.
>
> Surely the receiver should fsync the WAL itself to disk before
> acknowledging it.  Assuming you've done that, I don't see any
> corruption risk.

"acknowledging it" means "letting the startup process know the arrival
of WAL records"? If so, I agree that there is no risk of data corruption.

The problem is that fsync needs to be issued too frequently, which would
be harmless in asynchronous replication, but not in synchronous one.
A transaction would have to wait for the primary's and standby's fsync
before returning a "success" to a client.

So I'm inclined to change the startup process and bgwriter, instead of
walreceiver, so as to fsync the WAL for the WAL rule.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 07:32:05
Message-ID: 4AFBB9F5.4080503@enterprisedb.com

Fujii Masao wrote:
> The problem is that fsync needs to be issued too frequently, which would
> be harmless in asynchronous replication, but not in synchronous one.
> A transaction would have to wait for the primary's and standby's fsync
> before returning a "success" to a client.
>
> So I'm inclined to change the startup process and bgwriter, instead of
> walreceiver, so as to fsync the WAL for the WAL rule.

Let's keep it simple for now. Just make the walreceiver do the fsync. We
can optimize later. For now, we're only going to have async mode anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 08:03:37
Message-ID: 3f0b79eb0911120003o1b0aadddg90db49eba55bb79c@mail.gmail.com

Hi,

On Thu, Nov 12, 2009 at 4:32 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Fujii Masao wrote:
>> The problem is that fsync needs to be issued too frequently, which would
>> be harmless in asynchronous replication, but not in synchronous one.
>> A transaction would have to wait for the primary's and standby's fsync
>> before returning a "success" to a client.
>>
>> So I'm inclined to change the startup process and bgwriter, instead of
>> walreceiver, so as to fsync the WAL for the WAL rule.
>
> Let's keep it simple for now. Just make the walreceiver do the fsync. We
> can optimize later. For now, we're only going to have async mode anyway.

Okay, I'll do that: the walreceiver issues an fsync each time WAL records
arrive, and the startup process replays only the records that have already
been fsynced.
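
A minimal sketch of that plan, assuming a toy in-memory model of the
standby (the class, its counters, and the method names are hypothetical,
not PostgreSQL APIs). It only illustrates that replay is bounded by the
fsynced position rather than by the written position:

```python
class StandbyWal:
    """Toy model of the standby's WAL positions; all names are hypothetical."""

    def __init__(self):
        self.records = []       # WAL records in arrival order
        self.written_upto = 0   # number of records written (not yet durable)
        self.flushed_upto = 0   # number of records known to be fsynced
        self.replayed_upto = 0  # number of records already replayed

    def write(self, record):
        # Stands in for write(): the record is in the file but not durable.
        self.records.append(record)
        self.written_upto = len(self.records)

    def fsync(self):
        # Stands in for fsync(): everything written so far becomes durable.
        self.flushed_upto = self.written_upto

    def receive(self, record):
        # The simple plan: the walreceiver fsyncs on every arrival.
        self.write(record)
        self.fsync()

    def replay_available(self):
        # The startup process replays only records that are already fsynced.
        batch = self.records[self.replayed_upto:self.flushed_upto]
        self.replayed_upto = self.flushed_upto
        return batch
```

Calling write() alone leaves the record invisible to replay_available();
it becomes replayable only after the next fsync().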

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 09:27:22
Message-ID: 1258018042.14054.103.camel@ebony

On Thu, 2009-11-12 at 17:03 +0900, Fujii Masao wrote:

> On Thu, Nov 12, 2009 at 4:32 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> > Fujii Masao wrote:
> >> The problem is that fsync needs to be issued too frequently, which would
> >> be harmless in asynchronous replication, but not in synchronous one.
> >> A transaction would have to wait for the primary's and standby's fsync
> >> before returning a "success" to a client.
> >>
> >> So I'm inclined to change the startup process and bgwriter, instead of
> >> walreceiver, so as to fsync the WAL for the WAL rule.
> >
> > Let's keep it simple for now. Just make the walreceiver do the fsync. We
> > can optimize later. For now, we're only going to have async mode anyway.
>
> Okay, I'll do that: the walreceiver issues an fsync each time WAL records
> arrive, and the startup process replays only the records that have already
> been fsynced.

I agree with you, though it has taken some time to understand what you
said and at first my reaction was to disagree. I think the responses you
got on this are because you dived straight in with a question before
explaining other things around this.

We already have a number of options for how to handle incoming WAL. We
can choose to fsync or not when WAL arrives. Choosing *not* to fsync
would be the typical choice because it provides reasonable performance;
fsyncing after each transaction commit would be worse. In any case, if
WAL receiver does the fsyncs then we will get worse performance. If we
reduce the number of fsyncs it does we just get spiky behaviour around
the fsyncs.

If recovery starts reading WAL records that have not been fsynced then
we may need to flush a shared buffer to disk that depends upon a
not-yet-fsynced WAL record. Fsyncing WAL after *every* WAL record is
going to make performance suck even worse and is completely out of the
question. So implementing the fsync-WAL-before-buffer-flush rule during
recovery makes much more sense. It's also only a small change in
XLogFlush().

Another way of doing this would be to only allow recovery to progress as
far as has been fsynced. That seems a more plausible approach, but would
lead to delays if we had a small number of long write transactions. The
benefit of streaming is that it potentially allows us to keep as near to
real-time recovery as possible.

So overall, yes, we need to do as you suggested: implement the WAL rule
in recovery. The WAL receiver smoothly does write(), the startup process
replays, and we leave the WAL file fsyncs to be performed by the bgwriter.
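
As a rough illustration of the fsync-WAL-before-buffer-flush rule, here
is a sketch under stated assumptions: LSNs are plain integers, and
RecoveryState with its methods is a hypothetical stand-in, not the real
XLogFlush() logic. The point is that the WAL fsync happens only when a
page about to be flushed depends on WAL that is not yet durable:

```python
class RecoveryState:
    """Hypothetical model of WAL durability tracking during recovery."""

    def __init__(self):
        self.wal_written_lsn = 0  # advanced by the receiver's write()
        self.wal_flushed_lsn = 0  # advanced only when the WAL file is fsynced
        self.fsync_count = 0      # how many WAL fsyncs were actually issued

    def wal_write(self, lsn):
        # The receiver writes WAL without fsyncing it.
        self.wal_written_lsn = max(self.wal_written_lsn, lsn)

    def wal_flush(self, upto_lsn):
        # Fsync the WAL only if the requested LSN is not yet durable.
        if upto_lsn > self.wal_flushed_lsn:
            self.wal_flushed_lsn = self.wal_written_lsn
            self.fsync_count += 1

    def flush_buffer(self, page_lsn):
        # WAL rule: the WAL covering the page's last change must be durable
        # before the data page itself goes to disk.
        self.wal_flush(page_lsn)
        return page_lsn <= self.wal_flushed_lsn
```

In this model, flushing a second page whose LSN is already covered by an
earlier WAL fsync costs no additional fsync.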

But I also agree with Heikki. Let's plan to do this later in this
release.

--
Simon Riggs www.2ndQuadrant.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 12:45:35
Message-ID: 3f0b79eb0911120445h6bf69c4dlbf31e3b39ca2c36a@mail.gmail.com

On Thu, Nov 12, 2009 at 6:27 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I agree with you, though it has taken some time to understand what you
> said and at first my reaction was to disagree. I think the responses you
> got on this are because you dived straight in with a question before
> explaining other things around this.

Thanks for clarifying this topic ;)

> If recovery starts reading WAL records that have not been fsynced then
> we may need to flush a shared buffer to disk that depends upon a
> not-yet-fsynced WAL record. Fsyncing WAL after *every* WAL record is
> going to make performance suck even worse and is completely out of the
> question. So implementing the fsync-WAL-before-buffer-flush rule during
> recovery makes much more sense. It's also only a small change in
> XLogFlush().

Agreed. This approach has less impact on performance.

But, as I said in my first post on this thread, even such an infrequent
fsync-WAL-before-buffer-flush might cause a response time spike on the
primary because the walreceiver must sleep during that fsync. I think
that leaving the WAL-logging business to another process like walwriter
is a good idea for further reducing the impact on the walreceiver. In
the typical case:

* The walreceiver receives WAL records, returns the ACK to the primary,
saves them in the wal_buffers, and lets the startup process know
the arrival.

* The walwriter writes and fsyncs the WAL records in the wal_buffers.

* The startup process applies the WAL records in the wal_buffers
when it receives the notice of the arrival.

* The startup process and bgwriter fsync the WAL before the buffer
flush.

Of course, since this approach is too complicated, it's out of the scope
of the development for v8.5.
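
The division of labour in the list above can be sketched as follows.
This is only a toy, single-threaded model with hypothetical names
(StandbyPipeline and its methods are assumptions, and record positions
are simple counts); it shows the ACK and the replay running ahead of the
WAL fsync, with the WAL rule enforced at buffer-flush time:

```python
class StandbyPipeline:
    """Hypothetical model of the proposed walreceiver/walwriter split."""

    def __init__(self):
        self.wal_buffers = []  # in-memory WAL records, in arrival order
        self.acked = 0         # records acknowledged to the primary
        self.flushed = 0       # records written and fsynced to disk
        self.applied = 0       # records replayed by the startup process

    def walreceiver_receive(self, record):
        # Save the record in wal_buffers and ACK without waiting for fsync.
        self.wal_buffers.append(record)
        self.acked = len(self.wal_buffers)
        return self.acked

    def fsync_wal(self):
        # Done by the walwriter in the background, or forced by the WAL rule.
        self.flushed = len(self.wal_buffers)

    def startup_apply(self):
        # Replay straight from wal_buffers; may run ahead of self.flushed.
        self.applied = len(self.wal_buffers)

    def flush_data_page(self, last_record_index):
        # WAL rule: fsync the WAL first if the page depends on unflushed WAL.
        if last_record_index > self.flushed:
            self.fsync_wal()
        return self.flushed >= last_record_index
```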

> But I also agree with Heikki. Let's plan to do this later in this
> release.

Okay. I won't implement anything around this topic until the core part of
asynchronous replication has been committed.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 12:53:12
Message-ID: 1258030392.14054.189.camel@ebony

On Thu, 2009-11-12 at 21:45 +0900, Fujii Masao wrote:

> But, as I said in my first post on this thread, even such an infrequent
> fsync-WAL-before-buffer-flush might cause a response time spike on the
> primary because the walreceiver must sleep during that fsync. I think
> that leaving the WAL-logging business to another process like walwriter
> is a good idea for further reducing the impact on the walreceiver. In
> the typical case:

Agree completely.

> Of course, since this approach is too complicated, it's out of the scope
> of the development for v8.5.

It's out of scope for phase 1, certainly.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 14:52:24
Message-ID: 24069.1258037544@sss.pgh.pa.us

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> The problem is that fsync needs to be issued too frequently, which would
> be harmless in asynchronous replication, but not in synchronous one.
> A transaction would have to wait for the primary's and standby's fsync
> before returning a "success" to a client.

Surely that is exactly what is *required* if the user has asked for
synchronous replication.

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-12 16:49:18
Message-ID: 4AFC3C8E.1070505@2ndquadrant.com

Tom Lane wrote:
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
>
>> The problem is that fsync needs to be issued too frequently, which would
>> be harmless in asynchronous replication, but not in synchronous one.
>> A transaction would have to wait for the primary's and standby's fsync
>> before returning a "success" to a client.
>>
>
> Surely that is exactly what is *required* if the user has asked for
> synchronous replication.
>
This is a distressingly common thing people get wrong about replication.
You can either have synchronous replication, which as you say has to be
slow: you must wait for an fsync ACK from the secondary and a return
trip before you can say something is committed on the primary. Or you
can get better performance by not waiting for all of those things, but
the minute you do that it's *not* synchronous replication anymore. You
can't get high-performance and true synchronous behavior; you have to
pick one. The best you can do if you need both is work on accelerating
fsync everywhere using the standard battery-backed write cache technique.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 01:52:34
Message-ID: 3f0b79eb0911121752n1cf2e44n18452b666a09d55e@mail.gmail.com

On Fri, Nov 13, 2009 at 1:49 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> This is a distressingly common thing people get wrong about replication.  You
> can either have synchronous replication, which as you say has to be slow:
> you must wait for an fsync ACK from the secondary and a return trip before
> you can say something is committed on the primary.  Or you can get better
> performance by not waiting for all of those things, but the minute you do
> that it's *not* synchronous replication anymore.  You can't get
> high-performance and true synchronous behavior; you have to pick one.  The
> best you can do if you need both is work on accelerating fsync everywhere
> using the standard battery-backed write cache technique.

I'm not happy that such frequent fsyncs would harm even semi-synchronous
replication, not just the synchronous one. By semi-synchronous I mean that
you must wait for a *recv* ACK from the secondary and a return trip before
you can say something is committed on the primary; this corresponds to
DRBD's protocol B. Personally, I think that semi-synchronous replication
is sufficient for HA.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 01:58:19
Message-ID: 20091113015819.GE17573@oak.highrise.ca

* Fujii Masao <masao(dot)fujii(at)gmail(dot)com> [091112 20:52]:

> Personally, I think that
> semi-synchronous replication is sufficient for HA.

Often, but that's not synchronous replication so don't call it such...

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 02:15:05
Message-ID: 4AFCC129.1010908@2ndquadrant.com

Fujii Masao wrote:
> Personally, I think that semi-synchronous replication is sufficient for HA.
>
Whether or not you think it's sufficient for what you have in mind,
"synchronous replication" requires a return ACK from the secondary
before you say things are committed on the primary. If you don't do
that, it's not true sync replication anymore; it's asynchronous
replication. Plenty of people decide that a local commit combined with
a promise to synchronize as soon as possible to the slave is good enough
for their apps, which as you say is getting referred to as
"semi-synchronous replication" nowadays. That's an awful name though,
because it's not true--that's asynchronous replication, just aiming for
minimal lag. It's OK to say that's what you want, but you can't say
it's really a synchronous commit anymore if you do things that way.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 02:20:16
Message-ID: 3f0b79eb0911121820p2eae030aq9a8a638c7ade231a@mail.gmail.com

On Fri, Nov 13, 2009 at 10:58 AM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> * Fujii Masao <masao(dot)fujii(at)gmail(dot)com> [091112 20:52]:
>
>> Personally, I think that
>> semi-synchronous replication is sufficient for HA.
>
> Often, but that's not synchronous replication so don't call it such...

Hmm, I'm not sure about your definition of "synchronous". If the
primary waits for a *redo* ACK from the standby before returning
"success" for a transaction to a client, can you call SR synchronous?

This is one of the TODO items of SR for v8.5.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 02:37:24
Message-ID: 3f0b79eb0911121837s7cabdb29j5a76b62530ead3d2@mail.gmail.com

On Fri, Nov 13, 2009 at 11:15 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Whether or not you think it's sufficient for what you have in mind,
> "synchronous replication" requires a return ACK from the secondary before
> you say things are committed on the primary.  If you don't do that, it's not
> true sync replication anymore; it's asynchronous replication.  Plenty of
> people decide that a local commit combined with a promise to synchronize as
> soon as possible to the slave is good enough for their apps, which as you
> say is getting referred to as "semi-synchronous replication" nowadays.
>  That's an awful name though, because it's not true--that's asynchronous
> replication, just aiming for minimal lag.  It's OK to say that's what you
> want, but you can't say it's really a synchronous commit anymore if you do
> things that way.

Umm... what is your definition of "synchronous"? I'm planning to provide
the following four synchronization modes for v8.5. Does this match your
thinking?

The primary waits ... before returning "success" of a transaction;
* nothing - asynchronous replication
* recv ACK - semi-synchronous replication
* fsync ACK - semi-synchronous replication
* redo ACK - synchronous replication

Or, in synchronous replication, must we wait for both an fsync and a redo ACK?
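
To make the four modes concrete, here is a small sketch. The enum,
the function, and the shape of the progress report are all hypothetical;
it only assumes the standby reports how far it has received, fsynced,
and replayed WAL as integer LSNs:

```python
from enum import Enum


class WaitFor(Enum):
    """What the primary waits for before returning "success" (hypothetical)."""
    NOTHING = "asynchronous replication"
    RECV = "semi-synchronous replication (recv ACK)"
    FSYNC = "semi-synchronous replication (fsync ACK)"
    REDO = "synchronous replication (redo ACK)"


def commit_can_return(mode, standby_progress, commit_lsn):
    """Return True once the standby has progressed far enough for `mode`.

    standby_progress is an assumed dict with 'recv', 'fsync' and 'redo'
    LSNs as last reported by the standby.
    """
    if mode is WaitFor.NOTHING:
        return True
    key = {WaitFor.RECV: "recv", WaitFor.FSYNC: "fsync", WaitFor.REDO: "redo"}[mode]
    return standby_progress[key] >= commit_lsn
```

For example, with a standby that has received up to LSN 300 but fsynced
only up to 200, a commit at LSN 250 can return under the recv-ACK mode
but not under the fsync-ACK or redo-ACK modes.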

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 02:54:31
Message-ID: 407d949e0911121854ob2a7b5v92dd983f23b65cc@mail.gmail.com

On Fri, Nov 13, 2009 at 2:37 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Umm... what is your definition of "synchronous"? I'm planning to provide
> four synchronization modes as follows, for v8.5. Does this fit in your

I think my definition would be that a query against the replica will
produce the same result as a query against the master -- and that that
will be the case even after a system failure. That might not
necessarily mean that the log entry is fsynced on the replica, only
that it's fsynced in a location where the replica will have access to
it when it runs recovery.

I do have a different question though. What do you plan to do if
there's a failure when they're out of sync? The master hasn't
responded to the commit yet because it's still waiting on the replica
to respond but it has already recorded the commit itself. When it
comes back up it's out of sync with the replica and has to resend
those records? What if the replica has already received it and it was
the confirmation which was lost?

--
greg


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 04:49:06
Message-ID: 4AFCE542.4040104@2ndquadrant.com

Fujii Masao wrote:
> Umm... what is your definition of "synchronous"? I'm planning to provide
> four synchronization modes as follows, for v8.5. Does this fit in your
> thought?
>
> The primary waits ... before returning "success" of a transaction;
> * nothing - asynchronous replication
> * recv ACK - semi-synchronous replication
> * fsync ACK - semi-synchronous replication
> * redo ACK - synchronous replication
>
> Or, in synchronous replication, we must wait a fsync and a redo ACK?
>
Right, those are the possibilities, all four of them have valid use
cases in the field and are worth implementing. I don't like the label
"semi-synchronous replication" myself, but it's a valuable feature to
implement, and that is unfortunately the term other parts of the
industry use for that approach.

But everyone needs to be extremely careful with the terminology here:
if you say "synchronous replication", that *only* means what you're
labeling "redo ACK" ("WAL ACK" really). "Synchronous replication"
should not be used as a group term that includes the semi-synchronous
variations, which are in fact asynchronous despite their marketing
name. If someone means semi-synchronous, but they say synchronous
thinking it's a shared term also applicable to the semi-synchronous
variations here, that's just going to be confusing for everyone.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 05:19:43
Message-ID: 3f0b79eb0911122119m7a3c1cl1c1564c69794266f@mail.gmail.com

On Fri, Nov 13, 2009 at 11:54 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> I think my definition would be that a query against the replica will
> produce the same result as a query against the master -- and that that
> will be the case even after a system failure. That might not
> necessarily mean that the log entry is fsynced on the replica, only
> that it's fsynced in a location where the replica will have access to
> it when it runs recovery.

Agreed.

> I do have a different question though. What do you plan to do if
> there's a failure when they're out of sync? The master hasn't
> responded to the commit yet because it's still waiting on the replica
> to respond but it has already recorded the commit itself. When it
> comes back up it's out of sync with the replica and has to resend
> those records? What if the replica has already received it and it was
> the confirmation which was lost?

If the connection is not closed, resending is not required because
TCP would guarantee that such records eventually arrive at the standby.

Otherwise, the standby re-connects to the primary, and asks for the
missing records, so the resending would be done. Since only the missing
records are requested, the already received records don't reach the
standby again, I think.
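
A minimal sketch of the reconnect case, assuming the standby reports the
last LSN it holds and the primary sends only what lies beyond it (the
function name and the list-of-tuples WAL representation are hypothetical):

```python
def records_to_resend(primary_wal, standby_last_lsn):
    """Return the records the standby is missing after a reconnect.

    primary_wal is an assumed list of (lsn, payload) tuples in ascending
    LSN order; standby_last_lsn is the last LSN the standby already has.
    """
    return [(lsn, payload) for lsn, payload in primary_wal
            if lsn > standby_last_lsn]
```

If only the confirmation was lost, the standby already holds the record,
so its reported position covers it and the record is not sent again.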

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 05:29:30
Message-ID: 3f0b79eb0911122129h55624e40y53417dd348d8f898@mail.gmail.com

On Fri, Nov 13, 2009 at 1:49 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Right, those are the possibilities, all four of them have valid use cases in
> the field and are worth implementing.  I don't like the label
> "semi-synchronous replication" myself, but it's a valuable feature to
> implement, and that is unfortunately the term other parts of the industry
> use for that approach.

BTW, MySQL and DRBD use the term "semi-synchronous":
http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication
http://www.drbd.org/users-guide/s-replication-protocols.html

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Hodges <robert(dot)hodges(at)continuent(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 06:07:57
Message-ID: C72237BD.17536%robert.hodges@continuent.com

Hi Greg and Fujii,

Just a point on terminology: there's a difference in the usage of
semi-synchronous between DRBD and MySQL semi-synchronous replication, which
was originally developed by Google.

In the Google case semi-synchronous replication is a quorum algorithm where
clients receive a commit notification only after at least one of N slaves
has received the replication event. In the DRBD case semi-synchronous means
that events have reached the slave but are not necessarily durable. There's
no quorum.

Of these two usages the Google semi-sync approach is the more interesting
because it avoids the availability problems associated with fully
synchronous operation but gets most of the durability benefits.
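
As a rough sketch of the quorum idea (the function and its arguments are
hypothetical, not MySQL's or Google's actual interface), the commit
notification is released once at least `quorum` of the N slaves report
having received the event:

```python
def commit_acknowledged(acks, quorum=1):
    """Return True once enough slaves have received the replication event.

    acks is an assumed mapping of slave name -> bool (event received?).
    quorum=1 models "at least one of N slaves".
    """
    received = sum(1 for got_it in acks.values() if got_it)
    return received >= quorum
```

With quorum=1, a single slow or failed slave does not block commits as
long as some other slave has the event, which is the availability
benefit described above.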

Cheers, Robert

On 11/12/09 9:29 PM PST, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Fri, Nov 13, 2009 at 1:49 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>> Right, those are the possibilities, all four of them have valid use cases in
>> the field and are worth implementing.  I don't like the label
>> "semi-synchronous replication" myself, but it's a valuable feature to
>> implement, and that is unfortunately the term other parts of the industry
>> use for that approach.
>
> BTW, MySQL and DRBD use the term "semi-synchronous":
> http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication
> http://www.drbd.org/users-guide/s-replication-protocols.html
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 06:17:48
Message-ID: 4AFCFA0C.7010703@2ndquadrant.com

Fujii Masao wrote:
> On Fri, Nov 13, 2009 at 1:49 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>
>> Right, those are the possibilities, all four of them have valid use cases in
>> the field and are worth implementing. I don't like the label
>> "semi-synchronous replication" myself, but it's a valuable feature to
>> implement, and that is unfortunately the term other parts of the industry
>> use for that approach.
>>
>
> BTW, MySQL and DRBD use the term "semi-synchronous":
> http://forge.mysql.com/wiki/ReplicationFeatures/SemiSyncReplication
> http://www.drbd.org/users-guide/s-replication-protocols.html
>
Yeah, that's the "other parts of the industry" I was referring to.
MySQL uses "semi-synchronous" to distinguish between its completely
asynchronous default replication mode and one where it provides a
somewhat safer implementation. The description reads more as
"asynchronous with some synchronous elements", not "one style of
synchronous implementation". None of their documentation wanders into
the problem area here by calling it a true synchronous solution when
it's really not--MySQL Cluster is their synchronous vehicle.

It's fine to adopt the term "semi-synchronous", as it's become quite
popular and people are going to label the PG implementation with it
regardless of what is settled on here. But we should all try to be
careful to use it as correctly as possible.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-13 09:18:15
Message-ID: 3f0b79eb0911130118x230a76cbo301fcfa1c0bd0f0@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 13, 2009 at 3:17 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Yeah, that's the "other parts of the industry" I was referring to.  MySQL
> uses "semi-synchronous" to distinguish between its completely asynchronous
> default replication mode and one where it provides a somewhat safer
> implementation.  The description reads more as "asynchronous with some
> synchronous elements", not "one style of synchronous implementation".  None
> of their documentation wanders into the problem area here by calling it a
> true synchronous solution when it's really not--MySQL Cluster is their
> synchronous vehicle.
> It's fine to adopt the term "semi-synchronous", as it's become quite popular
> and people are going to label the PG implementation with it regardless of
> what is settled on here.  But we should all try to be careful to use it as
> correctly as possible.

OK. Let's think over what "recv ACK" and "fsync ACK"
synchronization modes should be called later.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-16 18:19:58
Message-ID: 4B0197CE.50006@bluegap.ch
Lists: pgsql-hackers

Hi,

Greg Stark wrote:
> I think my definition would be that a query against the replica will
> produce the same result as a query against the master -- and that that
> will be the case even after a system failure. That might not
> necessarily mean that the log entry is fsynced on the replica, only
> that it's fsynced in a location where the replica will have access to
> it when it runs recovery.

I tend to agree with that definition of synchrony for replicated
databases. However, let me point to an earlier thread around the same
topic:
http://archives.postgresql.org/message-id/4942ECF7.5040601@bluegap.ch

You will definitely find different definitions and requirements of what
synchronous replication means there. It convinced me that "synchronous"
is more of a marketing term in this area and is better avoided in
technical documents and discussions, or needs explanation.

As far as marketing goes, there are the customers who absolutely want
synchronous replication for its consistency and then there are the
others who absolutely don't want it due to its unusably high latency.

Regards

Markus Wanner


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-16 20:57:10
Message-ID: 4B01BCA6.7050806@2ndquadrant.com
Lists: pgsql-hackers

Markus Wanner wrote:
> You will definitely find different definitions and requirements of what
> synchronous replication means there.
To quote from the Wikipedia entry on "Database Replication" that Simon
pointed to during the earlier discussion,
http://en.wikipedia.org/wiki/Database_replication

"Synchronous replication - guarantees "zero data loss" by the means of
atomic write operation, i.e. write either completes on both sides or not
at all. Write is not considered complete until acknowledgement by both
local and remote storage."

That last part is the critical one: "acknowledgement by both local and
remote storage" is required before you can label something truly
synchronous replication. In implementation terms, that means you must
have both local and slave fsync calls finish to be considered truly
synchronous. That part is not ambiguous at all.

There's a definition of the weaker form in there too, which is where the
ambiguity lies:

"Semi-synchronous replication - this usually means that a write is
considered complete as soon as local storage acknowledges it and a
remote server acknowledges that it has received the write either into
memory or to a dedicated log file."

I don't consider that really synchronous replication anymore, but as you
say it's been strengthened by marketing enough to be a valid industry
term at this point. Since it's already gained traction we might use it,
as long as it's defined properly and its trade-offs vs. a true
synchronous implementation are documented.
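
To make the difference between the two quoted definitions concrete, here is
a minimal sketch of when a write would count as complete under each one.
It is only an illustration of the definitions above, not code from any
patch; the names AckMode and write_complete, and the three boolean inputs,
are hypothetical.

```python
from enum import Enum


class AckMode(Enum):
    # Hypothetical labels for the modes discussed in this thread.
    ASYNC = "async"              # never wait for the remote side
    SEMI_SYNC = "recv ACK"       # remote has received the write
    SYNC = "fsync ACK"           # remote storage has acknowledged the write


def write_complete(mode, local_fsynced, remote_received, remote_fsynced):
    """Return True if the write may be reported as complete.

    Every mode requires local storage to have acknowledged the write.
    """
    if not local_fsynced:
        return False
    if mode is AckMode.ASYNC:
        return True
    if mode is AckMode.SEMI_SYNC:
        # Receipt into memory or a log file is enough; a remote fsync
        # implies receipt, so it also satisfies this mode.
        return remote_received or remote_fsynced
    # AckMode.SYNC: both local and remote storage must acknowledge.
    return remote_fsynced
```

With the write received but not yet fsynced on the remote side,
write_complete() is true for SEMI_SYNC and false for SYNC, which is exactly
the window in which the weaker form can still lose data.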

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com


From: "Markus Wanner" <markus(at)bluegap(dot)ch>
To: "Greg Smith" <greg(at)2ndquadrant(dot)com>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: write ahead logging in standby (streaming replication)
Date: 2009-11-17 07:31:12
Message-ID: 20091117083112.15974m8e992frpkw@mail.bluegap.ch
Lists: pgsql-hackers

Hi,

Quoting "Greg Smith" <greg(at)2ndquadrant(dot)com>:
> "Synchronous replication - guarantees "zero data loss" by the means
> of atomic write operation, i.e. write either completes on both sides
> or not at all. Write is not considered complete until
> acknowledgement by both local and remote storage."

Note that a storage acknowledgement (hopefully) guarantees durability, but
it does not necessarily mean that the transactional changes are
immediately visible on a remote node, which is what you had in your
definition.

My point is that there are at least three things that can run
synchronously or not, WRT distributed databases:

1. conflict detection and handling (for consistency)
2. storage acknowledgement (for durability)
3. effective application of changes (for visibility across nodes)

> That last part is the critical one: "acknowledgement by both local
> and remote storage" is required before you can label something truly
> synchronous replication. In implementation terms, that means you
> must have both local and slave fsync calls finish to be considered
> truly synchronous. That part is not ambiguous at all.

I personally agree 100%, given that it implies congruent conflict
handling *before* the disk write. (Having conflicting transactional
changes on the disk wouldn't help much at recovery time.)

(And yes, this means I think the effective application of changes can
be deferred. IMO the load balancer and/or the application should take
care not to send transactions from the same session to different nodes).

> "Semi-synchronous replication

...is plain nonsense to my ears. Either something is synchronous or it
is not. No half, no semi, no virtual synchrony. To have any technical
relevance, one needs to add *what* is synchronous and what is not.

In that spirit I have to admit that the term 'eager' that I'm
currently using to describe Postgres-R may not be any more helpful. I
take it to mean synchrony of 1. and 2., but not 3.
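
As a rough sketch of what "say *what* is synchronous" could look like, the
three items listed above can be treated as independent flags. The class
name SyncProfile, its field names, and the EAGER constant are hypothetical
and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncProfile:
    """Which of the three aspects run synchronously (hypothetical model)."""
    conflict_handling: bool   # 1. conflict detection and handling (consistency)
    storage_ack: bool         # 2. storage acknowledgement (durability)
    apply_changes: bool       # 3. effective application of changes (visibility)

    def describe(self) -> str:
        axes = [
            ("conflict handling", self.conflict_handling),
            ("storage acknowledgement", self.storage_ack),
            ("application of changes", self.apply_changes),
        ]
        sync = [name for name, on in axes if on]
        deferred = [name for name, on in axes if not on]
        return "synchronous: {}; deferred: {}".format(
            ", ".join(sync) or "none", ", ".join(deferred) or "none"
        )


# 'eager' as used above: synchrony of 1. and 2., but not 3.
EAGER = SyncProfile(conflict_handling=True, storage_ack=True, apply_changes=False)
```

Spelling a mode out this way (for instance via EAGER.describe()) names each
synchronous aspect explicitly instead of relying on a single adjective.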

Regards

Markus Wanner