Synchronous Standalone Master Redoux

From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Synchronous Standalone Master Redoux
Date: 2012-07-09 20:30:01
Message-ID: 4FFB3F49.4050108@optionshouse.com
Lists: pgsql-hackers

Hey everyone,

Upon doing some usability tests with PostgreSQL 9.1 recently, I ran
across this discussion:

http://archives.postgresql.org/pgsql-hackers/2011-12/msg01224.php

And after reading the entire thing, I found it odd that the overriding
pushback was because nobody could think of a use case. The argument was:
if you don't care if the slave dies, why not just use asynchronous
replication?

I'd like to introduce all of you to DRBD. DRBD is, for those who aren't
familiar, distributed (network) block-level replication. Right now, this
is what we're using, and will use in the future, to ensure a stable
synchronous PostgreSQL copy on our backup node. I was excited to read
about synchronous replication, because with it, came the possibility we
could have two readable nodes with the servers we already have. You
can't do that with DRBD; secondary nodes can't even mount the device.

So here's your use case:

1. Slave wants to be synchronous with master. Master wants replication
on at least one slave. They have this, and are happy.
2. For whatever reason, slave crashes or becomes unavailable.
3. Master notices no more slaves are available, and operates in
standalone mode, accumulating WAL files until a suitable slave appears.
4. Slave finishes rebooting/rebuilding/upgrading/whatever, and
re-subscribes to the feed.
5. Slave stays in degraded sync (asynchronous) mode until it is caught
up, and then switches to synchronous. This makes both master and slave
happy, because the *intent* of synchronous replication is fulfilled.

PostgreSQL's implementation means the master will block until
someone/something notices and tells it to stop waiting, or the slave
comes back. For pretty much any high-availability environment, this is
not viable. Based on that alone, I can't imagine a scenario where
synchronous replication would be considered beneficial.

The current setup doubles unplanned system outage scenarios in such a
way that I'd never use it in a production environment. Right now, we only
care if the master server dies. With sync rep, we'd have to watch both
servers like a hawk and be ready to tell the master to disable sync rep,
lest our 10k TPS system come to an absolute halt because the slave died.

With DRBD, when a slave node goes offline, the master operates in
standalone until the secondary re-appears, after which it
re-synchronizes missing data and then resumes operating in sync mode.
Just because the data is temporarily out of sync does *not* mean we want
asynchronous replication. I think you'd be hard pressed to find many
users taking advantage of DRBD's async mode. Just because data is
temporarily catching up doesn't mean it will remain in that state.

I would *love* to have the functionality discussed in the patch. If I
can make a case for it, I might even be able to convince my company to
sponsor its addition, provided someone has time to integrate it. Right
now, we're using DRBD so we can have a very short outage window while
the offline node gets promoted, and it works, but that means a basically
idle server at all times. I'd gladly accept a 10-20% performance hit for
sync rep if it meant that other server could reliably act as a read
slave. That's currently impossible because async replication is too
slow, and sync is too fragile for reasons stated above.

Am I totally off-base, here? I was shocked when I actually read the
documentation on how sync rep worked, and saw that no servers would
function properly until at least two were online.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-09 22:15:18
Message-ID: 4FFB57F6.1090802@agliodbs.com
Lists: pgsql-hackers

Shaun,

> PostgreSQL's implementation means the master will block until
> someone/something notices and tells it to stop waiting, or the slave
> comes back. For pretty much any high-availability environment, this is
> not viable. Based on that alone, I can't imagine a scenario where
> synchronous replication would be considered beneficial.

So there's an issue with the definition of "synchronous". What
"synchronous" in "synchronous replication" means is "guarantee zero data
loss or fail the transaction". It does NOT mean "master and slave have
the same transactional data at the same time", as much as that would be
great to have.

There are, indeed, systems where you'd rather shut down the system than
accept writes which were not replicated, or we wouldn't have the
feature. That just doesn't happen to fit your needs (nor, indeed, the
needs of most people who think they want SR).

"Total-consistency" replication is what I think you want, that is, to
guarantee that at any given time a read query on the master will return
the same results as a read query on the standby. Heck, *most* people
would like to have that. You would also be advancing database science
in general if you could come up with a way to implement it.

> slave. That's currently impossible because async replication is too
> slow, and sync is too fragile for reasons stated above.

So I'm unclear on why sync rep would be faster than async rep given that
they use exactly the same mechanism. Explain?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Daniel Farina <daniel(at)heroku(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 06:11:39
Message-ID: CAAZKuFaoT8EQ141hcq433trCikChakJDjdgb7USDx5rAAk5wxA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>
> 1. Slave wants to be synchronous with master. Master wants replication on at least one slave. They have this, and are happy.
> 2. For whatever reason, slave crashes or becomes unavailable.
> 3. Master notices no more slaves are available, and operates in standalone mode, accumulating WAL files until a suitable slave appears.
> 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and re-subscribes to the feed.
> 5. Slave stays in degraded sync (asynchronous) mode until it is caught up, and then switches to synchronous. This makes both master and slave happy, because the *intent* of synchronous replication is fulfilled.
>

So if I get this straight, what you are saying is "be asynchronous
replication unless someone is around, in which case be synchronous" is
the mode you want. I think if your goal is zero-transaction loss then
you would want to rethink this, and that was the goal of SR: two
copies, no matter what, before COMMIT returns from the primary.

However, I think there is something you are stating here that has a
finer point on it: right now, there is no graceful way to attenuate
the speed of commit on a primary to ensure bounded lag of an
*asynchronous* standby. This is a pretty tricky definition: consider
if you bring a standby on-line from archive replay and it shows up in
streaming with pretty high lag, and stops all commit traffic while it
reaches the bounded window of what "acceptable" lag is. That sounds
pretty terrible, too. How does DRBD handle this? It seems like the
catchup phase might be interesting prior art.

On first inspection, the best I can come up with is something like "if
the standby is making progress but failing to converge, attenuate the
primary's speed of COMMIT until convergence is projected to occur
within an acceptable time" or something like that.

Relatedly, this is one of the ugliest problems I
have with continuous archiving: there is no graceful way to attenuate
the speed of operations to prevent backlog that can fill up the disk
containing pg_xlog. It also makes it very hard to very strictly bound
the amount of data that can remain outstanding and unarchived. To get
around this, I was planning on very carefully making use of the status
messages supplied that inform synchronous replication to block and
unblock operations, but perhaps a less strained interface is possible
with some kind of cooperation from Postgres.

--
fdr


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Daniel Farina'" <daniel(at)heroku(dot)com>, <sthomas(at)optionshouse(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 06:42:39
Message-ID: 004c01cd5e67$33d657a0$9b8306e0$@kapila@huawei.com
Lists: pgsql-hackers

> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Daniel Farina
> Sent: Tuesday, July 10, 2012 11:42 AM
>> On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>>
>> 1. Slave wants to be synchronous with master. Master wants replication on at least one slave. They have this, and are happy.
>> 2. For whatever reason, slave crashes or becomes unavailable.
>> 3. Master notices no more slaves are available, and operates in standalone mode, accumulating WAL files until a suitable slave appears.
>> 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and re-subscribes to the feed.
>> 5. Slave stays in degraded sync (asynchronous) mode until it is caught up, and then switches to synchronous. This makes both master and slave happy, because the *intent* of synchronous replication is fulfilled.
>>

> So if I get this straight, what you are saying is "be asynchronous
> replication unless someone is around, in which case be synchronous" is
> the mode you want. I think if your goal is zero-transaction loss then
> you would want to rethink this, and that was the goal of SR: two
> copies, no matter what, before COMMIT returns from the primary.

For such cases, could an option be provided so that the user can
choose to switch the mode to async?


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, sthomas(at)optionshouse(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 07:38:38
Message-ID: CABUevEzeA3D9XmOyRBOpKicKABQ-SBL3zqqqFug+m3Pr2eV3Bw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 10, 2012 at 8:42 AM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
>> From: pgsql-hackers-owner(at)postgresql(dot)org
>> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Daniel Farina
>> Sent: Tuesday, July 10, 2012 11:42 AM
>>> On Mon, Jul 9, 2012 at 1:30 PM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>>>
>>> 1. Slave wants to be synchronous with master. Master wants replication on at least one slave. They have this, and are happy.
>>> 2. For whatever reason, slave crashes or becomes unavailable.
>>> 3. Master notices no more slaves are available, and operates in standalone mode, accumulating WAL files until a suitable slave appears.
>>> 4. Slave finishes rebooting/rebuilding/upgrading/whatever, and re-subscribes to the feed.
>>> 5. Slave stays in degraded sync (asynchronous) mode until it is caught up, and then switches to synchronous. This makes both master and slave happy, because the *intent* of synchronous replication is fulfilled.
>>>
>
>> So if I get this straight, what you are saying is "be asynchronous
>> replication unless someone is around, in which case be synchronous" is
>> the mode you want. I think if your goal is zero-transaction loss then
>> you would want to rethink this, and that was the goal of SR: two
>> copies, no matter what, before COMMIT returns from the primary.
>
> For such cases, could an option be provided so that the user can
> choose to switch the mode to async?

You can already change synchronous_standby_names, and do so without a
restart. That will change between sync and async just fine on a live
system. And you can control that from some external monitor to define
your own rules for exactly when it should drop to async mode.
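
For instance (a sketch, assuming a stock 9.1 setup where the parameter
lives in postgresql.conf; the monitor's failure-detection logic is up
to you): after the monitor edits postgresql.conf to set
synchronous_standby_names = '' (empty disables sync rep), it just asks
the server to re-read its configuration:

    -- signal the postmaster to re-read postgresql.conf; no restart needed
    SELECT pg_reload_conf();

    -- confirm the new value is in effect
    SHOW synchronous_standby_names;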

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 13:28:46
Message-ID: 4FFC2E0E.2090509@optionshouse.com
Lists: pgsql-hackers

On 07/10/2012 01:11 AM, Daniel Farina wrote:

> So if I get this straight, what you are saying is "be asynchronous
> replication unless someone is around, in which case be synchronous"
> is the mode you want.

Er, no. I think I see where you might have gotten that, but no.

> This is a pretty tricky definition: consider if you bring a standby
> on-line from archive replay and it shows up in streaming with pretty
> high lag, and stops all commit traffic while it reaches the bounded
> window of what "acceptable" lag is. That sounds pretty terrible, too.
> How does DRBD handle this? It seems like the catchup phase might be
> interesting prior art.

Well, DRBD actually has a very definitive sync mode, and no
"attenuation" is involved at all. Here's what a fully working cluster
looks like, according to /proc/drbd:

cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate

Here's what happens when I disconnect the secondary:

cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown

So there are a few things here:

1. Primary is waiting for the secondary to reconnect.
2. It knows its own data is still up to date.
3. It's waiting to assess the secondary when it re-appears.
4. It's still capable of writing to the device.

This is more akin to degraded RAID-1. Writes are synchronous as long as
two devices exist, but if one vanishes, you can still use the disk at
your own risk. Checking the status of DRBD will show this readily. I
also want to point out it is *fully* synchronous when both nodes are
available. I.e., you can't even call a filesystem sync without the sync
succeeding on both nodes.

When you re-connect a secondary device, it catches up as fast as
possible by replaying waiting transactions, and then re-attaching to the
cluster. Until it's fully caught-up, it doesn't exist. DRBD acknowledges
the secondary is there and attempting to catch up, but does not leave
"degraded" mode until the secondary reaches "UpToDate" status.

This is a much more graceful failure scenario than is currently possible
with PostgreSQL. With DRBD, you'd still need a tool to notice the master
node is in an invalid state and perform a failover, but the secondary
going belly-up will not suddenly halt the master.

But I'm not even hoping for *that* level of functionality. I just want
to be able to tell PostgreSQL to notice when the secondary becomes
unavailable *on its own*, and then perform in "degraded non-sync mode"
because it's much faster than any monitor I can possibly attach to
perform the same function. I plan on using DRBD until either PG can do
that, or a better alternative presents itself.

Async is simply too slow for our OLTP system except for the disaster
recovery node, which isn't expected to carry on within seconds of the
primary's failure. I briefly considered sync mode when it appeared as a
feature, but I see it's still too early in its development cycle,
because there are no degraded operation modes. That's fine, I'm willing
to wait.

I just don't understand the push-back, I guess. RAID-1 is the poster
child for synchronous writes for fault tolerance. It will whine
constantly to anyone who will listen when operating only on one device,
but at least it still works. I'm pretty sure nobody would use RAID-1 if
its failure mode was: block writes until someone installs a replacement
disk.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: sthomas(at)optionshouse(dot)com
Cc: Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 14:04:53
Message-ID: CAC_2qU9WKkxAi5Uqe3EciBKzHMFsrxhf+hG5tur48boWQareSQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 10, 2012 at 9:28 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:

> Async is simply too slow for our OLTP system except for the disaster
> recovery node, which isn't expected to carry on within seconds of the
> primary's failure. I briefly considered sync mode when it appeared as a
> feature, but I see it's still too early in its development cycle, because
> there are no degraded operation modes. That's fine, I'm willing to wait.

But this is where some of us are confused by what you're asking for.
Async is actually *FASTER* than sync. It's got less overhead.
Synchronous replication is basically async replication with extra
overhead: an artificial delay on the master for the commit to
*RETURN* to the client. The data is still committed and viewable by
new queries on the master, and on the slave at the same rate as with
async replication. It's just that the commit status returned to the
client is delayed.

So the "async is too slow" part is what we don't understand.

> I just don't understand the push-back, I guess. RAID-1 is the poster child
> for synchronous writes for fault tolerance. It will whine constantly to
> anyone who will listen when operating only on one device, but at least it
> still works. I'm pretty sure nobody would use RAID-1 if its failure mode
> was: block writes until someone installs a replacement disk.

I think most of us in the "synchronous replication must be synchronous
replication" camp are there because the guarantees of a simple RAID 1
just aren't good enough for us ;-)

a.

--
Aidan Van Dyk                            Create like a god,
aidan(at)highrise(dot)ca                       command like a king,
http://www.highrise.ca/                  work like a slave.


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 14:31:50
Message-ID: 4FFC3CD6.8050607@optionshouse.com
Lists: pgsql-hackers

On 07/09/2012 05:15 PM, Josh Berkus wrote:

> "Total-consistency" replication is what I think you want, that is, to
> guarantee that at any given time a read query on the master will return
> the same results as a read query on the standby. Heck, *most* people
> would like to have that. You would also be advancing database science
> in general if you could come up with a way to implement it.

Doesn't having consistent transactional state across the systems imply that?

> So I'm unclear on why sync rep would be faster than async rep given
> that they use exactly the same mechanism. Explain?

Too many mental gymnastics. I get that async is "faster" than sync, but
the inconsistent transactional state makes it *look* slower. If a
customer makes an order, but just happens to check that order state on
the secondary before it can catch up, that's a net loss. Like I said,
that's fine for our DR system, or a reporting mirror, or any one of
several use-case scenarios, but it's not good enough for a failover when
better alternatives exist. In this case, better alternatives are
anything that can guarantee transaction durability: DRBD / PG sync.

PG sync mode does what I want in that regard; it just has no graceful
failure state without relatively invasive intervention. Theoretically we
could write a Pacemaker agent, or some other simple harness, that just
monitors both servers and performs an LSB HUP after modifying the
primary node to disable synchronous_standby_names if the secondary dies,
or promotes the secondary if the primary dies. But after being spoiled
by DRBD, which notices the instant the secondary disconnects yet remains
available until it is restored, we can't justifiably switch to something
that will leave the primary hung for ten seconds between monitor checks
and service reloads.

I'm just saying I considered it briefly during testing the last few
days, but there's no way I can make a business case for it. PG sync rep
is a great step forward, but it's not for us. Yet.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 14:40:39
Message-ID: 4FFC3EE7.9010705@enterprisedb.com
Lists: pgsql-hackers

On 10.07.2012 17:31, Shaun Thomas wrote:
> On 07/09/2012 05:15 PM, Josh Berkus wrote:
>> So I'm unclear on why sync rep would be faster than async rep given
>> that they use exactly the same mechanism. Explain?
>
> Too many mental gymnastics. I get that async is "faster" than sync, but
> the inconsistent transactional state makes it *look* slower. If a
> customer makes an order, but just happens to check that order state on
> the secondary before it can catch up, that's a net loss. Like I said,
> that's fine for our DR system, or a reporting mirror, or any one of
> several use-case scenarios, but it's not good enough for a failover when
> better alternatives exist. In this case, better alternatives are
> anything that can guarantee transaction durability: DRBD / PG sync.
>
> PG sync mode does what I want in that regard; it just has no graceful
> failure state without relatively invasive intervention.

You are mistaken. PostgreSQL's synchronous replication does not
guarantee that the transaction is immediately replayed on the standby.
It only guarantees that it's been sync'd to disk on the standby, but if
there are open snapshots or the system is simply busy, it might take
minutes or more until the effects of that transaction become visible.
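
You can watch that gap from the master; a sketch against the 9.1
pg_stat_replication view:

    -- flush_location is what sync rep waits on; replay_location is what
    -- read queries on the standby can actually see, and it may trail it
    SELECT application_name, state, sync_state,
           sent_location, write_location, flush_location, replay_location
    FROM pg_stat_replication;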

I agree that such a mode would be highly useful, where a transaction is
not acknowledged to the client as committed until it's been replicated
*and* replayed in the standby. And in that mode, a timeout after which
the master just goes ahead without the standby would be useful. You
could then configure your middleware and/or standby to not use the
standby server for queries after that timeout.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 15:36:11
Message-ID: 4FFC4BEB.9050908@optionshouse.com
Lists: pgsql-hackers

On 07/10/2012 09:40 AM, Heikki Linnakangas wrote:

> You are mistaken. It only guarantees that it's been sync'd to disk on
> the standby, but if there are open snapshots or the system is simply
> busy, it might take minutes or more until the effects of that
> transaction become visible.

Well, crap. It's subtle distinctions like this I wish I'd noticed
before. Doesn't really affect our plans, it just makes sync rep even
less viable for our use case. Thanks for the correction! :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Daniel Farina <daniel(at)heroku(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 15:46:46
Message-ID: CAAZKuFZ8xTJ=99uKBYF8Bgkg7kd5u7oDRNPKxL85mdLa_4WdfQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 10, 2012 at 6:28 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> On 07/10/2012 01:11 AM, Daniel Farina wrote:
>
>> So if I get this straight, what you are saying is "be asynchronous
>> replication unless someone is around, in which case be synchronous"
>> is the mode you want.
>
>
> Er, no. I think I see where you might have gotten that, but no.

From your other communications, this sounds like exactly what you
want, because RAID-1 is rather like this: on writes, a degraded RAID-1
need not wait on its (non-existent) mirror, and can be faster, but
once it has caught up it is not allowed to leave synchronization,
which is slower than writing to one disk alone, since it is the
maximum of the time taken to write to two disks. While in the
degraded state there is effectively only one copy of the data, and
while a mirror rebuild is occurring the replication is effectively
asynchronous to bring it up to date.

--
fdr


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 16:57:24
Message-ID: 4FFC5EF4.4090304@agliodbs.com
Lists: pgsql-hackers

Shaun,

> Too many mental gymnastics. I get that async is "faster" than sync, but
> the inconsistent transactional state makes it *look* slower. If a
> customer makes an order, but just happens to check that order state on
> the secondary before it can catch up, that's a net loss. Like I said,
> that's fine for our DR system, or a reporting mirror, or any one of
> several use-case scenarios, but it's not good enough for a failover when
> better alternatives exist. In this case, better alternatives are
> anything that can guarantee transaction durability: DRBD / PG sync.

Per your exchange with Heikki, that's not actually how SyncRep works in
9.1. So it's not giving you what you want anyway.

This is why we felt that the "sync rep if you can" mode was useless and
didn't accept it into 9.1. The *only* difference between sync rep and
async rep is whether or not the master waits for ack that the standby
has written to log.

I think one of the new modes in 9.2 forces synch-to-DB before ack. No?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: <sthomas(at)optionshouse(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 21:42:52
Message-ID: m2bojnb8df.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Shaun Thomas <sthomas(at)optionshouse(dot)com> writes:
> When you re-connect a secondary device, it catches up as fast as possible by
> replaying waiting transactions, and then re-attaching to the cluster. Until
> it's fully caught-up, it doesn't exist. DRBD acknowledges the secondary is
> there and attempting to catch up, but does not leave "degraded" mode until
> the secondary reaches "UpToDate" status.

That's exactly what happens with PostgreSQL when using asynchronous
replication and archiving. When joining the cluster, the standby will
feed from the archives until there's nothing recent enough left over
there, and only at that point will it contact the master.

For a truly graceful setup you need both archiving and replication.

Then, synchronous replication means that no transaction can make it to
the master alone. The use case is not being allowed to tell the client
it's OK when you're at risk of losing the transaction because the master
crashed while it was the only one knowing about it.

What you explain you want reads to me as "Async replication + Archiving".

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: sthomas(at)optionshouse(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-10 23:02:43
Message-ID: CAAZKuFZTZ004Fc=4rkAtZP94bdnw3UfriV6VEj4u5YwaV++mMw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 10, 2012 at 2:42 PM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> What you explain you want reads to me as "Async replication + Archiving".

Notable caveat: one can't very easily measure or bound the amount of
transaction loss in any graceful way as-is. We only have "unlimited
lag" and "2-safe or bust".

Presumably the DRBD setup run by the original poster can do this:

* run without a partner in a degraded mode (to use common RAID terminology)

* asynchronous rebuild and catch-up of a new remote RAID partner

* switch to synchronous RAID-1, which attenuates the source of block
device changes to get 2-safe reliability (i.e. blocking on
confirmations from two block devices)

However, the tricky part is DRBD's heuristic for deciding when, while
suffering degraded but non-zero performance of the network or block
device, it will drop attempts to replicate to its partner. Postgres's
interpretation is "halt, because 2-safe is currently impossible." DRBD's
seems to be "continue" (but hopefully record a statistic, because who
knows how often you are actually 2-safe, then).

For example, what if DRBD can only complete one page per second for
some reason? Does it simply have the primary wait at this glacial
pace, or drop synchronous replication and go degraded? Or does it do
something more clever than just a timeout?

These may seem like theoretical concerns, but 'slow, but non-zero'
progress has been an actual thorn in my side many times.

Regardless of what DRBD does, I think the problem with the async/sync
duality as it stands is that there is no nice way to manage exposure to
transaction loss under various situations and requirements. I'm not
really sure what a solution might look like; I was going to do
something grotesque and conjure carefully orchestrated standby status
packets to accomplish this.

--
fdr


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: sthomas(at)optionshouse(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-11 10:03:34
Message-ID: m2liiqaa2x.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Daniel Farina <daniel(at)heroku(dot)com> writes:
> Notable caveat: one can't very easily measure or bound the amount of
> transaction loss in any graceful way as-is. We only have "unlimited
> lag" and "2-safe or bust".

¡per-transaction!

You can change your mind mid-transaction and ask for 2-safe or bust.
That's the detail we've not been talking about in this thread and makes
the whole solution practical in real life, at least for me.
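
A sketch of that per-transaction control (the tables here are made up):

    -- bulk, low-value writes: don't wait for the standby
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO clickstream VALUES (1, 'pageview');
    COMMIT;

    -- the money transaction: 2-safe or bust
    BEGIN;
    SET LOCAL synchronous_commit = on;
    INSERT INTO orders VALUES (42, 'BUY');
    COMMIT;  -- blocks until the synchronous standby has flushed this WAL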

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-11 13:41:40
Message-ID: 4FFD8294.3050902@optionshouse.com
Lists: pgsql-hackers

On 07/10/2012 06:02 PM, Daniel Farina wrote:

> For example, what if DRBD can only complete one page per second for
> some reason? Does it simply have the primary wait at this glacial
> pace, or drop synchronous replication and go degraded? Or does it do
> something more clever than just a timeout?

That's a good question, and way beyond what I know about the internals.
:) In practice though, there are configurable thresholds, and if
exceeded, it will invalidate the secondary. When using Pacemaker, we've
actually had instances where the 10G link we had between the servers
died, so each node thought the other was down. That led to the
secondary node self-promoting and trying to steal the VIP from the
primary. Throw in a gratuitous ARP, and you get a huge mess.

That led to what DRBD calls split-brain, because both nodes were
running and writing to the block device. Thankfully, you can actually
tell one node to discard its changes and re-subscribe. Doing that will
replay the transactions from the "good" node on the "bad" one. And even
then, it's a good idea to run an online verify to do a block-by-block
checksum and correct any differences.

Of course, all of that's only possible because it's a block-level
replication. I can't even imagine PG doing anything like that. It would
have to know the last good transaction from the primary and do an
implied PIT recovery to reach that state, then re-attach for sync commits.

> Regardless of what DRBD does, I think the problem with the
> async/sync duality as-is is there is no nice way to manage exposure
> to transaction loss under various situations and requirements.

Which would be handy. With synchronous commits, it's given that the
protocol is bi-directional. Then again, PG can detect when clients
disconnect the instant they do so, and having such an event implicitly
disable synchronous_standby_names until reconnect would be an easy fix.
The database already keeps transaction logs, so replaying would still
happen on re-attach. It could easily throw a warning for every
sync-required commit so long as it's in "degraded" mode. Those alone are
very small changes that don't really harm the intent of sync commit.

That's basically what a RAID-1 does, and people have been fine with that
for decades.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: <sthomas(at)optionshouse(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-11 15:19:01
Message-ID: m2liiq72ca.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Shaun Thomas <sthomas(at)optionshouse(dot)com> writes:
>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.

Yeah.

> Which would be handy. With synchronous commits, it's given that the protocol
> is bi-directional. Then again, PG can detect when clients disconnect the
> instant they do so, and having such an event implicitly disable

It's not always possible, given how TCP works, if I understand correctly.

> synchronous_standby_names until reconnect would be an easy fix. The database
> already keeps transaction logs, so replaying would still happen on
> re-attach. It could easily throw a warning for every sync-required commit so
> long as it's in "degraded" mode. Those alone are very small changes that
> don't really harm the intent of sync commit.

We already have that, with the archives. The missing piece is how to
apply that to Synchronous Replication…

> That's basically what a RAID-1 does, and people have been fine with that for
> decades.

… and we want to cover *data* availability (durability), not just
service availability.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-11 18:13:58
Message-ID: 4FFDC266.1040304@agliodbs.com
Lists: pgsql-hackers

On 7/11/12 6:41 AM, Shaun Thomas wrote:
> Which would be handy. With synchronous commits, it's given that the
> protocol is bi-directional. Then again, PG can detect when clients
> disconnect the instant they do so, and having such an event implicitly
> disable synchronous_standby_names until reconnect would be an easy fix.
> The database already keeps transaction logs, so replaying would still
> happen on re-attach. It could easily throw a warning for every
> sync-required commit so long as it's in "degraded" mode. Those alone are
> very small changes that don't really harm the intent of sync commit.

So your suggestion is to have a switch "allow degraded", where if the
sync standby doesn't respond within a certain threshold, the master
switches to async with a warning for each transaction which asks for sync?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: sthomas(at)optionshouse(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-11 18:49:39
Message-ID: CA+TgmoZwWYSyEjxYYjRVbhyc+1bZFuu03qfSJs+vi4f7_3bg=g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 10, 2012 at 12:57 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Per your exchange with Heikki, that's not actually how SyncRep works in
> 9.1. So it's not giving you what you want anyway.
>
> This is why we felt that the "sync rep if you can" mode was useless and
> didn't accept it into 9.1. The *only* difference between sync rep and
> async rep is whether or not the master waits for ack that the standby
> has written to log.
>
> I think one of the new modes in 9.2 forces synch-to-DB before ack. No?

No. Such a mode has been discussed and draft patches have been
circulated, but nothing's been committed. The new mode in 9.2 is less
synchronous than the previous mode (wait for remote write rather than
remote fsync), not more.
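
For reference, that 9.2 mode is exposed as a new value of
synchronous_commit; a sketch:

    -- 9.2: wait for the standby to write the WAL, but not fsync it;
    -- slightly weaker durability in exchange for lower commit latency
    SET synchronous_commit = remote_write;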

Now, if we DID have such a mode, then many people would likely attempt
to use synchronous replication in that mode as a way of ensuring that
read queries can't see stale data, rather than as a method of
providing increased durability. And in that case it sure seems like
it would be useful to wait only if the standby is connected. In fact,
you'd almost certainly want to have multiple standbys running
synchronously, and have the ability to wait for only those connected
at the moment. You might also want to have a way for standbys that
lose their connection to the master to refuse to take any new
snapshots until the slave is reconnected and has caught up. Then you
could guarantee that any query run on the slave will see all the
commits that are visible on the master (and possibly more, since
commits become visible on the slave first), which would be useful for
many applications.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 04:09:06
Message-ID: CAETJ_S-nJrhYzBc3rwLXSoUxFHgC6rCRwBaHuWqmv4PL60qxmg@mail.gmail.com
Lists: pgsql-hackers

Greetings,

On Wed, Jul 11, 2012 at 9:11 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> On 07/10/2012 06:02 PM, Daniel Farina wrote:
>
>> For example, what if DRBD can only complete one page per second for
>> some reason? Does it simply have the primary wait at this glacial
>> pace, or drop synchronous replication and go degraded? Or does it do
>> something more clever than just a timeout?
>
>
> That's a good question, and way beyond what I know about the internals. :)
> In practice though, there are configurable thresholds, and if exceeded, it
> will invalidate the secondary. When using Pacemaker, we've actually had
> instances where the 10G link we had between the servers died, so each node
> thought the other was down. That led to the secondary node self-promoting
> and trying to steal the VIP from the primary. Throw in a gratuitous ARP, and
> you get a huge mess.

That's why Pacemaker *recommends* STONITH (Shoot The Other Node In The
Head). Whenever the standby decides to promote itself, it would just
kill the former master (just in case)... the STONITH mechanism has to
use an independent connection. Additionally, a redundant link between
the cluster nodes is a must.

>
> That led to what DRBD calls split-brain, because both nodes were running
> and writing to the block device. Thankfully, you can actually tell one node
> to discard its changes and re-subscribe. Doing that will replay the
> transactions from the "good" node on the "bad" one. And even then, it's a
> good idea to run an online verify to do a block-by-block checksum and
> correct any differences.
>
> Of course, all of that's only possible because it's a block-level
> replication. I can't even imagine PG doing anything like that. It would have
> to know the last good transaction from the primary and do an implied PIT
> recovery to reach that state, then re-attach for sync commits.
>
>
>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.
>
>
> Which would be handy. With synchronous commits, it's given that the protocol
> is bi-directional. Then again, PG can detect when clients disconnect the
> instant they do so, and having such an event implicitly disable
> synchronous_standby_names until reconnect would be an easy fix. The database
> already keeps transaction logs, so replaying would still happen on
> re-attach. It could easily throw a warning for every sync-required commit so
> long as it's in "degraded" mode. Those alone are very small changes that
> don't really harm the intent of sync commit.
>
> That's basically what a RAID-1 does, and people have been fine with that for
> decades.
>
>

I can't believe how many times I have seen this topic arise in the
mailing list... I was myself about to start a thread like this!
(thanks Shaun!).

I don't really get what people want out of synchronous streaming
replication.... DRBD (which is being used as the comparison) in protocol
C is synchronous (it won't confirm a write unless it was written to disk
on both nodes). PostgreSQL (8.4, 9.0, 9.1, ...) will work just fine
with it, except that you don't have a standby that you can connect
to... also, you need to set up a dedicated volume for the DRBD block
device, set up DRBD, then put the filesystem on top of DRBD, and handle
the DRBD promotion and partition mount (with possible FS error
handling), and then start PostgreSQL after the FS is correctly
mounted......

With synchronous streaming replication you can have about the same:
the standby will have the changes written to disk before the master
confirms the commit.... I don't really care if the standby has already
applied the changes to its DB (although that would certainly be
nice).... the point is: the data is on the standby, and if the master
were to crash, and I were to "promote" the standby: the standby would
have the same committed data the server had before it crashed.

So, why are we HA people bothering you DB people so much? To simplify
things: it is simpler to set up synchronous streaming replication than
to set up DRBD + Pacemaker rules to promote DRBD, mount the FS, and
then start pgsql.

Also, there is a great perk to synchronous replication with Hot
Standby: you have a read-only standby that can be used for some things
(even though it doesn't always have exactly the same data as the
master).

I mean, a lot of people here have a really valid point: 2-safe
reliability is great, but how good is it if, when you lose it, the
whole system just freezes? RAID1 gives you 2-safe reliability, but no
one would use it if the machine were to freeze when you lose 1 disk.
Same for DRBD: it offers 2-safe reliability too (at block level), but
it doesn't freeze if the secondary goes away!

Now, I see some people arguing that, apparently, synchronous
replication is not an HA feature (those who say that SR doesn't fit
the HA environment)... please, those people, answer this: why is
synchronous streaming replication under the High Availability chapter
of the PostgreSQL manual?

I really feel bad that people are so closed to fixing something. I
mean: making the master notice that the standby is no longer there and
just fall back to "standalone" mode seems to bother them so much that
they wouldn't even allow *an option* for it.... we are not asking you
to change the default behavior, just to add an option that makes it
gracefully continue operation and issue warnings. After all, if you
lose a disk in a RAID array, you get some kind of indication of the
failure so you can get it fixed ASAP: you know you are at risk until
you fix it, but you can continue to function... name a single RAID
controller that will shut down your server on a single disk failure? I
haven't seen any card that does that: nobody would buy it.

Adding more on a related issue: what's up with the fact that the
standby doesn't respect wal_keep_segments? This is forcing some people
to copy the WAL files *twice*: once through streaming replication, and
again to a WAL archive. If the master dies and you have more than one
standby (say: 1 synchronous, and 2 asynchronous), you can actually
point the async ones to the sync one once you promote it (as long as
you trick the sync one into *not* switching the timeline, by moving
away recovery.conf and restarting, instead of using "normal"
promotion), but if you don't have the WAL archive, and one of the
standbys was too lagged, it wouldn't be able to recover.

Please, stop arguing on all of this: I don't think that adding an
option will hurt anybody (especially because the work was already done
by someone). We are not asking to change how things work, we just
want an option to decide whether we want it to freeze on standby
disconnection, or whether we want it to continue automatically... is
that asking so much?

Sincerely,

Ildefonso


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 04:18:10
Message-ID: 4FFE5002.8000101@agliodbs.com
Lists: pgsql-hackers


> Please, stop arguing on all of this: I don't think that adding an
> option will hurt anybody (especially because the work was already done
> by someone). We are not asking to change how things work, we just
> want an option to decide whether we want it to freeze on standby
> disconnection, or whether we want it to continue automatically... is
> that asking so much?

The objection is that, *given the way synchronous replication currently
works*, having that kind of an option would make the "synchronous"
setting fairly meaningless. The only benefit that synchronous
replication gives you is the guarantee that a write on the master is
also on the standby. If you remove that guarantee, you are using
asynchronous replication, even if the setting says synchronous.

I think what you really want is a separate "auto-degrade" setting. That
is, a setting which says "if no synchronous standby is present,
auto-degrade to async/standalone, and start writing a bunch of warning
messages to the logs, including whenever anyone runs a synchronous
transaction". That's an approach which makes some sense, but AFAICT it is
somewhat different from the proposed patch.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 04:37:54
Message-ID: CAETJ_S_M=D3z05D4Sv1sPsP3ORN+ci3+a9OJ07BjDMyD6WVucg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 11, 2012 at 11:48 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> Please, stop arguing on all of this: I don't think that adding an
>> option will hurt anybody (especially because the work was already done
>> by someone). We are not asking to change how things work, we just
>> want an option to decide whether we want it to freeze on standby
>> disconnection, or whether we want it to continue automatically... is
>> that asking so much?
>
> The objection is that, *given the way synchronous replication currently
> works*, having that kind of an option would make the "synchronous"
> setting fairly meaningless. The only benefit that synchronous
> replication gives you is the guarantee that a write on the master is
> also on the standby. If you remove that guarantee, you are using
> asynchronous replication, even if the setting says synchronous.

I know how synchronous replication works; I have read it several
times, I have seen it in real life, I have seen it in virtual test
environments. And no, it doesn't make synchronous replication
meaningless, because it will work synchronously if it has someone to
sync to, and work async (or standalone) if it doesn't: that's perfect
for an HA environment.

>
> I think what you really want is a separate "auto-degrade" setting. That
> is, a setting which says "if no synchronous standby is present,
> auto-degrade to async/standalone, and start writing a bunch of warning
> messages to the logs, including whenever anyone runs a synchronous
> transaction". That's an approach which makes some sense, but AFAICT it is
> somewhat different from the proposed patch.

Certainly, it's different from the current patch, but the one I saw, I
believe, had everything you describe there except the additional warning.

As synchronous standby currently is, it just doesn't fit the HA usage,
and if you really want to keep it that way, it doesn't belong in the
HA chapter of the pgsql documentation, and should be moved. And NO,
async replication will *not* work for HA, because the master can have
more transactions than the standby, and if the master crashes, the
standby will have no way to recover those transactions; with synchronous
replication we have *exactly* what we need: the data is on the standby;
after all, it will apply it once we promote it.

Ildefonso.


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: sthomas(at)optionshouse(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 05:17:57
Message-ID: CAAZKuFamGJg7NwUsDDAKv+2Or2u7Jtav1DhKY9m0mek8kBK3Mg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 11, 2012 at 3:03 AM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Daniel Farina <daniel(at)heroku(dot)com> writes:
>> Notable caveat: one can't very easily measure or bound the amount of
>> transaction loss in any graceful way as-is. We only have "unlimited
>> lag" and "2-safe or bust".
>
> ¡per-transaction!
>
> You can change your mind mid-transaction and ask for 2-safe or bust.
> That's the detail we've not been talking about in this thread and makes
> the whole solution practical in real life, at least for me.

It's a pretty good feature, but it's pretty dissatisfying that, as an
administrator or provider, one cannot have the latency of asynchronous
transactions without exposing users to unbounded loss (as opposed to a
user that sets synchronous_commit, as you are saying).

If I had a strong opinion on *how* this should be tunable, I'd voice
it, but I think it's worth insisting that there is a missing part of
this continuum that involves non-zero but not-unbounded risk
management and transaction loss that is under-served. DRBD seems to
have some heuristic that makes people happy that's somewhere
in-between. I'm not saying it should be copied, but the fact it makes
people happy may be worth understanding.

I was quite excited by the syncrep feature because it does open the
door to writing those at all, even if painfully, since we now have
both "unbounded" and "strictly bounded".

--
fdr


From: Daniel Farina <daniel(at)heroku(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 05:31:36
Message-ID: CAAZKuFaTv2SEfbq64rjP3i+aMcN9EktNEMYVuY2MurLEeYn9tQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 11, 2012 at 6:41 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>> Regardless of what DRBD does, I think the problem with the
>> async/sync duality as-is is there is no nice way to manage exposure
>> to transaction loss under various situations and requirements.
>
>
> Which would be handy. With synchronous commits, it's given that the protocol
> is bi-directional. Then again, PG can detect when clients disconnect the
> instant they do so, and having such an event implicitly disable
> synchronous_standby_names until reconnect would be an easy fix. The database
> already keeps transaction logs, so replaying would still happen on
> re-attach. It could easily throw a warning for every sync-required commit so
> long as it's in "degraded" mode. Those alone are very small changes that
> don't really harm the intent of sync commit.
>
> That's basically what a RAID-1 does, and people have been fine with that for
> decades.

But RAID-1 as nominally seen is a fundamentally different problem,
with much tinier differences in latency, bandwidth, and connectivity.
Perhaps useful for study, but to suggest the problem is *that* similar
I think is wrong. I think your wording is even more right here than
you suggest: "That's *basically* what a RAID-1 does".

I'm pretty unhappy with many user-facing aspects of this formulation,
even though I think the fundamental need being addressed is
reasonable. But, putting that aside, why not write a piece of
middleware that does precisely this, or whatever you want? It can live
on the same machine as Postgres and ack synchronous commit when nobody
is home, and notify (e.g. page) you in the most precise way you want
if nobody is home "for a while".

--
fdr


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Jose Ildefonso Camargo Tolosa'" <ildefonso(dot)camargo(at)gmail(dot)com>, <sthomas(at)optionshouse(dot)com>
Cc: "'Daniel Farina'" <daniel(at)heroku(dot)com>, "'Dimitri Fontaine'" <dimitri(at)2ndquadrant(dot)fr>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 06:03:26
Message-ID: 001c01cd5ff4$0e2e80c0$2a8b8240$@kapila@huawei.com
Lists: pgsql-hackers

> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]
> On Behalf Of Jose Ildefonso Camargo Tolosa

> Please, stop arguing on all of this: I don't think that adding an
> option will hurt anybody (especially because the work was already done
> by someone). We are not asking to change how things work, we just
> want an option to decide whether we want it to freeze on standby
> disconnection, or whether we want it to continue automatically... is
> that asking so much?

I think this kind of decision should be made by an outside utility or
scripts.
It would be better if an external tool could detect that the standby is
down during sync replication and send a command to the master to change
its mode or settings appropriately, without stopping the master.
Putting more and more of this kind of logic into the replication code
will make it more cumbersome.

With Regards,
Amit Kapila.


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 13:05:51
Message-ID: m27gu9xh74.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com> writes:
> environments. And no, it doesn't make synchronous replication
> meaningless, because it will work synchronously if it has someone to
> sync to, and async (or standalone) if it doesn't: that's perfect
> for an HA environment.

You seem to want Service Availability when we are providing Data
Availability. I'm not saying you shouldn't ask for what you're asking,
just that it is a different need.

If you troll the archives, you will see that this debate has received
much consideration already. The conclusion is that if you care about
Service Availability you should have 2 standby servers and set them both
as candidates to be the synchronous one.

That way, when you lose one standby the service is unaffected, the
second standby becomes the synchronous one, and it's possible to
re-attach the failed standby live, with or without archiving (with
archiving preferred, so that the master isn't involved in the catch-up
phase).
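
In postgresql.conf on the master that looks like the following, the
names being whatever application_name each standby sets in its
primary_conninfo:

    synchronous_standby_names = 'standby1, standby2'

The first listed standby that is connected is the synchronous one; the
other is reported as "potential" in pg_stat_replication and takes over
as soon as the first one disappears.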

> As synchronous standby currently is, it just doesn't fit the HA usage,

It does actually allow both data high availability and service high
availability, provided that you feed at least two standbys.

What you seem to be asking for is both data and service high availability
with only two nodes. You're right that we cannot provide that with
current releases of PostgreSQL. I'm not sure anyone has a solid plan to
make that happen.

> and if you really want to keep it that way, it doesn't belong in the
> HA chapter of the pgsql documentation, and should be moved. And NO,
> async replication will *not* work for HA, because the master can have
> more transactions than the standby, and if the master crashes, the
> standby will have no way to recover those transactions; with
> synchronous replication we have *exactly* what we need: the data is in
> the standby and, after all, it will be applied once we promote it.

Exactly. We want data availability first. Service availability is
important too, and for that you need another standby.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 13:21:08
Message-ID: 4FFECF44.4010409@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/12/2012 12:31 AM, Daniel Farina wrote:

> But RAID-1 as nominally seen is a fundamentally different problem,
> with much smaller differences in latency, bandwidth, and connectivity.
> Perhaps useful for study, but to suggest the problem is *that* similar
> is, I think, wrong.

Well, yes and no. One of the reasons I brought up DRBD was because it's
basically RAID-1 over a network interface. It's not without overhead,
but a few basic pgbench tests show it's still 10-15% faster than a
synchronous PG setup for two servers in the same rack. Greg Smith's
tests show that beyond a certain point, a synchronous PG setup
effectively becomes untenable simply due to network latency in the
protocol implementation. In reality, it probably wouldn't be usable
beyond two servers in different datacenters in the same city.
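
(Nothing fancy on the pgbench side; think something along the lines of
the following, scale and client counts from memory:

    pgbench -i -s 100 bench
    pgbench -c 16 -j 4 -T 300 bench

run once against DRBD and once against the sync-rep pair.)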

RAID-1 was the model for DRBD, but I brought it up only because it's
pretty much the definition of a synchronous commit that degrades
gracefully. I'd even suggest it's more important in a network context
than for RAID-1, because you're far more likely to get sync
interruptions due to network issues than you are for a disk to fail.

> But, putting that aside, why not write a piece of middleware that
> does precisely this, or whatever you want? It can live on the same
> machine as Postgres and ack synchronous commit when nobody is home,
> and notify (e.g. page) you in the most precise way you want if nobody
> is home "for a while".

You're right that there are lots of ways to kinda get this ability,
they're just not mature enough or capable enough to really matter.
Tailing the log to watch for secondary disconnect is too slow. Monit or
Nagios style checks are too slow and unreliable. A custom-built
middle-layer (a master-slave plugin for Pacemaker, for example) is too
slow. All of these would rely on some kind of check interval. Set that
too high, and we get 10,000 x n missed transactions for n seconds. Too
low, and we'd increase the likelihood of false positives and unnecessary
detachments.

If it's possible through a PG 9.x extension, that'd probably be the way
to *safely* handle it as a bolt-on solution. If the original author of
the patch can convert it to such a beast, we'd install it approximately
five seconds after it finished compiling.

So far as transaction durability is concerned... we have a continuous
background rsync over dark fiber for archived transaction logs, DRBD for
block-level sync, filesystem snapshots for our backups, a redundant
async DR cluster, an offsite backup location, and a tape archival
service stretching back for seven years. And none of that will cause the
master to stop processing transactions unless the master itself dies and
triggers a failover.

Using PG sync in its current incarnation would introduce an extra
failure scenario that wasn't there before. I'm pretty sure we're not the
only ones avoiding it for exactly that reason. Our queue discards
messages it can't fulfil within ten seconds and then throws an error for
each one. We need to decouple the secondary as quickly as possible if it
becomes unresponsive, and there's really no way to do that without
something in the database, one way or another.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: sthomas(at)optionshouse(dot)com
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 13:58:52
Message-ID: CAC_2qU9rDFkUMO6ChANQsnsKQN9N0v5mhUru0r6BqowiNPaO=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:

> So far as transaction durability is concerned... we have a continuous
> background rsync over dark fiber for archived transaction logs, DRBD for
> block-level sync, filesystem snapshots for our backups, a redundant async DR
> cluster, an offsite backup location, and a tape archival service stretching
> back for seven years. And none of that will cause the master to stop
> processing transactions unless the master itself dies and triggers a
> failover.

Right, so suppose the dark fiber between New Orleans and Seattle (pick
two places for your datacenters) happens to be the first thing failing
in your NO data center. Disconnect the sync-ness, and continue. Not a
problem, unless it happens to be Aug 29, 2005.

You have lost data. Maybe only a bit. Maybe it wasn't even
important. But that's not for PostgreSQL to decide.

But because your PG on DRBD "continued" when it couldn't replicate to
Seattle, it told its clients the data was durable, just minutes
before the whole DC was under water.

OK, so a wise admin team would have removed the NO DC from its
primary role days before that hit.

Change the NO to NYC and the date Sept 11, 2001.

OK, so maybe we can concede that these types of major catastrophes are
more devastating to us than losing some data.

Now your primary server was in AWS US East last week. Its sync slave
was in the affected AZ, but your PG primary continues on, until, since
it was an EC2 instance, it disappears. Now where is your data?

Or the fire marshal orders the data center (or whole building) EPO,
and the connection to your backup goes down minutes before your
servers or other network peers do.

> Using PG sync in its current incarnation would introduce an extra failure
> scenario that wasn't there before. I'm pretty sure we're not the only ones
> avoiding it for exactly that reason. Our queue discards messages it can't
> fulfil within ten seconds and then throws an error for each one. We need to
> decouple the secondary as quickly as possible if it becomes unresponsive,
> and there's really no way to do that without something in the database, one
> way or another.

It introduces an "extra failure" because it has introduced an "extra
data durability guarantee".

Sure, many people don't *really* want that data durability guarantee,
even though they would like the "maybe guaranteed" version of it.

But that fine line is actually a difficult (impossible?) one to define
if you don't know, at the moment of decision, what the next few
moments will/could become.

a.

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: 'Jose Ildefonso Camargo Tolosa' <ildefonso(dot)camargo(at)gmail(dot)com>, sthomas(at)optionshouse(dot)com, 'Daniel Farina' <daniel(at)heroku(dot)com>, 'Dimitri Fontaine' <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 16:47:40
Message-ID: 20120712164740.GA11063@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 11:33:26AM +0530, Amit Kapila wrote:
> > From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]
> > On Behalf Of Jose Ildefonso Camargo Tolosa
>
> > Please, stop arguing on all of this: I don't think that adding an
> > option will hurt anybody (especially because the work was already done
> > by someone). We are not asking to change how things work, we just
> > want an option to decide whether we want it to freeze on standby
> > disconnection, or whether we want it to continue automatically... is
> > that asking so much?
>
> I think this kind of decision should be made by an outside utility or
> scripts.
> It would be better if an external tool could detect that the standby is
> down during sync replication and send a command to the master to change
> its mode or settings appropriately, without stopping the master.
> Putting more and more of this kind of logic into the replication code
> will make it more cumbersome.

We certainly would need something external to inform administrators that
the system is no longer synchronous.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Shaun Thomas <sthomas(at)optionshouse(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 17:02:23
Message-ID: 20120712170223.GB11063@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 08:21:08AM -0500, Shaun Thomas wrote:
> >But, putting that aside, why not write a piece of middleware that
> >does precisely this, or whatever you want? It can live on the same
> >machine as Postgres and ack synchronous commit when nobody is home,
> >and notify (e.g. page) you in the most precise way you want if nobody
> >is home "for a while".
>
> You're right that there are lots of ways to kinda get this ability,
> they're just not mature enough or capable enough to really matter.
> Tailing the log to watch for secondary disconnect is too slow. Monit
> or Nagios style checks are too slow and unreliable. A custom-built
> middle-layer (a master-slave plugin for Pacemaker, for example) is
> too slow. All of these would rely on some kind of check interval.
> Set that too high, and we get 10,000 x n missed transactions for n
> seconds. Too low, and we'd increase the likelihood of false
> positives and unnecessary detachments.

Well, the problem also exists if we add it as an internal database feature
--- how long do we wait to consider the standby dead, how do we inform
administrators, etc.

I don't think anyone says the feature is useless, but it isn't going to
be a simple boolean either.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Shaun Thomas <sthomas(at)optionshouse(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-12 20:40:55
Message-ID: 4FFF3657.1010709@optionshouse.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/12/2012 12:02 PM, Bruce Momjian wrote:

> Well, the problem also exists if we add it as an internal database
> feature --- how long do we wait to consider the standby dead, how do
> we inform administrators, etc.

True. Though if there is no secondary connected, either because it's not
there yet, or because it disconnected, that's an easy check. It's the
network lag/stall detection that's tricky.
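
(Easy as in something like:

    SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming';

with a count of zero meaning nobody is attached. Column names from
memory, but the view is there in 9.1.)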

> I don't think anyone says the feature is useless, but it isn't going
> to be a simple boolean either.

Oh $Deity no. I'd never suggest that. I just tend to be overly verbose,
and sometimes my intent gets lost in the rambling as I try to explain my
perspective. I apologize if it somehow came across that anyone could
just flip a switch and have it work.

My C is way too rusty, or I'd be writing an extension right now to do
this, or be looking over that patch I linked to originally to make
suitable adaptations. I know I talk about how relatively handy DRBD is,
but it's also a gigantic PITA since it has to exist underneath the
actual filesystem. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas(at)optionshouse(dot)com



From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 00:27:01
Message-ID: CAETJ_S9Tr8aFhy9xDKExbawgMdnw8NaFkRKdBDsovU8i6nw+0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 8:35 AM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Hi,
>
> Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com> writes:
>> environments. And no, it doesn't make synchronous replication
>> meaningless, because it will work synchronously if it has someone to
>> sync to, and async (or standalone) if it doesn't: that's perfect
>> for an HA environment.
>
> You seem to want Service Availability when we are providing Data
> Availability. I'm not saying you shouldn't ask for what you're asking,
> just that it is a different need.

Yes, and no: I don't see why we can't have an option to choose which
one we want. I can see the point of "data availability": it is better
to freeze the service than to risk losing transactions... however, try
to explain that to some managers: "well, you know, the DB server froze
the whole bank system because, well, the standby server died, and we
didn't want to risk transaction loss, so we just froze the master...
you know, in case the master were to die too before we had a reliable
standby." I don't think a manager would really understand why you
would block the whole company's system just because *the standby*
server died (and why don't you block it when the master dies?!).
Now, maybe that's a bad example, I know a bank should have at least 3
or 4 servers, with some of them in different geographical areas, but
just think of the typical boss.

In "Service Availability", you have data Availability most of the
time, until one of the servers fails (if you have just 2 nodes), what
if you have more than two: well, good for you! But, you can keep
going with a single server, understanding that you are in a high risk,
that have to be fixed real soon (emergency).

>
> If you troll the archives, you will see that this debate has received
> much consideration already. The conclusion is that if you care about
> Service Availability you should have 2 standby servers and set them both
> as candidates to be the synchronous one.

That's more cost, and for most applications it isn't worth the extra cost.

Really, I see the point you have, and I have *never* asked to remove
the data guarantees, but to have an option to relax them if the
particular situation requires it: "enough safety" for a given cost.

>
> That way, when you lose one standby the service is unaffected, the
> second standby becomes the synchronous one, and it's possible to
> re-attach the failed standby live, with or without archiving (with
> archiving preferred, so that the master isn't involved in the catch-up
> phase).
>
>> As synchronous standby currently is, it just doesn't fit the HA usage,
>
> It does actually allow both data high availability and service high
> availability, provided that you feed at least two standbys.

Still, it doesn't fit. You need to spend more on hardware, and more on
power (and money there), and more carbon footprint... you get the
point. Also, having 3 servers for your DB can be necessary (and
possible) for some companies, but for others: no.

>
> What you seem to be asking for is both data and service high availability
> with only two nodes. You're right that we cannot provide that with
> current releases of PostgreSQL. I'm not sure anyone has a solid plan to
> make that happen.
>
>> and if you really want to keep it that way, it doesn't belong in the
>> HA chapter of the pgsql documentation, and should be moved. And NO,
>> async replication will *not* work for HA, because the master can have
>> more transactions than the standby, and if the master crashes, the
>> standby will have no way to recover those transactions; with
>> synchronous replication we have *exactly* what we need: the data is in
>> the standby and, after all, it will be applied once we promote it.
>
> Exactly. We want data availability first. Service availability is
> important too, and for that you need another standby.

Yeah, you need that with PostgreSQL, but not with DRBD, for example
(sorry, but DRBD is one of the flagships of HA things in the Linux
world). Also, I'm not convinced about the "2nd standby" thing... I
mean, just read this in the docs, which is a little alarming:

"If primary restarts while commits are waiting for acknowledgement,
those waiting transactions will be marked fully committed once the
primary database recovers. There is no way to be certain that all
standbys have received all outstanding WAL data at time of the crash
of the primary. Some transactions may not show as committed on the
standby, even though they show as committed on the primary. The
guarantee we offer is that the application will not receive explicit
acknowledgement of the successful commit of a transaction until the
WAL data is known to be safely received by the standby."

So... there is no *real* guarantee here either... I don't know how I
skipped that paragraph before today... I mean, this implies that it
is possible for a transaction to be marked as committed on the
master while the app was not informed of that (and thus could try to
send it again), and the transaction was NOT applied on the standby...
how can this happen? I mean, when the master comes back, shouldn't the
standby get the missing WAL pieces from the master and then apply the
transaction? The standby part is the one that I don't really get; on
the application side... well, there are several ways in which you can
miss the "commit confirmation": connection issues at the worst moment,
and such, so I guess it is not *so* serious, and the app should
have a way of checking its last transaction if it lost connectivity to
the server before getting the transaction committed.


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: sthomas(at)optionshouse(dot)com, Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 00:38:57
Message-ID: CAETJ_S-jFznJWX=wZfzwVMF4jxMLmWCH0GVLYW2aQ3sK0JMPoA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 9:28 AM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
>
>> So far as transaction durability is concerned... we have a continuous
>> background rsync over dark fiber for archived transaction logs, DRBD for
>> block-level sync, filesystem snapshots for our backups, a redundant async DR
>> cluster, an offsite backup location, and a tape archival service stretching
>> back for seven years. And none of that will cause the master to stop
>> processing transactions unless the master itself dies and triggers a
>> failover.
>
> Right, so suppose the dark fiber between New Orleans and Seattle (pick
> two places for your datacenters) happens to be the first thing failing
> in your NO data center. Disconnect the sync-ness, and continue. Not a
> problem, unless it happens to be Aug 29, 2005.
>
> You have lost data. Maybe only a bit. Maybe it wasn't even
> important. But that's not for PostgreSQL to decide.

I never asked for it... but, you (the one who is configuring the
system) can decide, and should be able to decide... right now: we
can't decide.

>
> But because your PG on DRBD "continued" when it couldn't replicate to
> Seattle, it told its clients the data was durable, just minutes
> before the whole DC was under water.

Yeah, well, what is the probability of all of that?... really tiny. I
bet it is more likely that you win the lottery than that all of these
events happen within that time frame. But risking monetary losses
because, for example, the online store stopped accepting orders while
the standby server was down, that's not acceptable for some companies
(and some companies just can't buy 3 x DB servers, or more!).

>
> OK, so a wise admin team would have removed the NO DC from its
> primary role days before that hit.
>
> Change the NO to NYC and the date Sept 11, 2001.
>
> OK, so maybe we can concede that these types of major catastrophes are
> more devastating to us than losing some data.
>
> Now your primary server was in AWS US East last week. Its sync slave
> was in the affected AZ, but your PG primary continues on, until, since
> it was an EC2 instance, it disappears. Now where is your data?

Who would *really* trust your PostgreSQL DB to EC2?... I mean, the I/O
is not very good, and the price is not exactly so low that you would
take that risk.

All in all: you are still stringing together coincidences that have *so
low* a probability...

>
> Or the fire marshal orders the data center (or whole building) EPO,
> and the connection to your backup goes down minutes before your
> servers or other network peers do.
>
>> Using PG sync in its current incarnation would introduce an extra failure
>> scenario that wasn't there before. I'm pretty sure we're not the only ones
>> avoiding it for exactly that reason. Our queue discards messages it can't
>> fulfil within ten seconds and then throws an error for each one. We need to
>> decouple the secondary as quickly as possible if it becomes unresponsive,
>> and there's really no way to do that without something in the database, one
>> way or another.
>
> It introduces an "extra failure" because it has introduced an "extra
> data durability guarantee".
>
> Sure, many people don't *really* want that data durability guarantee,
> even though they would like the "maybe guaranteed" version of it.
>
> But that fine line is actually a difficult (impossible?) one to define
> if you don't know, at the moment of decision, what the next few
> moments will/could become.

You *never* know. And the truth is that you have to make the decision
with what you have. If you can pay for 10 servers nationwide: good for
you; not all of us can afford that (man, I could barely pay for two,
and that's because I *know* I don't want to risk losing the data or
the service because a single server died).

As it currently is, freezing the master because the standby dies is
not good for all cases (and I dare say: not for most cases), and
having to wait for Pacemaker or other monitoring to notice that, change
the master config, and reload... will cause a service disruption! (for
several seconds, usually ~30 seconds).


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, sthomas(at)optionshouse(dot)com, Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 00:42:35
Message-ID: CAETJ_S8uFsFqsAcd6FEsB_J7EgBpFqdDOAxo2V0X-rwe07h=AQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 12:17 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Thu, Jul 12, 2012 at 11:33:26AM +0530, Amit Kapila wrote:
>> > From: pgsql-hackers-owner(at)postgresql(dot)org
>> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]
>> > On Behalf Of Jose Ildefonso Camargo Tolosa
>>
>> > Please, stop arguing on all of this: I don't think that adding an
>> > option will hurt anybody (especially because the work was already done
>> > by someone). We are not asking to change how things work, we just
>> > want an option to decide whether we want it to freeze on standby
>> > disconnection, or whether we want it to continue automatically... is
>> > that asking so much?
>>
>> I think this kind of decision should be made by an outside utility or
>> scripts.
>> It would be better if an external tool could detect that the standby is
>> down during sync replication and send a command to the master to change
>> its mode or settings appropriately, without stopping the master.
>> Putting more and more of this kind of logic into the replication code
>> will make it more cumbersome.
>
> We certainly would need something external to inform administrators that
> the system is no longer synchronous.

That is *mandatory*, just as you monitor DRBD or disk arrays: if a
disk fails, an alert has to be issued, so it can be fixed as soon as
possible.

But such alerts can wait 30 seconds to be sent out, so any monitoring
system would be able to handle that; we just need to get the current
system status from the monitoring system and create corresponding
rules: a simple matter, actually.


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 00:59:43
Message-ID: CAC_2qU89gFBuapR0n2rX1anUzazaQAXoUAdJEDCcfpuc6L_EqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 8:27 PM, Jose Ildefonso Camargo Tolosa

> Yeah, you need that with PostgreSQL, but not with DRBD, for example
> (sorry, but DRBD is one of the flagships of HA things in the Linux
> world). Also, I'm not convinced about the "2nd standby" thing... I
> mean, just read this in the docs, which is a little alarming:
>
> "If primary restarts while commits are waiting for acknowledgement,
> those waiting transactions will be marked fully committed once the
> primary database recovers. There is no way to be certain that all
> standbys have received all outstanding WAL data at time of the crash
> of the primary. Some transactions may not show as committed on the
> standby, even though they show as committed on the primary. The
> guarantee we offer is that the application will not receive explicit
> acknowledgement of the successful commit of a transaction until the
> WAL data is known to be safely received by the standby."
>
> So... there is no *real* guarantee here either... I don't know how I
> skipped that paragraph before today... I mean, this implies that it
> is possible for a transaction to be marked as committed on the
> master while the app was not informed of that (and thus could try to
> send it again), and the transaction was NOT applied on the standby...
> how can this happen? I mean, when the master comes back, shouldn't the
> standby get the missing WAL pieces from the master and then apply the
> transaction? The standby part is the one that I don't really get; on
> the application side... well, there are several ways in which you can
> miss the "commit confirmation": connection issues at the worst moment,
> and such, so I guess it is not *so* serious, and the app should
> have a way of checking its last transaction if it lost connectivity to
> the server before getting the transaction committed.

But you already have that in a single-server situation as well. There
is a window between when the commit is "durable" (and visible to
others, and will be committed after recovery from a crash) and when the
client knows it's committed (it might never get the commit message due
to a server crash, network disconnect, client middle-tier crash, etc.).

So people are already susceptible to that, and defending against it, no? ;-)

And they are susceptible to that if they are on PostgreSQL, Oracle, MS
SQL, DB2, etc.

a.

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 01:26:51
Message-ID: CAETJ_S84UuyR=tT4ej-TxysTj4rTmAGtieRou+NeyysOnHLPeA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 8:29 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Thu, Jul 12, 2012 at 8:27 PM, Jose Ildefonso Camargo Tolosa
>
>> Yeah, you need that with PostgreSQL, but not with DRBD, for example
>> (sorry, but DRBD is one of the flagships of HA things in the Linux
>> world). Also, I'm not convinced about the "2nd standby" thing... I
>> mean, just read this in the docs, which is a little alarming:
>>
>> "If primary restarts while commits are waiting for acknowledgement,
>> those waiting transactions will be marked fully committed once the
>> primary database recovers. There is no way to be certain that all
>> standbys have received all outstanding WAL data at time of the crash
>> of the primary. Some transactions may not show as committed on the
>> standby, even though they show as committed on the primary. The
>> guarantee we offer is that the application will not receive explicit
>> acknowledgement of the successful commit of a transaction until the
>> WAL data is known to be safely received by the standby."
>>
>> So... there is no *real* guarantee here either... I don't know how I
>> skipped that paragraph before today... I mean, this implies that it
>> is possible for a transaction to be marked as committed on the
>> master while the app was not informed of that (and thus could try to
>> send it again), and the transaction was NOT applied on the standby...
>> how can this happen? I mean, when the master comes back, shouldn't the
>> standby get the missing WAL pieces from the master and then apply the
>> transaction? The standby part is the one that I don't really get; on
>> the application side... well, there are several ways in which you can
>> miss the "commit confirmation": connection issues at the worst moment,
>> and such, so I guess it is not *so* serious, and the app should
>> have a way of checking its last transaction if it lost connectivity to
>> the server before getting the transaction committed.
>
> But you already have that in a single-server situation as well. There
> is a window between when the commit is "durable" (and visible to
> others, and will be committed after recovery from a crash) and when the
> client knows it's committed (it might never get the commit message due
> to a server crash, network disconnect, client middle-tier crash, etc.).
>
> So people are already susceptible to that, and defending against it, no? ;-)

Right. What I'm saying is that particular part of the docs:

"If primary restarts while commits are waiting for acknowledgement,
those waiting transactions will be marked fully committed once the
primary database recovers. "(....)"Some transactions may not show as
committed on the standby, even though they show as committed on the
primary."(...)

See? It sounds like, after the primary database recovers, the standby
will still not have the transaction committed, and as far as I thought
I knew, the standby should get that over the WAL stream from the master
once it reconnects to it.

>
> And they are susceptible to that if they are on PostgreSQL, Oracle, MS
> SQL, DB2, etc.

Certainly. That's why I said:

(...)"The standby part is the one that I don't really get, on
the application side... well, there are several ways in which you can
miss the "commit confirmation": connection issues in the worst moment,
and the such, so, I guess it is not *so* serious, and the app should
have a way of checking its last transaction if it lost connectivity to
server before getting the transaction commited."


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 01:36:26
Message-ID: CAETJ_S8Q4abZJ2wv=Me8oMR--N9GzZfGPOqcbhMePAk+djHgBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 4:10 PM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> On 07/12/2012 12:02 PM, Bruce Momjian wrote:
>
>> Well, the problem also exists if we add it as an internal database
>> feature --- how long do we wait to consider the standby dead, how do
>> we inform administrators, etc.
>
>
> True. Though if there is no secondary connected, either because it's not
> there yet, or because it disconnected, that's an easy check. It's the
> network lag/stall detection that's tricky.

Well, yes... but how does PostgreSQL currently notice that its "main
synchronous standby" went away and that it has to use another standby
as synchronous? How long does it take to notice that?

>
>
>> I don't think anyone says the feature is useless, but it isn't going
>> to be a simple boolean either.
>
>
> Oh $Deity no. I'd never suggest that. I just tend to be overly verbose, and
> sometimes my intent gets lost in the rambling as I try to explain my
> perspective. I apologize if it somehow came across that anyone could just
> flip a switch and have it work.
>
> My C is way too rusty, or I'd be writing an extension right now to do this,
> or be looking over that patch I linked to originally to make suitable
> adaptations. I know I talk about how relatively handy DRBD is, but it's also
> a gigantic PITA since it has to exist underneath the actual filesystem. :)


From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Jose Ildefonso Camargo Tolosa'" <ildefonso(dot)camargo(at)gmail(dot)com>, "'Aidan Van Dyk'" <aidan(at)highrise(dot)ca>
Cc: <sthomas(at)optionshouse(dot)com>, "'Daniel Farina'" <daniel(at)heroku(dot)com>, "'Dimitri Fontaine'" <dimitri(at)2ndquadrant(dot)fr>, <pgsql-hackers(at)postgresql(dot)org>, "'Bruce Momjian'" <bruce(at)momjian(dot)us>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 04:55:37
Message-ID: 002b01cd60b3$bf3f1f90$3dbd5eb0$@kapila@huawei.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org]
> On Behalf Of Jose Ildefonso Camargo Tolosa
>>On Thu, Jul 12, 2012 at 9:28 AM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com>
wrote:
>

> As it currently is, freezing the master because the standby dies is
> not good for all cases (and I dare say: not for most cases), and
> having to wait for Pacemaker or other monitoring to notice that, change
> the master config, and reload... will cause a service disruption! (for
> several seconds, usually ~30 seconds).

Yes, it is true that it can cause service disruption, but the same will
be true even if the master detects that internally via a timeout.
By keeping this external, the current behavior of PostgreSQL can be
maintained: if there is no standby in sync mode, it will wait, and the
purpose is still served, because the message to the master can be sent
externally.
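
(For reference, the knobs that already govern how quickly a dead
standby is noticed are, if I remember correctly, these:

    # on master: drop replication connections silent longer than this
    replication_timeout = 60s
    # on standby: how often to report status back to the master
    wal_receiver_status_interval = 10s

An external utility would poll on a similar schedule.)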


From: Hampus Wessman <hampus(at)hampuswessman(dot)se>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 07:12:56
Message-ID: 4FFFCA78.6050906@hampuswessman.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Here are some (slightly too long) thoughts about this.

Shaun Thomas wrote 2012-07-12 22:40:
> On 07/12/2012 12:02 PM, Bruce Momjian wrote:
>
>> Well, the problem also exists if we add it as an internal database
>> feature --- how long do we wait to consider the standby dead, how do
>> we inform administrators, etc.
>
> True. Though if there is no secondary connected, either because it's not
> there yet, or because it disconnected, that's an easy check. It's the
> network lag/stall detection that's tricky.

It is indeed tricky to detect this. If you don't get an (immediate)
reply from the secondary (and you never do!), then all you can do is
wait and *eventually* (after how long? 250ms? 10s?) assume that there is
no connection between them. The conclusion may very well be wrong
sometimes. A second problem is that we still don't know if this is
caused by some kind of network problems or if it's caused by the
secondary not running. It's perfectly possible that both servers are
working, but just can't communicate at the moment.

The thing is that what we do next (at least if our data is important and
why otherwise use synchronous replication of any kind...) depends on
what *did* happen. Assume that we have two database servers. At any time
we need at most one primary database to be running. Without that
requirement our data can get messed up completely... If HA is important
to us, we may choose to do a failover to the secondary (and live without
replication for the moment) if the primary fails. With synchronous
replication, we can do this without losing any data. If the secondary
also dies, then we do lose data (and we'll know it!), but it might be an
acceptable risk. If the secondary isn't permanently damaged, then we
might even be able to get the data back after some down time. Ok, so
that's one way to reconfigure the database servers on a failure. If the
secondary fails instead, then we can do similarly and remove it from the
"cluster" (or in other words, disable synchronous replication to the
secondary). Again, we don't lose any data by doing this. We're taking a
certain risk, however. We can't safely do a failover to the secondary
anymore... So if the primary fails now, then the only way not to lose
data is to hope that we can get it back from the failed machine (the
failure may be temporary).

There's also the third possibility, of course, that the two servers are
both up and running, but they can't communicate over the network at the
moment (this is, by the way, a difference from RAID, I guess). What do
we do then? Well, we still need at most one primary database server.
We'll have to (somehow, which doesn't matter as much) decide which
database to keep and consider the other one "down". Then we can just do
as above (with all the same implications!). Is it always a good idea to
keep the primary? No! What if you (as a stupid example) pull the network
cable from the primary (or maybe turn off a switch so that it's isolated
from most of the network)? In that case you probably want the secondary
to take over instead. At least if you value service availability. At
this point we can still do a safe failover too.

My point here is that if HA is important to you, then you may very well
want to disable synchronous replication on a failure to avoid down time,
but this has to be integrated with your overall failover / cluster
management solution. Just having the primary automatically disable
synchronous replication doesn't seem overly useful to me... If you're
using synchronous replication to begin with, you probably want to *know*
if you may have lost data or not. Otherwise, you will have to assume
that you did and then you could frankly have been running async
replication all along. If you do integrate it with your failover
solution, then you can keep track of when it's safe to do a failover and
when it's not, however, and decide how to handle each case.

How you decide what to do with the servers on failures isn't that
important here, really. You can probably run e.g. Pacemaker on 3+
machines and have it check for quorums to accomplish this. That's a good
approach at least. You can still have only 2 database servers (for cost
reasons), if you want. PostgreSQL could have all this built-in, but I
don't think it sounds overly useful to only be able to disable
synchronous replication on the primary after a timeout. Then you can
never safely do a failover to the secondary, because you can't be sure
synchronous replication was active on the failed primary...

Regards,
Hampus


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Hampus Wessman <hampus(at)hampuswessman(dot)se>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 14:52:43
Message-ID: 20120713145243.GB15443@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote:
> How you decide what to do with the servers on failures isn't that
> important here, really. You can probably run e.g. Pacemaker on 3+
> machines and have it check for quorums to accomplish this. That's a
> good approach at least. You can still have only 2 database servers
> (for cost reasons), if you want. PostgreSQL could have all this
> built-in, but I don't think it sounds overly useful to only be able
> to disable synchronous replication on the primary after a timeout.
> Then you can never safely do a failover to the secondary, because
> you can't be sure synchronous replication was active on the failed
> primary...

So how about this for a Postgres TODO:

Add configuration variable to allow Postgres to disable synchronous
replication after a specified timeout, and add variable to alert
administrators of the change.
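
In postgresql.conf terms, perhaps something like this (names are
strawmen, not a design):

    synchronous_standby_timeout = 30s         # 0 = wait forever, as today
    synchronous_standby_degraded_warning = on # nag until resynced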

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, sthomas(at)optionshouse(dot)com, Daniel Farina <daniel(at)heroku(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 23:38:19
Message-ID: CAETJ_S-h8t5_S_1WYE0EORvDTKgo-8qYp2Gd+R7onSGVdtgO_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 13, 2012 at 12:25 AM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
>
>> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org]
>> On Behalf Of Jose Ildefonso Camargo Tolosa
>>>On Thu, Jul 12, 2012 at 9:28 AM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
>> On Thu, Jul 12, 2012 at 9:21 AM, Shaun Thomas <sthomas(at)optionshouse(dot)com>
> wrote:
>>
>
>> As it currently is, freezing the master because the standby dies is
>> not good for all cases (and I dare say: not for most cases), and
>> having to wait for Pacemaker or other monitoring to notice that, change
>> the master config, and reload... will cause a service disruption! (for
>> several seconds, usually ~30 seconds).
>
> Yes, it is true that it can cause service disruption, but the same will
> be true even if the master detects that internally via a timeout.
> By keeping this external, the current behavior of PostgreSQL can be
> maintained: if there is no standby in sync mode, it will wait, and the
> purpose is still served, because the message to the master can be sent
> externally.
>

How does PostgreSQL currently detect that its main synchronous
standby went away and switch to another synchronous standby from the
synchronous_standby_names config parameter?

The same logic could be applied to "no more synchronous standbys: go
into standalone" (optionally).

--
Ildefonso Camargo
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Hampus Wessman <hampus(at)hampuswessman(dot)se>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-14 00:09:29
Message-ID: CAETJ_S8FbJNRgcZMoomQvzyrYxEquhF91ghpkQQyoEZyhUp99w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Hampus,

On Fri, Jul 13, 2012 at 2:42 AM, Hampus Wessman <hampus(at)hampuswessman(dot)se> wrote:
> Hi all,
>
> Here are some (slightly too long) thoughts about this.

Nah, not that long.

>
> Shaun Thomas wrote 2012-07-12 22:40:
>
>> On 07/12/2012 12:02 PM, Bruce Momjian wrote:
>>
>>> Well, the problem also exists if we add it as an internal database
>>> feature --- how long do we wait to consider the standby dead, how do
>>> we inform administrators, etc.
>>
>>
>> True. Though if there is no secondary connected, either because it's not
>> there yet, or because it disconnected, that's an easy check. It's the
>> network lag/stall detection that's tricky.
>
>
> It is indeed tricky to detect this. If you don't get an (immediate) reply
> from the secondary (and you never do!), then all you can do is wait and
> *eventually* (after how long? 250ms? 10s?) assume that there is no
> connection between them. The conclusion may very well be wrong sometimes. A
> second problem is that we still don't know if this is caused by some kind of
> network problems or if it's caused by the secondary not running. It's
> perfectly possible that both servers are working, but just can't communicate
> at the moment.

How about: the same logic it currently uses to detect that the
"designated" synchronous standby is no longer there and to move on to
the next one in synchronous_standby_names?

The rule to *know* that a standby went away is already there.

>
> The thing is that what we do next (at least if our data is important and why
> otherwise use synchronous replication of any kind...) depends on what *did*
> happen. Assume that we have two database servers. At any time we need at
> most one primary database to be running. Without that requirement our data
> can get messed up completely... If HA is important to us, we may choose to

Not necessarily, but true: that's why you usually kill the (failing?)
node on promotion of the standby, just in case.

> do a failover to the secondary (and live without replication for the moment)
> if the primary fails. With synchronous replication, we can do this without
> losing any data. If the secondary also dies, then we do lose data (and we'll
> know it!), but it might be an acceptable risk. If the secondary isn't
> permanently damaged, then we might even be able to get the data back after
> some down time. Ok, so that's one way to reconfigure the database servers on
> a failure. If the secondary fails instead, then we can do similarly and
> remove it from the "cluster" (or in other words, disable synchronous
> replication to the secondary). Again, we don't lose any data by doing this.

Right, but you have to monitor the standby too! I.e., more work on the
Pacemaker side... and non-trivial work. For example, just blowing
away the standby won't do any good here, whereas for the master you can
just power it off, promote the standby, and be done with it! If the
standby fails, you have to modify the master's config and reload it
there... more code: more chances of failure.

> We're taking a certain risk, however. We can't safely do a failover to the
> secondary anymore... So if the primary fails now, then the only way not to
> lose data is to hope that we can get it back from the failed machine (the
> failure may be temporary).
>
> There's also the third possibility, of course, that the two servers are both
> up and running, but they can't communicate over the network at the moment
> (this is, by the way, a difference from RAID, I guess). What do we do then?

Kill the "failing" node, just in case, in this case, without the
"extra" work of monitoring standby, you would just make the standby
kill the master before promoting the standby.

> Well, we still need at most one primary database server. We'll have to
> (somehow, which doesn't matter as much) decide which database to keep and
> consider the other one "down". Then we can just do as above (with all the

This is arbitrary; we usually just assume the master to be failing
when the standby is healthy (from the standby's point of view).

> same implications!). Is it always a good idea to keep the primary? No! What
> if you (as a stupid example) pull the network cable from the primary (or
> maybe turn off a switch so that it's isolated from most of the network)? In

That means that you failed to have redundant connectivity to the
standby (that is a must on clusters), yes, redundant switches too: with
"smart switches" in the <US$100 range now, there is not much excuse for
not having 2 switches connecting your cluster (but, if you have just 2
nodes, you just need 2 network interfaces, and 2 network cables).

> that case you probably want the secondary to take over instead. At least if
> you value service availability. At this point we can still do a safe
> failover too.
>
> My point here is that if HA is important to you, then you may very well want
> to disable synchronous replication on a failure to avoid down time, but this
> has to be integrated with your overall failover / cluster management
> solution. Just having the primary automatically disable synchronous

That's not a trivial matter: you have to monitor the standby and make
changes to the master configuration.

> replication doesn't seem overly useful to me... If you're using synchronous
> replication to begin with, you probably want to *know* if you may have lost
> data or not. Otherwise, you will have to assume that you did and then you

Right, and you would know: when the standby node (or service) goes
down, the monitoring system can inform you... but it doesn't have to
change the master's config.

> could frankly have been running async replication all along. If you do

No, you can't, because during the 99.9% of the time when the standby is
healthy and connected, you are at risk of losing transactions if you
run async replication.

> integrate it with your failover solution, then you can keep track of when
> it's safe to do a failover and when it's not, however, and decide how to
> handle each case.

Of course you can, but it is more complex, and likely slower. For
example, if the master detects that the standby disconnected (the TCP
connection was closed), it can just fall back to async while the
standby is away, go through the "catch-up" process when it comes back,
and then return to sync. The monitor will likely take, at the very
least, 1 second (up to 30 seconds on most configurations) to notice,
make the change, and then reload the master's config.

See, the main problem here is that, with the current PostgreSQL
behavior, you have doubled the chances of service disruption: if the
master fails, there is the time the cluster takes to note it and bring
the standby up (and likely kill the master), AND if the standby fails,
there is the time the cluster takes to note it, change configs on the
master, and reload.

>
> How you decide what to do with the servers on failures isn't that important
> here, really. You can probably run e.g. Pacemaker on 3+ machines and have it
> check for quorums to accomplish this. That's a good approach at least. You
> can still have only 2 database servers (for cost reasons), if you want.
> PostgreSQL could have all this built-in, but I don't think it sounds overly
> useful to only be able to disable synchronous replication on the primary
> after a timeout. Then you can never safely do a failover to the secondary,
> because you can't be sure synchronous replication was active on the failed
> primary...

Or have a mixed cluster of application servers and DB servers, and
have them support each other for quorum.

And no, not after a timeout: immediately if the TCP socket is closed,
or otherwise with the same logic it uses to "switch" to another sync
standby.

--
Ildefonso Camargo
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hampus Wessman <hampus(at)hampuswessman(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-14 00:38:59
Message-ID: CAETJ_S_6oXxpu9RBkD_8o2TwuHajfcveat90k1DXbhrdgwVXug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 13, 2012 at 10:22 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Fri, Jul 13, 2012 at 09:12:56AM +0200, Hampus Wessman wrote:
>> How you decide what to do with the servers on failures isn't that
>> important here, really. You can probably run e.g. Pacemaker on 3+
>> machines and have it check for quorums to accomplish this. That's a
>> good approach at least. You can still have only 2 database servers
>> (for cost reasons), if you want. PostgreSQL could have all this
>> built-in, but I don't think it sounds overly useful to only be able
>> to disable synchronous replication on the primary after a timeout.
>> Then you can never safely do a failover to the secondary, because
>> you can't be sure synchronous replication was active on the failed
>> primary...
>
> So how about this for a Postgres TODO:
>
> Add configuration variable to allow Postgres to disable synchronous
> replication after a specified timeout, and add variable to alert
> administrators of the change.

I agree we need a TODO for this, but... I think timeout-only is not
the best choice. There should be a maximum timeout (as a last resort:
the maximum time we are willing to wait for the standby, which has to
have the option of "forever"), but PostgreSQL certainly has to detect
the *complete* disconnection of the standby (or of all the standbys in
synchronous_standby_names). If it detects that no standbys are
eligible as sync standby AND the option to fall back to async is
enabled, it will go into standalone mode (as if
synchronous_standby_names were empty); otherwise (if the option is
disabled) it will just continue to wait forever (the "last resort"
timeout is ignored if the fallback option is disabled)... I would
call this "soft_synchronous_standby" and
"soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane
value would be ~5 seconds), or something like that (I'm quite bad at
picking names :( ).
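
To make the proposal concrete, below is a minimal, runnable sketch (in
Python) of the decision this would add to the master's commit-wait
loop. The GUC names are the hypothetical ones proposed above, and the
whole thing illustrates the intended semantics, not PostgreSQL
internals:

    import time
    from collections import namedtuple

    Standby = namedtuple("Standby", "name connected sync_eligible")

    # Hypothetical GUCs from the proposal above.
    soft_synchronous_standby = True          # allow fallback to async
    soft_synchronous_standby_timeout = 5.0   # seconds; 0 = forever

    def wait_mode(standbys, acked, start):
        """Loop as the master would while waiting for a sync commit
        ack; return "standalone" if we are allowed to stop waiting."""
        while not acked():
            eligible = [s for s in standbys
                        if s.connected and s.sync_eligible]
            if not eligible and soft_synchronous_standby:
                # Complete disconnection of every listed standby: act
                # as if synchronous_standby_names were empty.
                return "standalone"
            waited = time.monotonic() - start
            if (soft_synchronous_standby
                    and soft_synchronous_standby_timeout > 0
                    and waited >= soft_synchronous_standby_timeout):
                # Last-resort timeout: standby still connected, but the
                # ack never arrives.  Ignored if fallback is disabled.
                return "standalone"
            time.sleep(0.1)   # the real server would wait on a latch
        return "synchronous"

    # The only sync standby has dropped its connection: fall back now.
    print(wait_mode([Standby("s1", False, True)],
                    lambda: False, time.monotonic()))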

--
Ildefonso Camargo
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hampus Wessman <hampus(at)hampuswessman(dot)se>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-14 03:42:42
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C382851D989@szxeml509-mbs
Lists: pgsql-hackers

From: pgsql-hackers-owner(at)postgresql(dot)org [pgsql-hackers-owner(at)postgresql(dot)org] on behalf of Jose Ildefonso Camargo Tolosa [ildefonso(dot)camargo(at)gmail(dot)com]
Sent: Saturday, July 14, 2012 6:08 AM
> I agree we need a TODO for this, but... I think timeout-only is not
> the best choice. [...] If it detects that no standbys are eligible as
> sync standby AND the option to fall back to async is enabled, it will
> go into standalone mode (as if synchronous_standby_names were empty);
> otherwise (if the option is disabled) it will just continue to wait
> forever. [...]

After it has gone to standalone mode, if the standby comes back, will
it be able to return to sync mode with it? If not, then won't it break
the current behavior? Currently, I think, in freeze mode (the master
blocked waiting), sync replication can start again once the standby
comes back.

With Regards,
Amit Kapila.


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Hampus Wessman <hampus(at)hampuswessman(dot)se>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-14 04:06:38
Message-ID: CAETJ_S80_2qON-+Q_FxuE6Tn58wd7rF+vth_U7Cth+Q4aQABXg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 13, 2012 at 11:12 PM, Amit kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
> After it has gone to standalone mode, if the standby comes back, will
> it be able to return to sync mode with it?

That's the idea, yes: after the standby comes back, the master would
act as if the sync standby had connected for the first time, first
going through the "catchup" mode, and then, "once the lag between
standby and primary reaches zero (...) we move to real-time streaming
state" (from the 9.1 docs); at that point, normal sync behavior is
restored.

--
Ildefonso Camargo
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Hampus Wessman <hampus(at)hampuswessman(dot)se>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-14 05:12:08
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C382851E013@szxeml509-mbs
Lists: pgsql-hackers

> From: Jose Ildefonso Camargo Tolosa [ildefonso(dot)camargo(at)gmail(dot)com]
> Sent: Saturday, July 14, 2012 9:36 AM
>> After it has gone to standalone mode, if the standby comes back, will
>> it be able to return to sync mode with it?

> That's the idea, yes: after the standby comes back, the master would
> act as if the sync standby had connected for the first time, first
> going through the "catchup" mode, and then, "once the lag between
> standby and primary reaches zero (...) we move to real-time streaming
> state" (from the 9.1 docs); at that point, normal sync behavior is
> restored.

Idea-wise it looks okay, but are you sure that, with the current
code/design, it can handle things the way you are suggesting? I am not
sure it can work, because it might be that, due to network instability,
the master has gone into standalone mode, and now that the standby is
able to communicate again, it might expect to get more data rather than
going into catchup mode. I hope someone who is an expert in this area
of the code can comment here to make it more concrete.

With Regards,
Amit Kapila.


From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Hampus Wessman <hampus(at)hampuswessman(dot)se>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-14 14:12:09
Message-ID: CAETJ_S_GReZ05SCy=dzAGN5+KAQ5gGmS5q-v2D7fU0_PkGJmtg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 14, 2012 at 12:42 AM, Amit kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
> Idea-wise it looks okay, but are you sure that, with the current
> code/design, it can handle things the way you are suggesting? I am not
> sure it can work, because it might be that, due to network instability,
> the master has gone into standalone mode, and now that the standby is
> able to communicate again, it might expect to get more data rather than
> going into catchup mode. I hope someone who is an expert in this area
> of the code can comment here to make it more concrete.

Well, I'd need to dive into the code, but as far as I know, it is the
master that decides to be in "catchup" mode, and the standby just
takes care of sending feedback to the master. Also, the standby
already has to handle this situation: currently, if the master goes
away, whether because it crashed or because of network issues, the
standby doesn't really know why; it will just reconnect and do
whatever it needs to do to get in sync with the master again (be it
retrying the connection several times while the master restarts, or
simply reconnecting to a waiting master and requesting the pending WAL
segments). There has to be code in place to handle these cases,
because this already works. I'm trying to get to a solution that is
as non-intrusive as possible, adding the least amount of code, so that
performance doesn't suffer, by reusing the current logic and actions
with small alterations.


--
Ildefonso Camargo
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-14 23:54:22
Message-ID: 500206AE.1090002@agliodbs.com
Lists: pgsql-hackers

So, here's the core issue with degraded mode. I'm not mentioning this
to block any patch anyone has, but rather out of a desire to see someone
address this core problem with some clever idea I've not thought of.
The problem, in a nutshell, is indeterminacy.

Assume someone implements degraded mode. Then:

1. Master has one synchronous standby, Standby1, and two asynchronous,
Standby2 and Standby3.

2. Standby1 develops a NIC problem and is in and out of contact with
Master. As a result, it's flipping in and out of synchronous / degraded
mode.

3. Master fails catastrophically due to a RAID card meltdown. All data
lost.

At this point, the DBA is in kind of a pickle, because he doesn't know:

(a) Was Standby1 in synchronous or degraded mode when Master died? The
only log for that was on Master, which is now gone.

(b) Is Standby1 actually the most caught up standby, and thus the
appropriate new master for Standby2 and Standby3, or is it behind?

With the current functionality of Synchronous Replication, you don't
have either piece of indeterminacy, because some external management
process (hopefully located on another server) needs to disable
synchronous replication when Standby1 develops its problem. That is, if
the master is accepting synchronous transactions at all, you know that
Standby1 is up-to-date, and no data is lost.

While you can answer (b) by checking all servers, (a) is particularly
pernicious, because unless you have the application log all "operating
in degraded mode" messages, there is no way to ever determine the truth.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-16 19:01:10
Message-ID: CA+TgmoajZcudU2nqfoTvAbt-H8TURqqBMt_enBx5nu5AzE3wVA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jul 14, 2012 at 7:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> So, here's the core issue with degraded mode. [...] The problem, in a
> nutshell, is indeterminacy.
>
> [...]
>
> While you can answer (b) by checking all servers, (a) is particularly
> pernicious, because unless you have the application log all "operating
> in degraded mode" messages, there is no way to ever determine the truth.

Good explanation.

In brief, the problem here is that you can only rely on the
no-transaction-loss guarantee provided by synchronous replication if
you can be certain that you'll always be aware of it when synchronous
replication gets shut off. Right now that is trivially true, because
it has to be shut off manually. If we provide a facility that logs a
message and then shuts it off, we lose that certainty, because the log
message could get eaten en route by the same calamity that takes down
the master. There is no way for the master to WAIT for the log
message to be delivered and only then degrade.

However, we could craft a mechanism that has this effect. Suppose we
create a new GUC with a name like
synchronous_replication_status_change_command. If we're thinking
about switching between synchronous replication and degraded mode
automatically, we first run this command. If it returns 0, then we're
allowed to switch, but if it returns anything else, then we're not
allowed to switch (but can retry the command after a suitable
interval). The user is responsible for supplying a command that
records the status change somewhere off-box in a fashion that's
sufficiently durable that the user has confidence that the
notification won't subsequently be lost. For example, the
user-supplied command could SSH into three machines located in
geographically disparate data centers and create a file with a certain
name on each one, returning 0 only if it's able to reach at least two
of them and create the file on all the ones it can reach. If the
master dies, but at least two of those three machines are still
alive, we can determine with confidence whether the master might have
been in degraded mode at the time of the crash.
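
To illustrate, here is a minimal sketch (in Python) of the kind of
user-supplied command described above; the witness host names and the
marker path are made up, and the GUC that would invoke it is, of
course, only a proposal:

    #!/usr/bin/env python
    import subprocess
    import sys

    WITNESSES = ["witness1.example.com", "witness2.example.com",
                 "witness3.example.com"]
    MARKER = "/var/lib/pg-syncrep-state"   # hypothetical state file

    def main():
        state = sys.argv[1] if len(sys.argv) > 1 else "degraded"
        reached = 0
        for host in WITNESSES:
            # Record the new state on each witness; a short timeout
            # keeps a dead witness from stalling the master for long.
            cmd = ["ssh", "-o", "ConnectTimeout=2", host,
                   "echo %s > %s" % (state, MARKER)]
            if subprocess.call(cmd) == 0:
                reached += 1
        # Exit 0 (permit the mode switch) only on a majority.
        sys.exit(0 if reached >= 2 else 1)

    if __name__ == "__main__":
        main()

After a master loss, any two surviving witnesses are then enough to
reconstruct whether the master could have been running degraded when
it died.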

More or less paranoid versions of this scheme are possible depending
on user preferences, but the key point is that for the
no-transaction-loss guarantee to be of any use, there has to be a way
to reliably know whether that guarantee was in effect at the time the
master died in a fire. Logging isn't enough, but I think some more
sophisticated mechanism can get us there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-17 05:58:57
Message-ID: 5004FF21.1020902@enterprisedb.com
Lists: pgsql-hackers

On 16.07.2012 22:01, Robert Haas wrote:
> In brief, the problem here is that you can only rely on the
> no-transaction-loss guarantee provided by synchronous replication if
> you can be certain that you'll always be aware of it when synchronous
> replication gets shut off. [...]
>
> However, we could craft a mechanism that has this effect. Suppose we
> create a new GUC with a name like
> synchronous_replication_status_change_command. [...]
>
> More or less paranoid versions of this scheme are possible depending
> on user preferences, but the key point is that for the
> no-transaction-loss guarantee to be of any use, there has to be a way
> to reliably know whether that guarantee was in effect at the time the
> master died in a fire. Logging isn't enough, but I think some more
> sophisticated mechanism can get us there.

Yeah, I think that's the right general approach. Not necessarily that
exact GUC, but something like that. I don't want PostgreSQL to get more
involved in determining the state of the standby, when to do failover,
or when to fall back to degraded mode. That's a whole new territory with
all kinds of problems, and there is plenty of software out there to
handle that. Usually you have some external software to do monitoring
and to initiate failovers anyway. What we need is a better API for
co-operating with such software, to perform failover, and to switch
replication between synchronous and asynchronous modes.

BTW, one little detail that I don't think has been mentioned in this
thread before: Even though the master currently knows whether a standby
is connected or not, and you could write a patch to act based on that,
there are other failure scenarios where you would still not be happy.
For example, imagine that the standby has a disk failure. It stays
connected to the master, but fails to fsync anything to disk. Would you
want to fall back to degraded mode and just do asynchronous replication
in that case? How do you decide when to do that in the master? Or what
if the standby keeps making progress, but becomes incredibly slow for
some reason, like disk failure in a RAID array? I'd rather outsource all
that logic to external monitoring software - software that you should be
running anyway.
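
For illustration, here is a sketch (in Python) of the kind of progress
check such external software might run against the master, using the
pg_stat_replication view; the connection string and standby name are
placeholders, and what the monitor does with a bad answer (editing
synchronous_standby_names and reloading) is deliberately left out as
site-specific:

    import time
    import psycopg2   # any client library would do

    def standby_making_progress(conninfo, standby_name, window=10):
        """Sample the standby's flush position twice, 'window' seconds
        apart, and report whether it advanced."""
        conn = psycopg2.connect(conninfo)
        cur = conn.cursor()

        def flush_pos():
            cur.execute("SELECT flush_location FROM pg_stat_replication"
                        " WHERE application_name = %s", (standby_name,))
            row = cur.fetchone()
            return row[0] if row else None   # None: not even connected

        first = flush_pos()
        time.sleep(window)
        second = flush_pos()
        conn.close()
        # Disconnected, or connected but frozen (dying disk, crawling
        # RAID): both count as unhealthy.  Note that on an idle master
        # you would also have to check there was anything to replicate.
        return (first is not None and second is not None
                and second != first)

    if __name__ == "__main__":
        if not standby_making_progress("host=master dbname=postgres",
                                       "standby1"):
            print("standby stalled or gone: decide about sync rep now")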

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-17 06:45:43
Message-ID: CAAZKuFZLReW_XdMs=TgvgcqW=xtjAL7WwB+NfkMdDEki-oBZ-Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 16, 2012 at 10:58 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> BTW, one little detail that I don't think has been mentioned in this thread
> before: Even though the master currently knows whether a standby is
> connected or not, and you could write a patch to act based on that, there
> are other failure scenarios where you would still not be happy. For example,
> imagine that the standby has a disk failure. It stays connected to the
> master, but fails to fsync anything to disk. Would you want to fall back to
> degraded mode and just do asynchronous replication in that case? How do you
> decide when to do that in the master? Or what if the standby keeps making
> progress, but becomes incredibly slow for some reason, like disk failure in
> a RAID array? I'd rather outsource all that logic to external monitoring
> software - software that you should be running anyway.

I would like to express some support for the non-edge nature of this
case. Outside of simple loss of availability of a server, losing
access to a block device is probably the second-most-common cause of
loss of availability for me. It's especially insidious because simple
"select 1" checks may continue to return for quite some time, so
instead we rely on Linux diskstats parsing to see if write progress
hits zero for "a while."
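
For what it's worth, a minimal sketch (in Python) of that diskstats
check; the device name and sampling window are just examples:

    import time

    def sectors_written(device):
        """Field 10 of a /proc/diskstats line: total sectors written."""
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if parts[2] == device:
                    return int(parts[9])
        raise LookupError("device %s not found" % device)

    def writes_stalled(device, interval=30):
        """True if no sectors at all were written in the interval."""
        before = sectors_written(device)
        time.sleep(interval)
        return sectors_written(device) == before

    if __name__ == "__main__":
        if writes_stalled("sda"):
            print("no write progress on sda -- block device may be gone")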

In cases like these, the overhead of a shell command to rapidly
confer with a decision-making process can be prohibitive -- shell-outs
are already a pretty big waster of time for me in WAL
archiving/dearchiving, where process startup, SSL negotiation, and
lack of parallelization can be pretty slow. A status-change command
may exhibit the same problem.

I would like to plead that whatever is done be controllable entirely
via non-GUC mechanisms -- arguably that is already the case, since one
can write a replication-protocol client to do the job by faking the
standby status update messages, but perhaps there is a more lucid way
if one makes accommodation for it. In particular, the awkwardness of
using pg_receivexlog[0] or a similar tool as a replacement for
archive_command is something that I feel should be addressed
eventually, so as not to be a second-class citizen. Although that is
already being worked on[1]... the archive command has no backpressure
either, other than "out of disk".

The case of restore_command is even more sore: remastering or
archive-recovery via streaming-protocol actions is kind of a pain at
the moment. I haven't thoroughly explored this yet and I don't think
it is documented, but it can be hard for something that is dearchiving
from WAL segments stored somewhere to find exactly the right record to
start replaying at: the WAL record format is not stable, and it need
not be, provided the server helps by ignoring records that predate
what it requires, or can inform the process feeding WAL that it got
things wrong. Maybe that is the case, but it is not documented. I
also don't think there are any guarantees around the maximum size or
alignment of WAL shipped by the streaming protocol in XLogData
messages, and that's too bad. Also, the endianness of the WAL
position fields in XLogData is host-byte-order-dependent, which sucks
if you are forwarding WAL around but need to know what range is
contained in a message. In practice many people can say "all I have
is little-endian," but it is somewhat unpleasant and not necessarily
the case.

Correct me if I'm wrong, I'd be glad for it.

[0]: see the notes section,
http://www.postgresql.org/docs/devel/static/app-pgreceivexlog.html
[1]: http://archives.postgresql.org/pgsql-hackers/2012-06/msg00348.php

--
fdr


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
Cc: Hampus Wessman <hampus(at)hampuswessman(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-25 21:43:23
Message-ID: 20120725214323.GC21271@momjian.us
Lists: pgsql-hackers

On Fri, Jul 13, 2012 at 08:08:59PM -0430, Jose Ildefonso Camargo Tolosa wrote:
> I agree we need a TODO for this, but... I think timeout-only is not
> the best choice. There should be a maximum timeout (as a last resort:
> the maximum time we are willing to wait for the standby, which has to
> have the option of "forever"), but PostgreSQL certainly has to detect
> the *complete* disconnection of the standby (or of all the standbys in
> synchronous_standby_names). If it detects that no standbys are
> eligible as sync standby AND the option to fall back to async is
> enabled, it will go into standalone mode (as if
> synchronous_standby_names were empty); otherwise (if the option is
> disabled) it will just continue to wait forever (the "last resort"
> timeout is ignored if the fallback option is disabled)... I would
> call this "soft_synchronous_standby" and
> "soft_synchronous_standby_timeout" (in seconds, 0=forever; a sane
> value would be ~5 seconds), or something like that (I'm quite bad at
> picking names :( ).

TODO added:

Allow synchronous_standby_names to be disabled after communication
failure with all synchronous standby servers exceeds some timeout

This also requires successful execution of a synchronous
notification command.


http://archives.postgresql.org/pgsql-hackers/2012-07/msg00409.php

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +