From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Issues with Quorum Commit
Date: 2010-10-05 19:11:20
Message-ID: 4CAB7858.70303@agliodbs.com
Lists: pgsql-hackers

All,

There's been a lot of discussion on synch rep lately which involves
quorum commit. I need to raise some major design issues with quorum
commit which I don't think that people have really considered, and may
be sufficient to prevent it from being included in 9.1.

A. Permanent Synchronization Failure
---------------------------------
Quorum commit, like other forms of more-than-one-standby synch rep,
offers the possibility that one or more standbys could end up
irretrievably desynchronized with the master.

1. Quorum is 3 servers (out of 5) with mode "apply"
2. Standbys 2 and 4 receive and apply transaction # 20001.
3. Due to a network issue, no other standby applies #20001.
4. Accordingly, the master rolls back #20001 and cancels, either due to
timeout or DBA cancel.
5. #2 and #4 are now hopelessly out of synch with the master (see the
sketch below).
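To make step 5 concrete, here is a minimal Python sketch (purely
illustrative, not from any patch; the standby numbering and quorum size
are taken from the scenario above):

# Illustrative sketch of scenario A: quorum = 3 of 5, mode "apply".
QUORUM = 3
standbys = {1, 2, 3, 4, 5}

# Only standbys 2 and 4 manage to receive and apply transaction #20001.
acks_for_20001 = {2, 4}

if len(acks_for_20001) >= QUORUM:
    print("commit #20001 confirmed to client")
else:
    # The master is stuck: it cannot confirm the commit, yet standbys
    # 2 and 4 have already applied WAL the other nodes never saw.
    print("quorum not reached (%d of %d acks); standbys %s are ahead"
          % (len(acks_for_20001), QUORUM, sorted(acks_for_20001)))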

B. Eventual Inconsistency
-------------------------
If we have a quorum commit, it's possible for any individual standby to
be indefinitely ahead of any standby which is not needed by the quorum.
This means that:

-- There is no clear criterion for when a standby which is not needed for
quorum should be considered no longer a synch standby, and
-- Applications cannot make assumptions that synch rep promises some
specific window of synchronicity, eliminating a lot of the value of
quorum commit.

C. Performance
--------------
Doing quorum commit requires significant extra accounting on the
master's part: it must keep track of how many standbys committed for
each pending transaction (and remember there may be many at the same
time).

Doing so could involve significant response-time overhead added to the
simple case where there is only one standby, as well as memory usage,
and likely a lot of troubleshooting of the mechanism from us.
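To make that bookkeeping concrete, here is a rough Python sketch of the
per-transaction ack tracking this implies (purely illustrative; none of
these names come from either patch):

# Hypothetical sketch of the extra accounting quorum commit needs on the
# master: one entry per pending (not yet quorum-acked) commit.
from collections import defaultdict

QUORUM = 3

# commit LSN -> set of standby ids that have acknowledged it
pending_acks = defaultdict(set)
# commit LSN -> backend waiting on it (represented here by a name)
waiting_backends = {}

def register_commit(lsn, backend):
    pending_acks[lsn]            # touching the key creates the entry
    waiting_backends[lsn] = backend

def standby_ack(lsn, standby_id):
    """Called when a standby reports it has received/applied up to lsn."""
    pending_acks[lsn].add(standby_id)
    if len(pending_acks[lsn]) >= QUORUM:
        backend = waiting_backends.pop(lsn)
        del pending_acks[lsn]
        print("release backend %s: commit %s has quorum" % (backend, lsn))

# Many commits can be in flight at the same time:
register_commit("0/1000", "backend-1")
register_commit("0/1010", "backend-2")
for sb in (2, 4, 5):
    standby_ack("0/1000", sb)    # backend-1 is released on the third ack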

D. Adding/Replacing Quorum Members
----------------------------------
For Quorum commit to be really valuable, we need to be able to add new
quorum members and remove dead ones *without stopping the master*. Per
discussion about the startup issues with only one master, we have not
worked out how to do this for synch rep standbys. It's reasonable to
assume that this will be more complex for a quorum group than with a
single synch standby.

Consider the case, for example, where due to a network outage we have
dropped below quorum. What is the strategy for getting the system
running again by adding standbys?

All of the problems above are resolvable. Some of the CAP databases
have probably resolved them, as well as some older telecom databases.
However, all of them will require significant work, and even more
significant debugging, from the project.

I would like to see Quorum Commit, in part because I think it would help
push PostgreSQL further into cloud frameworks. However, I'm worried
that if we make quorum commit a requirement of synch rep, we will not
have synch rep in 9.1.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 19:32:34
Message-ID: 4CAB7D52.8000402@enterprisedb.com
Lists: pgsql-hackers

On 05.10.2010 22:11, Josh Berkus wrote:
> There's been a lot of discussion on synch rep lately which involves
> quorum commit. I need to raise some major design issues with quorum
> commit which I don't think that people have really considered, and may
> be sufficient to prevent it from being included in 9.1.

Thanks for bringing these up.

> A. Permanent Synchronization Failure
> ---------------------------------
> Quorum commit, like other forms of more-than-one-standby synch rep,
> offers the possibility that one or more standbys could end up
> irretrievably desyncronized with the master.
>
> 1. Quorum is 3 servers (out of 5) with mode "apply"
> 2. Standbys 2 and 4 receive and apply transaction # 20001.
> 3. Due to a network issue, no other standby applies #20001.
> 4. Accordingly, the master rolls back #20001 and cancels, either due to
> timeout or DBA cancel.

The master cannot roll back or cancel the transaction. That's
completely infeasible; the WAL record has been written to local disk
already. The best it can do is halt and wait for enough standbys to
appear to fulfill the quorum. The client will hang waiting for the
COMMIT to finish, and the transaction will appear as in-progress to
other transactions.
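A minimal sketch of that behaviour (illustrative Python only, not code
from any patch; the acknowledgement queue and function name are
assumptions): the backend flushes the commit record locally and then
simply blocks until enough standbys have acknowledged, so the client
sees COMMIT hang rather than fail.

# Sketch: once the commit record is on local disk there is no way back;
# the only option is to block until a quorum of standbys acknowledges.
import queue

def wait_for_quorum(commit_lsn, ack_queue, quorum):
    """Block until `quorum` distinct standbys have acked `commit_lsn`.
    ack_queue is a queue.Queue of (standby_id, flushed_lsn) messages."""
    acked = set()
    while len(acked) < quorum:
        # No timeout here: the client's COMMIT simply hangs until the
        # quorum appears, and other sessions keep seeing the transaction
        # as in-progress in the meantime.
        standby_id, flushed_lsn = ack_queue.get()
        if flushed_lsn >= commit_lsn:
            acked.add(standby_id)
    return True   # only now is success reported back to the client

# Example: three standbys eventually confirm LSN 200.
q = queue.Queue()
for msg in [("s1", 210), ("s3", 150), ("s3", 205), ("s5", 220)]:
    q.put(msg)
print(wait_for_quorum(200, q, quorum=3))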

There's a subtle point here that I don't think has been discussed yet: If
the master is forcibly restarted at that point, with pg_ctl restart -m
immediate, strictly speaking the master should start up in the same
state, with the unlucky transaction still appearing as in-progress,
until the standby acknowledges.

> 5. #2 and #5 are now hopelessly out of synch with the master.

> B. Eventual Inconsistency
> -------------------------
> If we have a quorum commit, it's possible for any individual standby to
> be indefinitely ahead of any standby which is not needed by the quorum.
> This means that:
>
> -- There is no clear criteria for when a standby which is not needed for
> quorum should be considered no longer a synch standby, and
> -- Applications cannot make assumptions that synch rep promises some
> specific window of synchronicity, eliminating a lot of the value of
> quorum commit.

Yep.

> C. Performance
> --------------
> Doing quorum commit requires significant extra accounting on the
> master's part: it must keep track of how many standbys committed for
> each pending transaction (and remember there may be many at the same
> time).
>
> Doing so could involve significant response-time overhead added to the
> simple case where there is only one standby, as well as memory usage,
> and likely a lot of troubleshooting of the mechanism from us.

My gut feeling is that overhead will pale to insignificance compared to
the network and other overheads of actually getting the WAL to the
standby and processing the acknowledgments.

> D. Adding/Replacing Quorum Members
> ----------------------------------
> For Quorum commit to be really valuable, we need to be able to add new
> quorum members and remove dead ones *without stopping the master*. Per
> discussion about the startup issues with only one master, we have not
> worked out how to do this for synch rep standbys. It's reasonable to
> assume that this will be more complex for a quorum group than with a
> single synch standby.
>
> Consider the case, for example, where due to a network outage we have
> dropped below quorum. What is the strategy for getting the system
> running again by adding standbys?

You start a new one from the latest base backup and let it catch up?
Possibly modifying the config file in the master to let it know about
the new standby, if we go down that path. This part doesn't seem
particularly hard to me.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 20:43:20
Message-ID: 4CAB8DE8.1000503@agliodbs.com
Lists: pgsql-hackers

Heikki,

> The master can not roll back or cancel the transaction. That's
> completely infeasible, the WAL record has been written to local disk
> already. The best it can do is halt and wait for enough standbys to
> appear to fulfill the quorum. The client will hang waiting for the
> COMMIT to finish, and the transaction will appear as in-progress to
> other transactions.

Ohhh. Good point. So there's no real point in a timeout setting for
quorum commit; it's always "wait forever".

So, this is a critical issue with "wait forever" even with one server.

> There's subtle point here that I don't think has been discussed yet: If
> the master is forcibly restarted at that point, with pg_ctl restart -m
> immediate, strictly speaking the master should start up in the same
> state, with the unlucky transaction still appearing as in-progress,
> until the standby acknowledges.

Yeah. That makes the ability to issue a command which says "drop all
synch rep and commit whatever's pending" critical.

However, this makes for, in some ways, a worse situation: if you fail to
achieve quorum on any commit, then you need to rebuild your entire
quorum pool from scratch.

> You start a new one from the latest base backup and let it catch up?
> Possibly modifying the config file in the master to let it know about
> the new standby, if we go down that path. This part doesn't seem
> particularly hard to me.

Yeah? How do you modify the config file and get the master to consider
the new server to be part of the quorum pool *without restarting the
master*?

Again, I'm just saying that merely doing single-server synch rep, *and*
making HS/SR easier to admin in general, is going to be a big task for
9.1. Quorum Commit needs to be considered a separate feature, and one
which is dispensable for 9.1.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 20:45:45
Message-ID: 1286311545.9356.15.camel@jdavis-ux.asterdata.local
Lists: pgsql-hackers

On Tue, 2010-10-05 at 12:11 -0700, Josh Berkus wrote:
> B. Eventual Inconsistency
> -------------------------
> If we have a quorum commit, it's possible for any individual standby to
> be indefinitely ahead of any standby which is not needed by the quorum.
> This means that:
>
> -- There is no clear criteria for when a standby which is not needed for
> quorum should be considered no longer a synch standby, and
> -- Applications cannot make assumptions that synch rep promises some
> specific window of synchronicity, eliminating a lot of the value of
> quorum commit.

Point B seems particularly dangerous.

When you lose one of the systems and the lagging server becomes required
for quorum, then all of a sudden you could be facing a huge delay to
commit the next transaction (because it needs to catch up on a lot of
WAL replay). This can happen even without a network problem at all, and
seems very likely to result in the lagging system being considered
"down" due to a timeout. Not good, because the reason it is required for
quorum is that another standby just went down.

In other words, a lagging standby combined with a timeout mechanism is
essentially useless, because it will never catch up in time to be a part
of the quorum.

Regards,
Jeff Davis


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 21:10:53
Message-ID: 1286313053.2025.2877.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-10-05 at 22:32 +0300, Heikki Linnakangas wrote:
> On 05.10.2010 22:11, Josh Berkus wrote:
> > There's been a lot of discussion on synch rep lately which involves
> > quorum commit. I need to raise some major design issues with quorum
> > commit which I don't think that people have really considered, and may
> > be sufficient to prevent it from being included in 9.1.
>
> Thanks for bringing these up.

Yes, I'm very happy to discuss these.

The points appear to be directed at "quorum commit", which is a name
I've used. But most of the points apply more to Fujii's patch than my
own. I can only presume that Josh wants to prevent us from adopting a
design that allows sync against multiple standbys.

> > A. Permanent Synchronization Failure
> > ---------------------------------
> > Quorum commit, like other forms of more-than-one-standby synch rep,
> > offers the possibility that one or more standbys could end up
> > irretrievably desyncronized with the master.
> >
> > 1. Quorum is 3 servers (out of 5) with mode "apply"
> > 2. Standbys 2 and 4 receive and apply transaction # 20001.
> > 3. Due to a network issue, no other standby applies #20001.
> > 4. Accordingly, the master rolls back #20001 and cancels, either due to
> > timeout or DBA cancel.
>
> The master can not roll back or cancel the transaction. That's
> completely infeasible, the WAL record has been written to local disk
> already. The best it can do is halt and wait for enough standbys to
> appear to fulfill the quorum. The client will hang waiting for the
> COMMIT to finish, and the transaction will appear as in-progress to
> other transactions.

Yes, that point has long been understood. Neither patch does this, and
in fact the issue is a completely general one.

> There's subtle point here that I don't think has been discussed yet: If
> the master is forcibly restarted at that point, with pg_ctl restart -m
> immediate, strictly speaking the master should start up in the same
> state, with the unlucky transaction still appearing as in-progress,
> until the standby acknowledges.

That is a very important point, but again, nothing to do with quorum
commit. For strict correctness, we should do that. Are you suggesting we
should do that here?

> > 5. #2 and #5 are now hopelessly out of synch with the master.
>
> > B. Eventual Inconsistency
> > -------------------------
> > If we have a quorum commit, it's possible for any individual standby to
> > be indefinitely ahead of any standby which is not needed by the quorum.
> > This means that:
> >
> > -- There is no clear criteria for when a standby which is not needed for
> > quorum should be considered no longer a synch standby, and
> > -- Applications cannot make assumptions that synch rep promises some
> > specific window of synchronicity, eliminating a lot of the value of
> > quorum commit.
>
> Yep.

Could the person that wrote that actually explain what a "specific
window of synchronicity" is? I'm not sure whether to agree, or disagree.

> > C. Performance
> > --------------
> > Doing quorum commit requires significant extra accounting on the
> > master's part: it must keep track of how many standbys committed for
> > each pending transaction (and remember there may be many at the same
> > time).
> >
> > Doing so could involve significant response-time overhead added to the
> > simple case where there is only one standby, as well as memory usage,
> > and likely a lot of troubleshooting of the mechanism from us.
>
> My gut feeling is that overhead will pale to insignificance compared to
> the network and other overheads of actually getting the WAL to the
> standby and processing the acknowledgments.

You're ignoring Josh's points. Those exact points have been made by me
in support of the design of my patch and against Fujii's. The mechanism
to do this will be more complex and more likely to break. And it will be
slower and that is a concern for me.

> > D. Adding/Replacing Quorum Members
> > ----------------------------------
> > For Quorum commit to be really valuable, we need to be able to add new
> > quorum members and remove dead ones *without stopping the master*. Per
> > discussion about the startup issues with only one master, we have not
> > worked out how to do this for synch rep standbys. It's reasonable to
> > assume that this will be more complex for a quorum group than with a
> > single synch standby.
> >
> > Consider the case, for example, where due to a network outage we have
> > dropped below quorum. What is the strategy for getting the system
> > running again by adding standbys?
>
> You start a new one from the latest base backup and let it catch up?
> Possibly modifying the config file in the master to let it know about
> the new standby, if we go down that path. This part doesn't seem
> particularly hard to me.

Agreed, not sure of the issue there.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 21:19:00
Message-ID: 1286313540.2025.2923.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-10-05 at 13:45 -0700, Jeff Davis wrote:
> On Tue, 2010-10-05 at 12:11 -0700, Josh Berkus wrote:
> > B. Eventual Inconsistency
> > -------------------------
> > If we have a quorum commit, it's possible for any individual standby to
> > be indefinitely ahead of any standby which is not needed by the quorum.
> > This means that:
> >
> > -- There is no clear criteria for when a standby which is not needed for
> > quorum should be considered no longer a synch standby, and
> > -- Applications cannot make assumptions that synch rep promises some
> > specific window of synchronicity, eliminating a lot of the value of
> > quorum commit.
>
> Point B seems particularly dangerous.
>
> When you lose one of the systems and the lagging server becomes required
> for quorum, then all of a sudden you could be facing a huge delay to
> commit the next transaction (because it needs to catch up on a lot of
> WAL replay). This can happen even without a network problem at all, and
> seems very likely to result in the lagging system being considered
> "down" due to a timeout. Not good, because the reason it is required for
> quorum is because another standby just went down.
>
> In other words, a lagging standby combined with a timeout mechanism is
> essentially useless, because it will never catch up in time to be a part
> of the quorum.

Thanks for explaining what was meant.

This issue is a serious problem with the apply-to-*all*-servers case that
Heikki has been describing as a useful use case. We register a
standby, it goes down and we decide to wait for it. Then when it does
come back up it takes ages to catch up.

This is really the nail in the coffin for the "All" servers use case,
and a significant blow to the requirement for standby registration.

If we use N+1 redundancy as I have explained, then this situation does
not occur until you have less than N standbys available. But then it's
no surprise that RAID-5 won't work with 4 drives either.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 21:21:31
Message-ID: AANLkTikg6e2XoUa2Adsccxyvd2BiV0j9mdx9=DhLz6Zb@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 5, 2010 at 5:10 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> The points appear to be directed at "quorum commit", which is a name
> I've used. But most of the points apply more to Fujii's patch than my
> own. I can only presume that Josh wants to prevent us from adopting a
> design that allows sync against multiple standbys.

This looks to me like a cheap shot that doesn't advance the
discussion. You are the first to complain when people don't take your
ideas as seriously as you feel they should.

>> > A. Permanent Synchronization Failure
>> > ---------------------------------
>> > Quorum commit, like other forms of more-than-one-standby synch rep,
>> > offers the possibility that one or more standbys could end up
>> > irretrievably desyncronized with the master.
>> >
>> > 1. Quorum is 3 servers (out of 5) with mode "apply"
>> > 2. Standbys 2 and 4 receive and apply transaction # 20001.
>> > 3. Due to a network issue, no other standby applies #20001.
>> > 4. Accordingly, the master rolls back #20001 and cancels, either due to
>> > timeout or DBA cancel.
>>
>> The master can not roll back or cancel the transaction. That's
>> completely infeasible, the WAL record has been written to local disk
>> already. The best it can do is halt and wait for enough standbys to
>> appear to fulfill the quorum. The client will hang waiting for the
>> COMMIT to finish, and the transaction will appear as in-progress to
>> other transactions.
>
> Yes, that point has long been understood. Neither patch does this, and
> in fact the issue is a completely general one.

Yep.

>> There's subtle point here that I don't think has been discussed yet: If
>> the master is forcibly restarted at that point, with pg_ctl restart -m
>> immediate, strictly speaking the master should start up in the same
>> state, with the unlucky transaction still appearing as in-progress,
>> until the standby acknowledges.
>
> That is a very important point, but again, nothing to do with quorum
> commit. For strict correctness, we should do that. Are you suggesting we
> should do that here?

I agree that this has nothing to do with quorum commit. It does have
to do with synchronous replication, but I'm skeptical that we want to
get into it for this release, if ever.

>> > 5. #2 and #5 are now hopelessly out of synch with the master.
>>
>> > B. Eventual Inconsistency
>> > -------------------------
>> > If we have a quorum commit, it's possible for any individual standby to
>> > be indefinitely ahead of any standby which is not needed by the quorum.
>> >   This means that:
>> >
>> > -- There is no clear criteria for when a standby which is not needed for
>> > quorum should be considered no longer a synch standby, and
>> > -- Applications cannot make assumptions that synch rep promises some
>> > specific window of synchronicity, eliminating a lot of the value of
>> > quorum commit.
>>
>> Yep.
>
> Could the person that wrote that actually explain what a "specific
> window of synchronicity" is? I'm not sure whether to agree, or disagree.

Me either.

>> > C. Performance
>> > --------------
>> > Doing quorum commit requires significant extra accounting on the
>> > master's part: it must keep track of how many standbys committed for
>> > each pending transaction (and remember there may be many at the same
>> > time).
>> >
>> > Doing so could involve significant response-time overhead added to the
>> > simple case where there is only one standby, as well as memory usage,
>> > and likely a lot of troubleshooting of the mechanism from us.
>>
>> My gut feeling is that overhead will pale to insignificance compared to
>> the network and other overheads of actually getting the WAL to the
>> standby and processing the acknowledgments.
>
> You're ignoring Josh's points. Those exact points have been made by me
> in support of the design of my patch and against Fujii's. The mechanism
> to do this will be more complex and more likely to break. And it will be
> slower and that is a concern for me.

I don't think Heikki ignored Josh's points, and I do think Heikki's
analysis is correct.

>> > D. Adding/Replacing Quorum Members
>> > ----------------------------------
>> > For Quorum commit to be really valuable, we need to be able to add new
>> > quorum members and remove dead ones *without stopping the master*.  Per
>> > discussion about the startup issues with only one master, we have not
>> > worked out how to do this for synch rep standbys.  It's reasonable to
>> > assume that this will be more complex for a quorum group than with a
>> > single synch standby.
>> >
>> > Consider the case, for example, where due to a network outage we have
>> > dropped below quorum.  What is the strategy for getting the system
>> > running again by adding standbys?
>>
>> You start a new one from the latest base backup and let it catch up?
>> Possibly modifying the config file in the master to let it know about
>> the new standby, if we go down that path. This part doesn't seem
>> particularly hard to me.
>
> Agreed, not sure of the issue there.

Also agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 21:30:42
Message-ID: 1286314242.2025.2980.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-10-05 at 13:43 -0700, Josh Berkus wrote:
> Again, I'm just saying that merely doing single-server synch rep,
> *and*
> making HS/SR easier to admin in general, is going to be a big task for
> 9.1. Quorum Commit needs to be considered a separate feature, and one
> which is dispensible for 9.1.

Agreed.

So no need at all for standby.conf. Phew!

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 21:37:29
Message-ID: 1286314649.2025.3015.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-10-05 at 17:21 -0400, Robert Haas wrote:
> On Tue, Oct 5, 2010 at 5:10 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > The points appear to be directed at "quorum commit", which is a name
> > I've used. But most of the points apply more to Fujii's patch than my
> > own. I can only presume that Josh wants to prevent us from adopting a
> > design that allows sync against multiple standbys.
>
> This looks to me like a cheap shot that doesn't advance the
> discussion. You are the first to complain when people don't take your
> ideas as seriously as you feel they should.

Whatever are you talking about? This is a technical discussion.

I'm checking what Josh actually means by Quorum Commit, since
regrettably the points fall very badly against Fujii's patch. Josh has
echoed some points of mine, and Jeff's point about dangerous behaviour
blows a hole a mile wide in the justification for standby.conf etc.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 22:14:22
Message-ID: 4CABA33E.8020802@agliodbs.com
Lists: pgsql-hackers

Simon, Robert,

> The points appear to be directed at "quorum commit", which is a name
> I've used. But most of the points apply more to Fujii's patch than my
> own.

Per previous discussion, I'm trying to get at what reasonable behavior
is, rather than targeting one patch or the other.

> I can only presume that Josh wants to prevent us from adopting a
> design that allows sync against multiple standbys.

Quorum commit == "X servers need to ack for commit", where X > 1.
Usually done as "X out of Y servers must ack", but it's not a given that
the master needs to know how many servers there are, just how many ack'ed.
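A trivial sketch of that definition (the function name is made up): the
master just counts distinct acknowledgements against X and never needs to
know Y.

def quorum_satisfied(acks, x):
    """Commit when X distinct standbys have acked, regardless of Y."""
    return len(set(acks)) >= x

print(quorum_satisfied({"s2", "s4"}, 3))         # False: 2 acks < X = 3
print(quorum_satisfied({"s1", "s2", "s4"}, 3))   # True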

And I'm not against it; I'm just pointing out that it gives us some
issues which we don't have with a single standby, and thus quorum commit
ought to be treated as a separate feature in 9.1 development.

>> The master can not roll back or cancel the transaction. That's
>> completely infeasible, the WAL record has been written to local disk
>> already. The best it can do is halt and wait for enough standbys to
>> appear to fulfill the quorum. The client will hang waiting for the
>> COMMIT to finish, and the transaction will appear as in-progress to
>> other transactions.
>
> Yes, that point has long been understood. Neither patch does this, and
> in fact the issue is a completely general one.

So, in that case, if it's been 10 minutes, and we're still not getting
ack from standbys, what's the exit strategy for the hapless DBA?
Practically speaking? Without restarting the master?

Last I checked, our goal with synch standby was to increase availability,
not decrease it. This is, however, not an issue with quorum commit, but
an issue with sync rep in general.

> Could the person that wrote that actually explain what a "specific
> window of synchronicity" is? I'm not sure whether to agree, or disagree.

A specific amount of time within which all nodes will be consistent
regarding that specific transaction.

>> You start a new one from the latest base backup and let it catch up?
>> Possibly modifying the config file in the master to let it know about
>> the new standby, if we go down that path. This part doesn't seem
>> particularly hard to me.
>
> Agreed, not sure of the issue there.

See previous post. The critical phrase is *without restarting the
master*. AFAICT, no patch has addressed the need to change the master's
synch configuration without restarting it. It's possible that I'm not
following something, in which case I'd love to have it pointed out.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-05 22:40:40
Message-ID: 1286318440.2025.3353.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-10-05 at 15:14 -0700, Josh Berkus wrote:

> > I can only presume that Josh wants to prevent us from adopting a
> > design that allows sync against multiple standbys.
>
> Quorum commit == "X servers need to ack for commit", where X > 1.
> Usually done as "X out of Y servers must ack", but it's not a given that
> the master needs to know how many servers there are, just how many ack'ed.
>
> And I'm not against it; I'm just pointing out that it gives us some
> issues which we don't have with a single standby, and thus quorum commit
> ought to be treated as a separate feature in 9.1 development.

OK, so I did understand you correctly.

Heikki had argued that a use case existed where Y out of Y (i.e. all)
nodes must acknowledge before we commit. That was the use case that
required us to have standby registration. It was optional in all other
cases.

We should note that Oracle only allows X=1, i.e. first acknowledgement
releases waiter. My patch provides X=1 only and takes advantage of the
simpler in-memory data structures as a result.
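For comparison, here is a sketch of how simple the X=1 case is
(illustrative Python only, not code from the patch): the first usable
acknowledgement releases the waiting backend, so almost no
per-transaction state is needed.

def first_ack_releases(commit_lsn, acks):
    """acks is an iterable of (standby_id, flushed_lsn) in arrival order."""
    for standby_id, flushed_lsn in acks:
        if flushed_lsn >= commit_lsn:
            return standby_id   # release the waiter immediately
    return None                 # still waiting

print(first_ack_releases(100, [("s1", 90), ("s3", 120)]))   # -> s3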

> >> The master can not roll back or cancel the transaction. That's
> >> completely infeasible, the WAL record has been written to local disk
> >> already. The best it can do is halt and wait for enough standbys to
> >> appear to fulfill the quorum. The client will hang waiting for the
> >> COMMIT to finish, and the transaction will appear as in-progress to
> >> other transactions.
> >
> > Yes, that point has long been understood. Neither patch does this, and
> > in fact the issue is a completely general one.
>
> So, in that case, if it's been 10 minutes, and we're still not getting
> ack from standbys, what's the exit strategy for the hapless DBA?
> Practically speaking? Without restarting the master?
>
> Last I checked, our goal with synch standby was to increase availablity,
> not decrease it. This is, however, not an issue with quorum commit, but
> an issue with sync rep in general.

Completely agree. When we had that discussion some months/weeks back, we
spoke about having a timeout. My patch has implemented a timeout,
followed by a COMMIT. That allows increased availability, as you say.

You would also be able to specifically release all/some transactions
from wait state with a simple function pg_cancel_sync_wait() (or similar
name).
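A sketch of that timeout behaviour (illustrative Python;
pg_cancel_sync_wait() is only the proposed name mentioned above, nothing
that exists yet): wait for an acknowledgement, but give up after the
configured timeout and let the commit return to the client anyway.

import queue, time

def wait_with_timeout(commit_lsn, ack_queue, timeout_s):
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Timed out (or think of the DBA calling the proposed
            # pg_cancel_sync_wait()): release the waiter even though no
            # standby has confirmed the commit.
            return "timed out; commit released without confirmation"
        try:
            standby_id, flushed_lsn = ack_queue.get(timeout=remaining)
        except queue.Empty:
            continue
        if flushed_lsn >= commit_lsn:
            return "acknowledged by %s" % standby_id

print(wait_with_timeout(100, queue.Queue(), timeout_s=0.1))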

> > Could the person that wrote that actually explain what a "specific
> > window of synchronicity" is? I'm not sure whether to agree, or disagree.
>
> A specific amount of time within which all nodes will be consistent
> regarding that specific transaction.

Certainly no patch offers that. I'm not sure such a possibility exists.
Asking for higher X does make that situation worse.

> >> You start a new one from the latest base backup and let it catch up?
> >> Possibly modifying the config file in the master to let it know about
> >> the new standby, if we go down that path. This part doesn't seem
> >> particularly hard to me.
> >
> > Agreed, not sure of the issue there.
>
> See previous post. The critical phrase is *without restarting the
> master*. AFAICT, no patch has addressed the need to change the master's
> synch configuration without restarting it. It's possible that I'm not
> following something, in which case I'd love to have it pointed out.

My patch does not require a restart of the master to add/remove sync rep
nodes. They just come and go as needed.

I don't think Fujii's patch would have a great problem with that either,
but I can't speak for that with precision.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 00:14:35
Message-ID: 4CABBF6B.7040605@agliodbs.com
Lists: pgsql-hackers


> Heikki had argued that a use case existed where Y out of Y (i.e. all)
> nodes must acknowledge before we commit. That was the use case that
> required us to have standby registration. It was optional in all other
> cases.

Yeah, Y of Y is just a special case of X of Y. And, IMHO, rather
pointless if we can't guarantee consistency between the standbys, which
we can't.

> We should note that Oracle only allows X=1, i.e. first acknowledgement
> releases waiter. My patch provides X=1 only and takes advantage of the
> simpler in-memory data structures as a result.

I agree that we ought to start with X=1 for 9.1 and leave more
complicated architectures until we have that committed and tested.

> You would also be able to specifically release all/some transactions
> from wait state with a simple function pg_cancel_sync_wait() (or similar
> name).

That would be fine for the use cases I'll be implementing.

> My patch does not require a restart of the master to add/remove sync rep
> nodes. They just come and go as needed.
>
> I don't think Fujii's patch would have a great problem with that either,
> but I can't speak for that with precision.

Ok. That really was not made clear in prior arguments.

FYI, for the production uses of synch rep I'd specifically be
implementing, what the users would want is:

1) One master, one synch standby, 1-2 asynch standbys
2) Synch rep tries to synch for # seconds.
3) If it fails, it switches the synch standby to asynch and screams
bloody murder somewhere nagios can pick it up (see the sketch below).
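A sketch of that behaviour in Python (illustrative only; the timeout
value and function names are made up): try synchronous commit for a
configured number of seconds, otherwise degrade the standby to async and
raise an alert for monitoring to pick up.

import time

SYNC_TIMEOUT_S = 30   # stands in for the "# seconds" above; value is made up

def commit_with_degrade(wait_for_ack, alert):
    """wait_for_ack(timeout) returns True if the sync standby confirmed the
    commit within `timeout` seconds; alert() notifies monitoring (nagios)."""
    if wait_for_ack(SYNC_TIMEOUT_S):
        return "synchronous commit confirmed"
    alert("sync standby silent for %ds; switching it to async" % SYNC_TIMEOUT_S)
    return "commit returned without standby confirmation (now async)"

# Example with a standby that never answers in time:
print(commit_with_degrade(lambda timeout: False,
                          lambda msg: print("ALERT: %s" % msg)))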

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 01:52:10
Message-ID: 1286329930.28453.72.camel@jdavis-ux.asterdata.local
Lists: pgsql-hackers

On Tue, 2010-10-05 at 22:19 +0100, Simon Riggs wrote:
> > In other words, a lagging standby combined with a timeout mechanism is
> > essentially useless, because it will never catch up in time to be a part
> > of the quorum.
>
> Thanks for explaining what was meant.
>
> This issue is a serious problem with the apply to *all* servers that
> Heikki has been describing as being a useful use case. We register a
> standby, it goes down and we decide to wait for it. Then when it does
> come back up it takes ages to catch up.
>
> This is really the nail in the coffin for the "All" servers use case,
> and a significant blow to the requirement for standby registration.

I'm not sure I entirely understand. I was concerned about the case of a
standby server being allowed to lag behind the rest by a large number of
WAL records. That can't happen in the "wait for all servers to apply"
case, because the system would become unavailable rather than allow a
significant difference in the amount of WAL applied.

I'm not saying that an unavailable system is good, but I don't see how
my particular complaint applies to the "wait for all servers to apply"
case.

The case I was worried about is:
* 1 master and 2 standby
* The rule is "wait for at least one standby to apply the WAL"

In your notation, I believe that's M -> { S1, S2 }

In that case, if S1 is just a little faster than S2, then S2 might
build up a significant queue of unapplied WAL. Then, when S1 goes down,
there's no way for the slower one to acknowledge a new transaction
without playing through all of the unapplied WAL.
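A toy illustration of that pitfall (invented numbers, "apply"
acknowledgements assumed): while S1 answers every commit, S2 quietly
accumulates unapplied WAL, so when S1 drops out the next commit must wait
for the whole backlog.

# Toy numbers for the M -> { S1, S2 } case with "wait for at least one
# standby to apply": S1 keeps up, S2 applies only every other record.
wal_records = range(1, 1001)    # 1000 commits' worth of WAL

s1_applied = 0
s2_applied = 0
for i, lsn in enumerate(wal_records):
    s1_applied = lsn            # S1 applies promptly, so every commit is acked
    if i % 2 == 0:
        s2_applied += 1         # S2 quietly falls behind, one record at a time

print("S2 is %d records behind at the moment S1 fails"
      % (s1_applied - s2_applied))
# The next synchronous commit must now wait for S2 to replay that entire
# backlog before it can acknowledge anything new.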

Intuitively, the administrator would think that he was getting both HA
and redundancy, but in reality the availability is no better than if
there were only two servers (M -> S1), except that it might be faster to
replay the WAL than to set up a new standby (but that's not guaranteed).

I think you would call that a misconfiguration, and I would agree. I was
just trying to point out a pitfall that I didn't see until I read Josh's
email.

> If we use N+1 redundancy as I have explained, then this situation does
> not occur until you have less than N standbys available. But then it's
> no surprise that RAID-5 won't work with 4 drives either.

Now I'm more confused. I assume that was a typo (because a RAID-5 does
work with 4 drives), but I think it obscured your point.

Regards,
Jeff Davis


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 02:31:49
Message-ID: 1286332309.2025.3941.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-10-05 at 18:52 -0700, Jeff Davis wrote:

> I'm not saying that an unavailable system is good, but I don't see how
> my particular complaint applies to the "wait for all servers to apply"
> case.

> The case I was worried about is:
> * 1 master and 2 standby
> * The rule is "wait for at least one standby to apply the WAL"
>
> In your notation, I believe that's M -> { S1, S2 }
>
> In that case, if one S1 is just a little faster than S2, then S2 might
> build up a significant queue of unapplied WAL. Then, when S1 goes down,
> there's no way for the slower one to acknowledge a new transaction
> without playing through all of the unapplied WAL.

That situation would require two things
* First, you have set up async replication and you're not monitoring it
properly. Shame on you.
* Second, you would have to request "apply" mode sync rep. If you had
requested "recv" or "fsync" mode, then the standby does *not* have to
have applied the WAL before acknowledgement (see the sketch below).
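A rough sketch of that distinction (the mode names follow the discussion
above; everything else is invented): the standby position that counts as
an acknowledgement differs per mode, and only "apply" forces the replay
backlog to be caught up first.

def ack_lsn(mode, received_lsn, fsynced_lsn, applied_lsn):
    if mode == "recv":
        return received_lsn    # acked as soon as the WAL has been received
    if mode == "fsync":
        return fsynced_lsn     # acked once the WAL is safely on disk
    if mode == "apply":
        return applied_lsn     # acked only after the WAL has been replayed
    raise ValueError("unknown mode: %r" % mode)

# A lagging standby: everything received and fsynced, replay far behind.
print(ack_lsn("recv", 1000, 990, 400))    # 1000
print(ack_lsn("apply", 1000, 990, 400))   # 400: commits above this must wait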

Since the first problem is a generic problem with async replication, and
can already happen in 8.2+, it's not exactly an argument against a new
feature.

> Intuitively, the administrator would think that he was getting both HA
> and redundancy, but in reality the availability is no better than if
> there were only two servers (M -> S1), except that it might be faster to
> replay the WAL then to set up a new standby (but that's not guaranteed).

Not guaranteed, but very likely that the standby would not be that far
behind. If it gets too far behind it will likely blow out the disk space
on the standby and fail.

> I think you would call that a misconfiguration, and I would agree.

Yes, regrettably there are various ways to misconfigure this. The above
is really a degeneration of the 2 standby case into the 1 standby case:
if you ask for 2 standbys and one of them is ineffective, then the
system acts like you have only one.

> I was
> just trying to point out a pitfall that I didn't see until I read Josh's
> email.

You mention that it cannot occur if we choose to lock up the master and
cause transactions to wait. That may be true in many cases. It does
still occur when we have transactions that generate a large amount of
WAL, loads, ALTER TABLEs, etc. In those cases, S2 could well fall far
behind S1 during those long transactions and if S1 goes down at that
point there would be a backlog to apply. But again, this only applies
to "apply" mode sync rep.

So it can occur in both cases, though it now looks to me that it's a less
important issue in either case. So I don't think it rates the term
"dangerous" any longer.

Thanks for your careful thought and analysis on this.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 06:31:10
Message-ID: 4CAC17AE.9030506@enterprisedb.com
Lists: pgsql-hackers

On 06.10.2010 01:14, Josh Berkus wrote:
> Last I checked, our goal with synch standby was to increase availablity,
> not decrease it.

No. Synchronous replication does not help with availability. It allows
you to achieve zero data loss, ie. if the master dies, you are
guaranteed that any transaction that was acknowledged as committed, is
still committed.

The other use case is keeping a hot standby server (or servers)
up-to-date, so that you can run queries against it and you are
guaranteed to get the same results you would if you ran the query on the
master.

Those are the two reasonable use cases I've seen. Anything else that has
been discussed is some sort of a combination of those two, or something
that doesn't make much sense when you scratch the surface and start
looking at the failure modes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 07:00:21
Message-ID: 4CAC1E85.7000809@enterprisedb.com
Lists: pgsql-hackers

On 06.10.2010 01:14, Josh Berkus wrote:
>>> You start a new one from the latest base backup and let it catch up?
>>> Possibly modifying the config file in the master to let it know about
>>> the new standby, if we go down that path. This part doesn't seem
>>> particularly hard to me.
>>
>> Agreed, not sure of the issue there.
>
> See previous post. The critical phrase is *without restarting the
> master*. AFAICT, no patch has addressed the need to change the master's
> synch configuration without restarting it. It's possible that I'm not
> following something, in which case I'd love to have it pointed out.

Fair enough. I agree it's important that the configuration can be
changed on the fly. It's orthogonal to the other things discussed, so
let's just assume for now that we'll have that. If not in the first
version, it can be added afterwards. "pg_ctl reload" is probably how it
will be done.

There are some interesting behavioral questions there about what happens
when the configuration is changed. Like if you first define that 3 out
of 5 servers must acknowledge, and you have an in-progress commit that
has received 2 acks already. If you then change the config to "2 out of
4" servers must acknowledge, is the in-progress commit now satisfied?
From the admin point of view, the server that was removed from the
system might've been one that had acknowledged already, and logically in
the new configuration the transaction has only received 1 acknowledgment
from those servers that are still part of the system. Explicitly naming
the standbys in the config file would solve that particular corner case,
but it would no doubt introduce other similar ones.
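To make that corner case concrete (a sketch of the two possible
interpretations only, not of anything implemented): an in-progress commit
with 2 acks can be considered satisfied or not, depending on whether acks
from a removed standby still count after the change from "3 out of 5" to
"2 out of 4".

# Sketch of the reconfiguration question: 2 acks received under "3 out of 5",
# then the config is reloaded to "2 out of 4" with one acking standby removed.
acks = {"s1", "s3"}                               # received under the old config
new_quorum = 2
standbys_after_reload = {"s2", "s3", "s4", "s5"}  # "s1" was dropped

# Interpretation 1: any ack ever received still counts.
print(len(acks) >= new_quorum)                            # True  (2 >= 2)

# Interpretation 2: only acks from standbys still configured count.
print(len(acks & standbys_after_reload) >= new_quorum)    # False (only s3)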

But it's an orthogonal issue, we'll figure it out when we get there.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 07:35:04
Message-ID: 4CAC26A8.5040907@bluegap.ch
Lists: pgsql-hackers

On 10/06/2010 04:31 AM, Simon Riggs wrote:
> That situation would require two things
> * First, you have set up async replication and you're not monitoring it
> properly. Shame on you.

The way I read it, Jeff is complaining about the timeout you propose
that effectively turns sync into async replication in case of a failure.

With a master that waits forever, the standby that's newly required for
quorum certainly still needs its time to catch up. But it wouldn't live
in danger of being "optimized away" for availability in case it cannot
catch up within the given timeout. It's a tradeoff between availability
and durability.

> So it can occur in both cases, though it now looks to me that its less
> important an issue in either case. So I think this doesn't rate the term
> dangerous to describe it any longer.

The proposed timeout certainly still sounds dangerous to me. I'd rather
recommend setting it to an incredibly huge value to minimize its dangers
and get sync replication when that is what has been asked for. Use async
replication for increased availability.

Or do you envision any use case that requires a quorum of X standbys
for normal operation but is just fine with only zero to (X-1) standbys
in case of failures? IMO that's when sync replication is most needed and
when it absolutely should hold to its promises - even if it means to
stop the system.

There's no point in continuing operation if you cannot guarantee the
minimum requirements for durability. If you happen to want such a thing,
you had better rethink your minimum requirement (as performance for
normal operations might benefit from a lower minimum as well).

Regards

Markus Wanner


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 08:01:55
Message-ID: AANLkTimkkrCrtj5LD6m9LyP8taiQUzx4+U9SU+OTYahj@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 6, 2010 at 10:52 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> I'm not sure I entirely understand. I was concerned about the case of a
> standby server being allowed to lag behind the rest by a large number of
> WAL records. That can't happen in the "wait for all servers to apply"
> case, because the system would become unavailable rather than allow a
> significant difference in the amount of WAL applied.
>
> I'm not saying that an unavailable system is good, but I don't see how
> my particular complaint applies to the "wait for all servers to apply"
> case.
>
> The case I was worried about is:
>  * 1 master and 2 standby
>  * The rule is "wait for at least one standby to apply the WAL"
>
> In your notation, I believe that's M -> { S1, S2 }
>
> In that case, if one S1 is just a little faster than S2, then S2 might
> build up a significant queue of unapplied WAL. Then, when S1 goes down,
> there's no way for the slower one to acknowledge a new transaction
> without playing through all of the unapplied WAL.
>
> Intuitively, the administrator would think that he was getting both HA
> and redundancy, but in reality the availability is no better than if
> there were only two servers (M -> S1), except that it might be faster to
> replay the WAL then to set up a new standby (but that's not guaranteed).

Agreed. This is similar to my previous complaint.
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00946.php

This problem would happen even if we fix the quorum to 1 as Josh proposes.
To avoid this, the master must wait for ACK from all the connected
synchronous standbys.

I think that this is especially likely to happen when we choose the 'apply'
replication level, because that level can easily make a synchronous
standby lag due to conflicts between recovery and read-only queries.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 08:06:27
Message-ID: 4CAC2E03.10705@bluegap.ch
Lists: pgsql-hackers

On 10/06/2010 08:31 AM, Heikki Linnakangas wrote:
> On 06.10.2010 01:14, Josh Berkus wrote:
>> Last I checked, our goal with synch standby was to increase availablity,
>> not decrease it.
>
> No. Synchronous replication does not help with availability. It allows
> you to achieve zero data loss, ie. if the master dies, you are
> guaranteed that any transaction that was acknowledged as committed, is
> still committed.

Strictly speaking, it even reduces availability. Which is why nobody
actually wants *only* synchronous replication. Instead they use quorum
commit or semi-synchronous (shudder) replication, which only requires
*some* nodes to be in sync, but effectively replicates asynchronously to
the others.

From that point of view, the requirement of having one synch and two
async standbys is pretty much the same as having three synch standbys
with a quorum commit of 1. (Except for the additional availability of the
latter variant, because in case of a failure of the one sync standby, any
of the others can take over without admin intervention).

Regards

Markus Wanner


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 08:09:08
Message-ID: AANLkTikuu3PrarLUV=TtTqeLwqxN6pMUwEnzY_Uf2fAk@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> No. Synchronous replication does not help with availability. It allows you
> to achieve zero data loss, ie. if the master dies, you are guaranteed that
> any transaction that was acknowledged as committed, is still committed.

Hmm.. but we can increase availability without any data loss by using
synchronous replication. Many people have already been using synchronous
replication software such as DRBD for that purpose.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 08:17:31
Message-ID: 4CAC309B.4000206@enterprisedb.com
Lists: pgsql-hackers

On 06.10.2010 11:09, Fujii Masao wrote:
> On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> No. Synchronous replication does not help with availability. It allows you
>> to achieve zero data loss, ie. if the master dies, you are guaranteed that
>> any transaction that was acknowledged as committed, is still committed.
>
> Hmm.. but we can increase availability without any data loss by using
> synchronous
> replication. Many people have already been using synchronous
> replication softwares
> such as DRBD for that purpose.

Sure, but it's not the synchronous aspect that increases availability.
It's the replication aspect, and we already have that. Making the
replication synchronous allows zero data loss in case the master
suddenly dies, but it comes at the cost of availability.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 08:39:21
Message-ID: 4CAC35B9.1030608@bluegap.ch
Lists: pgsql-hackers

On 10/06/2010 10:17 AM, Heikki Linnakangas wrote:
> On 06.10.2010 11:09, Fujii Masao wrote:
>> Hmm.. but we can increase availability without any data loss by using
>> synchronous
>> replication. Many people have already been using synchronous
>> replication softwares
>> such as DRBD for that purpose.
>
> Sure, but it's not the synchronous aspect that increases availability.
> It's the replication aspect, and we already have that.

..the *asynchronous* replication aspect, yes.

The drbd.conf man page [1] describes the parameters of DRBD. It's worth
noting that even in "Protocol C" (synchronous mode), they sport a
timeout of only 6 seconds (by default).

After that, the primary node proceeds without any kind of guarantee
(which can be thought of as switching to async replication). Just as
Simon proposes for Postgres as well.

Maybe that really is enough for now. Everybody that needs stricter
durability guarantees needs to wait for Postgres-R ;-)

Regards

Markus Wanner

[1]: drbd.conf man page:
http://www.drbd.org/users-guide/re-drbdconf.html


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 08:49:01
Message-ID: AANLkTinpiycX0Dj5tBHijHJAo4xHM+fUNDG74LWn7TvY@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 6, 2010 at 5:17 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 06.10.2010 11:09, Fujii Masao wrote:
>>
>> On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>>
>>> No. Synchronous replication does not help with availability. It allows
>>> you
>>> to achieve zero data loss, ie. if the master dies, you are guaranteed
>>> that
>>> any transaction that was acknowledged as committed, is still committed.
>>
>> Hmm.. but we can increase availability without any data loss by using
>> synchronous
>> replication. Many people have already been using synchronous
>> replication softwares
>> such as DRBD for that purpose.
>
> Sure, but it's not the synchronous aspect that increases availability. It's
> the replication aspect, and we already have that. Making the replication
> synchronous allows zero data loss in case the master suddenly dies, but it
> comes at the cost of availability.

Yep. But I mean that the synchronous aspect is helpful for increasing the
availability of a system which requires no data loss. In asynchronous
replication, when the master goes down, we have to salvage the missing
WAL for the standby from the failed master to avoid data loss. This could
take a very long time and decrease the availability of a system which
doesn't accept any data loss. Since synchronous replication doesn't
require such a salvage, it can increase the availability of such a system.

If all we want is no data loss, we only have to implement the wait-forever
option. But if we take the above-mentioned availability into consideration,
the return-immediately option would also be required.

In some (many, I think) cases, we need to consider availability
and no data loss together, and find the right balance between them.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 08:53:11
Message-ID: 4CAC38F7.1020406@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2010 11:39, Markus Wanner wrote:
> On 10/06/2010 10:17 AM, Heikki Linnakangas wrote:
>> On 06.10.2010 11:09, Fujii Masao wrote:
>>> Hmm.. but we can increase availability without any data loss by using
>>> synchronous
>>> replication. Many people have already been using synchronous
>>> replication softwares
>>> such as DRBD for that purpose.
>>
>> Sure, but it's not the synchronous aspect that increases availability.
>> It's the replication aspect, and we already have that.
>
> ..the *asynchronous* replication aspect, yes.
>
> The drbd.conf man page [1] describes the parameters of DRBD. It's worth
> noting that even in "Protocol C" (synchronous mode), they sport a
> timeout of only 6 seconds (by default).

Wow, that is really short. Are you sure? I have no first hand experience
with DRBD, and reading that man page, I get the impression that the
timeout is just for deciding that the TCP connection is dead. There is
also the ko-count parameter, which defaults to zero. I would guess that
ko-count=0 is "wait forever", while ko-count=1 is what you described,
but I'm not sure.

It's not hard to imagine the master failing in a way that first causes
the connection to the standby to drop, and the disk failing 6 seconds later.
A fire that destroys the network cable first and then spreads to the
disk array for example.

> [1]: drbd.conf man page:
> http://www.drbd.org/users-guide/re-drbdconf.html

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 09:00:00
Message-ID: 4CAC3A90.6040007@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2010 11:49, Fujii Masao wrote:
> On Wed, Oct 6, 2010 at 5:17 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Sure, but it's not the synchronous aspect that increases availability. It's
>> the replication aspect, and we already have that. Making the replication
>> synchronous allows zero data loss in case the master suddenly dies, but it
>> comes at the cost of availability.
>
> Yep. But I mean that the synchronous aspect is helpful to increase the
> availability of the system which requires no data loss. In asynchronous
> replication, when the master goes down, we have to salvage the missing
> WAL for the standby from the failed master to avoid data loss. This would
> take very long and decrease the availability of the system which doesn't
> accept any data loss. Since the synchronous doesn't require such a salvage,
> it can increase the availability of such a system.

In general, salvaging the WAL that was not sent to the standby yet is
outright impossible. You can't achieve zero data loss with asynchronous
replication at all.

> If we want only no data loss, we only have to implement the wait-forever
> option. But if we take the above-mentioned availability into consideration,
> the return-immediately option would also be required.
>
> In some (many, I think) cases, we need to consider availability
> and no data loss together, and find the right balance between them.

If you need both, you need three servers as Simon pointed out earlier.
There is no way around that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 09:11:07
Message-ID: 4CAC3D2B.5020902@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/06/2010 10:53 AM, Heikki Linnakangas wrote:
> Wow, that is really short. Are you sure? I have no first hand experience
> with DRBD,

Neither do I.

> and reading that man page, I get the impression that the
> timeout is just for deciding that the TCP connection is dead. There is
> also the ko-count parameter, which defaults to zero. I would guess that
> ko-count=0 is "wait forever", while ko-count=1 is what you described,
> but I'm not sure.

Yeah, sounds more likely. Then I'm surprised that I didn't find any
warning that Protocol C definitely reduces availability (with the
ko-count=0 default, that is). Instead, they only state that it's the
most used replication mode, which really makes me wonder. [1]

Sorry for adding confusion by not researching properly.

Regards

Markus Wanner

[1] DRBD Replication Modes
http://www.drbd.org/users-guide-emb/s-replication-protocols.html


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 10:41:11
Message-ID: AANLkTimNq6fEDa_P+SLOZMeYO2rOkmjGsQzKPAMBpikH@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 6, 2010 at 10:17, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 06.10.2010 11:09, Fujii Masao wrote:
>>
>> On Wed, Oct 6, 2010 at 3:31 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>>
>>> No. Synchronous replication does not help with availability. It allows
>>> you
>>> to achieve zero data loss, ie. if the master dies, you are guaranteed
>>> that
>>> any transaction that was acknowledged as committed, is still committed.
>>
>> Hmm.. but we can increase availability without any data loss by using
>> synchronous
>> replication. Many people have already been using synchronous
>> replication softwares
>> such as DRBD for that purpose.
>
> Sure, but it's not the synchronous aspect that increases availability. It's
> the replication aspect, and we already have that. Making the replication
> synchronous allows zero data loss in case the master suddenly dies, but it
> comes at the cost of availability.

That's only for a narrow definition of availability. For a lot of
people, having access to your data isn't considered availability
unless you can trust the data...

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 11:11:14
Message-ID: 4CAC5952.3060204@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2010 13:41, Magnus Hagander wrote:
> That's only for a narrow definition of availability. For a lot of
> people, having access to your data isn't considered availability
> unless you can trust the data...

Ok, fair enough. For that, synchronous replication in the "wait forever"
mode is the only alternative. That on its own doesn't give you any boost
in availability, on the contrary, but coupled with suitable clustering
tools to handle failover and decide when the standby is dead, you can
achieve that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 12:22:00
Message-ID: m2zkurldef.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> On 10/06/2010 04:31 AM, Simon Riggs wrote:
>> That situation would require two things
>> * First, you have set up async replication and you're not monitoring it
>> properly. Shame on you.
>
> The way I read it, Jeff is complaining about the timeout you propose
> that effectively turns sync into async replication in case of a failure.
>
> With a master that waits forever, the standby that's newly required for
> quorum certainly still needs its time to catch up. But it wouldn't live
> in danger of being "optimized away" for availability in case it cannot
> catch up within the given timeout. It's a tradeoff between availability
> and durability.

What is necessary here is a clear view on the possible states that a
standby can be in at any time, and we must stop trying to apply to
some non-ready standby the behavior we want when it's already in-sync.

From my experience operating londiste, those states would be:

1. base-backup — self explaining
2. catch-up — getting the WAL to catch up after base backup
3. wanna-sync — don't yet have all the WAL to get in sync
4. do-sync — all WALs are there, coming soon
5. ok (async | recv | fsync | reply — feedback loop engaged)

So you only consider that a standby is a candidate for sync rep when
it's reached the ok state, and that's when it's able to fill the
feedback loop we've been talking about. Standby state != ok, no waiting
no nothing, it's *not* a standby as far as the master is concerned.

The other states make it possible to manage accepting a new standby into
an existing setup, and to manage failures. When we stop receiving the
feedback loop events, the master knows the slave isn't in the "ok" state
any more and can demote it to "wanna-sync", because it has to keep WALs
until the slave comes back.

If the standby is not back online and wal_keep_segments means that
we can't keep its WAL any more, the state goes back to "base-backup".
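
To make those transitions concrete, here is a rough sketch of that state
machine in throw-away Python; the state names follow the list above, but
the transition rules are only my reading of it, not anything implemented:

STATES = ["base-backup", "catch-up", "wanna-sync", "do-sync", "ok"]

def next_state(state, event):
    # Failure-style events first.
    if state == "ok" and event == "feedback-lost":
        return "wanna-sync"       # demoted; master has to keep WALs
    if event == "wal-rotated-away":
        return "base-backup"      # wal_keep_segments window exceeded
    # Normal forward progression.
    forward = {
        ("base-backup", "backup-done"): "catch-up",
        ("catch-up", "streaming"): "wanna-sync",
        ("wanna-sync", "all-wal-present"): "do-sync",
        ("do-sync", "caught-up"): "ok",
    }
    return forward.get((state, event), state)

def is_sync_candidate(state):
    # Only an "ok" standby counts towards sync rep.
    return state == "ok"

s = "base-backup"
for e in ["backup-done", "streaming", "all-wal-present", "caught-up"]:
    s = next_state(s, e)
print(s, is_sync_candidate(s))         # -> ok True
print(next_state(s, "feedback-lost"))  # -> wanna-sync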

I'm not going into every detail here (for example, we might need some
protocol arbitration for the standby to be able to explain to the master
that it's ok even if the master thinks it's not), but my point is that
without a clear list of standby states, we're going to hinder the master
in situations where it makes no sense to do so.

> Or do you envision any use case that requires a quorum of X standbies
> for normal operation but is just fine with only none to (X-1) standbies
> in case of failures? IMO that's when sync replication is most needed and
> when it absolutely should hold to its promises - even if it means to
> stop the system.
>
> There's no point in continuing operation if you cannot guarantee the
> minimum requirements for durability. If you happen to want such a thing,
> you should better rethink your minimum requirement (as performance for
> normal operations might benefit from a lower minimum as well).

This part of the discussion made me think of yet another refinement on
the Quorum Commit idea, even if I'm beginning to think that can be
material for later.

Basic quorum commit has each transaction on the master wait for a
total number of votes acknowledging the transaction as synced. Each standby
has a weight, meaning 1 or more votes. The problem is that the flexibility
isn't there; some cases are impossible to set up. Also, people want to be
able to specify their favorite standby, and that quickly becomes awkward.

Idea: segment the votes into "colors" or any categories you like. Have
each standby be a member of a category list, and require per-category
quorums to be reached. This is the same as attributing roles to standbys
and saying that they're all equivalent as long as they're part of the given
role, with the added flexibility that you can sometimes require more than
one standby of a given role to take part in the quorum.
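
A toy sketch of that per-category quorum check, in throw-away Python (the
standby names, categories and required counts are invented for the example;
this is only to illustrate the idea, not proposed code):

def quorum_reached(required, acked, category_of):
    # required:     {category: number of ACKs needed from that category}
    # acked:        set of standby names that have ACKed this transaction
    # category_of:  {standby name: category it belongs to}
    counts = {}
    for standby in acked:
        cat = category_of[standby]
        counts[cat] = counts.get(cat, 0) + 1
    return all(counts.get(cat, 0) >= n for cat, n in required.items())

category_of = {"london1": "offsite", "london2": "offsite", "rack2": "local"}
required = {"offsite": 1, "local": 1}     # one ACK from each "color"

print(quorum_reached(required, {"london1"}, category_of))           # False
print(quorum_reached(required, {"london1", "rack2"}, category_of))  # True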

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 12:26:51
Message-ID: 4CAC6B0B.7000308@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2010 15:22, Dimitri Fontaine wrote:
> What is necessary here is a clear view on the possible states that a
> standby can be in at any time, and we must stop trying to apply to
> some non-ready standby the behavior we want when it's already in-sync.
>
> From my experience operating londiste, those states would be:
>
> 1. base-backup — self explaining
> 2. catch-up — getting the WAL to catch up after base backup
> 3. wanna-sync — don't yet have all the WAL to get in sync
> 4. do-sync — all WALs are there, coming soon
> 5. ok (async | recv | fsync | reply — feedback loop engaged)
>
> So you only consider that a standby is a candidate for sync rep when
> it's reached the ok state, and that's when it's able to fill the
> feedback loop we've been talking about. Standby state != ok, no waiting
> no nothing, it's *not* a standby as far as the master is concerned.

You're not going to get zero data loss that way. Can you elaborate what
the use case for that mode is?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 14:20:47
Message-ID: 1286374847.2304.101.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-06 at 15:26 +0300, Heikki Linnakangas wrote:

> You're not going to get zero data loss that way.

Ending the wait state does not cause data loss. It puts you at *risk* of
data loss, which is a different thing entirely.

If you want to avoid data loss you use N+k redundancy and get on with
life, rather than sitting around waiting.

Putting in a feature for people that choose k=0 seems wasteful to me,
since they knowingly put themselves at risk in the first place.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 15:02:29
Message-ID: m2tykzl5yy.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> 1. base-backup — self explaining
>> 2. catch-up — getting the WAL to catch up after base backup
>> 3. wanna-sync — don't yet have all the WAL to get in sync
>> 4. do-sync — all WALs are there, coming soon
>> 5. ok (async | recv | fsync | reply — feedback loop engaged)
>>
>> So you only consider that a standby is a candidate for sync rep when
>> it's reached the ok state, and that's when it's able to fill the
>> feedback loop we've been talking about. Standby state != ok, no waiting
>> no nothing, it's *not* a standby as far as the master is concerned.
>
> You're not going to get zero data loss that way. Can you elaborate what the
> use case for that mode is?

You can't pretend to sync with zero data loss until the standby is ready
for it, or you need to take the site down while you add your standby.

I can see some user willing to take the site down while doing the base
backup dance then waiting for initial sync, then only accepting traffic
and being secure against data loss, but I'd much rather that be an
option and you could watch for your standby's state in a system view.

Meanwhile, I can't understand any reason for the master to pretend it
can safely manage any sync-rep transaction while there's no standby
around. Either you wait for the quorum and don't have it, or you have to
track standby states with precision and maybe actively reject writes.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 15:04:11
Message-ID: 4CAC8FEB.2040300@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2010 17:20, Simon Riggs wrote:
> On Wed, 2010-10-06 at 15:26 +0300, Heikki Linnakangas wrote:
>
>> You're not going to get zero data loss that way.
>
> Ending the wait state does not cause data loss. It puts you at *risk* of
> data loss, which is a different thing entirely.

Looking at it that way, asynchronous replication just puts you at risk
of data loss too, it doesn't necessarily mean you get data loss.

The key is whether you are guaranteed to have zero data loss or not. If
you don't wait forever, you're not guaranteed zero data loss. It's just
best effort, like asynchronous replication. The situation you want to
avoid is that the master dies, and you don't know if you have suffered
data loss or not.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 15:07:51
Message-ID: 4CAC90C7.4080502@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2010 18:02, Dimitri Fontaine wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> 1. base-backup — self explaining
>>> 2. catch-up — getting the WAL to catch up after base backup
>>> 3. wanna-sync — don't yet have all the WAL to get in sync
>>> 4. do-sync — all WALs are there, coming soon
>>> 5. ok (async | recv | fsync | reply — feedback loop engaged)
>>>
>>> So you only consider that a standby is a candidate for sync rep when
>>> it's reached the ok state, and that's when it's able to fill the
>>> feedback loop we've been talking about. Standby state != ok, no waiting
>>> no nothing, it's *not* a standby as far as the master is concerned.
>>
>> You're not going to get zero data loss that way. Can you elaborate what the
>> use case for that mode is?
>
> You can't pretend to sync with zero data loss until the standby is ready
> for it, or you need to take the site down while you add your standby.
>
> I can see some user willing to take the site down while doing the base
> backup dance then waiting for initial sync, then only accepting traffic
> and being secure against data loss, but I'd much rather that be an
> option and you could watch for your standby's state in a system view.
>
> Meanwhile, I can't understand any reason for the master to pretend it
> can safely manage any sync-rep transaction while there's no standby
> around. Either you wait for the quorum and don't have it, or you have to
> track standby states with precision and maybe actively reject writes.

I'm sorry, but I still don't understand the use case you're envisioning.
How many standbys are there? What are you trying to achieve with
synchronous replication over what asynchronous offers?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 15:12:52
Message-ID: 4CAC91F4.5020903@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/06/2010 04:20 PM, Simon Riggs wrote:
> Ending the wait state does not cause data loss. It puts you at *risk* of
> data loss, which is a different thing entirely.

These kinds of risk scenarios are what sync replication is all about. A
minimum guarantee that doesn't hold in the face of the first few failures
(see Jeff's argument) isn't worth a dime.

Keep in mind that upon failure, the other nodes presumably get more
load. As has been seen with RAID, that easily leads to subsequent
failures. Sync rep needs to be able to protect against that *as well*.

> If you want to avoid data loss you use N+k redundancy and get on with
> life, rather than sitting around waiting.

With that notion, I'd argue that quorum_commit needs to be set to
exactly k, because any higher value would only cost performance without
any useful benefit.

But if I want at least k ACKs and if I think it's worth the performance
penalty that brings during normal operation, I want that guarantee to
hold true *especially* in case of an emergency.

If availability is more important, you need to increase N and make sure
enough of these (asynchronously) replicated nodes stay up. Increase k
(thus quorum commit) for a stronger durability guarantee.

> Putting in a feature for people that choose k=0 seems wasteful to me,
> since they knowingly put themselves at risk in the first place.

Given the above logic, k=0 equals completely async replication. Not
sure what's wrong with that.

Regards

Markus Wanner


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 15:41:07
Message-ID: m2bp77l46k.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> I'm sorry, but I still don't understand the use case you're envisioning. How
> many standbys are there? What are you trying to achieve with synchronous
> replication over what asynchronous offers?

Sorry if I've been unclear: I read loads of messages, then tried to pick
the right one to answer, and obviously failed to spell out some
context.

My concern starts with only 1 standby, and is in fact 2 questions:

- Why oh why wouldn't you be able to fix your sync setup in the master
as soon as there's a standby doing a base backup?

- when do you start considering the standby as a candidate to your sync
rep requirements?

Lots of the discussion we're having are taking as an implicit that the
answer is "as soon as you know about its existence, that must be at the
pg_start_backup() point". I claim that's incorrect, and you can't ask
the master to wait forever until the standby is in sync. All the more
because there's a window with wal_keep_segments here too, so the sync
might never happen.

To solve that problem, I propose managing the current state of the
standby.

That means auto-registration of any standby, a feedback loop at more
stages, and some protocol arbitration for the standby to be able to say
"I'm this far actually", so that the master knows how to consider
it, rather than just demoting it while live.

Once you have a clear list of possible states for a standby, and can
decide what errors mean in terms of transitions in the state
machine, you're able to decide when wait-forever is an option and when
you should ignore it or refuse any transaction commit with side effects.

And you can offer an option to guarantee the wait-forever behavior only
when it makes sense, rather than trying to catch your own tail as soon
as a standby is added in the mix, as with the proposals I've read where
you can't even restart the master at this point.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 17:57:57
Message-ID: 4CACB8A5.2040906@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

All,

Let me clarify and consolidate this discussion. Again, it's my goal
that this thread specifically identify only the problems and desired
behaviors for synch rep with more than one sync standby. There are
several issues with even one sync standby which still remain unresolved,
but I believe that we should discuss those on a separate thread, for
clarity.

I also strongly believe that we should get single-standby functionality
committed and tested *first*, before working further on multi-standby.

So, to summarize earlier discussion on this thread:

There are 2 reasons to have more than one sync standby:

1) To increase durability above the level of a single synch standby,
even at the cost of availability.

2) To increase availability without decreasing durability below the
level offered by a single sync standby.

The "pure" setup for each of these options, where N is the number of
standbys and k is the number of acks required from standbys is:

1) k = N, N > 1, apply
2) k = 1, N > 1, recv

(Timeouts are a specific compromise of durability for availability on
*one* server, and as such will not be discussed here. BTW, I was the
one who suggested a timeout, rather than Simon, so if you don't like the
idea, harass me about it.)

Any other configuration (3) than the two above is a specific compromise
between durability and availability, for example:

3a) k = 2, N = 3, fsync
3b) k = 3, N = 10, recv

... should give you better durability than case 2) and better
availability than case 1).

While it's nice to dismiss case (1) as an edge-case, consider the
likelihood of someone running PostgreSQL with fsync=off on cloud
hosting. In that case, having k = N = 5 does not seem like an
unreasonable arrangement if you want to ensure durability via
replication. It's what the CAP databases do.

After eliminating some of my issues as non-issues, here's what we're
left with for problems on the above:

(1), (3) Accounting/Registration. Implementing any of these cases would
seem to require some form of accounting and/or registration on the
master in terms of, at a minimum, the number of acks for each data send.
More likely we will need, as proposed on other threads, a register of
standbys and the sync state of each. Not only will this
accounting/registration be hard code to write, it will have at least
*some* performance overhead. Whether that overhead is minor or
substantial can only be determined through testing. Further, there's
the issue of whether, and how, we transmit this register to the standbys
so that they can be promoted.
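
To give an idea of the kind of accounting involved, here's a purely
illustrative sketch in Python (invented names, not the proposed
implementation): the master essentially has to remember, per pending
commit, which standbys have ACKed, and release the waiter once k ACKs
are in.

class PendingCommits:
    def __init__(self, acks_needed):
        self.acks_needed = acks_needed   # the "k" in k acks out of N standbys
        self.pending = {}                # xid -> set of standbys that ACKed

    def begin_wait(self, xid):
        self.pending[xid] = set()

    def ack(self, xid, standby):
        # Record one standby's ACK; return True once the commit may be
        # released to the waiting client.
        if xid not in self.pending:
            return True                  # already released earlier
        self.pending[xid].add(standby)
        if len(self.pending[xid]) >= self.acks_needed:
            del self.pending[xid]
            return True
        return False

p = PendingCommits(acks_needed=2)
p.begin_wait(1001)
print(p.ack(1001, "standby-a"))   # False: still waiting
print(p.ack(1001, "standby-b"))   # True: quorum reached, release the waiter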

(2), (3) Degradation: (Jeff) these two cases make sense only if we give
DBAs the tools they need to monitor which standbys are falling behind,
and to drop and replace those standbys. Otherwise we risk giving DBAs
false confidence that they have better-than-1-standby reliability when
actually they don't. Current tools are not really adequate for this.

(1), (3) Dynamic Re-configuration: we need the ability to add and remove
standbys at runtime. We also need to have a verdict on how to handle
the case where a transaction is pending, per Heikki.

(2), (3) Promotion: all multi-standby high-availability cases only make
sense if we provide tools to promote the most current standby to be the
new master. Otherwise the whole cluster still goes down whenever we
have to replace the master. We also should provide some mechanism for
promoting an async standby to sync; this has already been discussed.

(1) Consistency: this is another DBA-false-confidence issue. DBAs who
implement (1) are liable to do so thinking that they are not only
guaranteeing the consistency of every standby with the master, but the
consistency of every standby with every other standby -- a kind of dummy
multi-master. They are not, so it will take multiple reminders and
workarounds in the docs to explain this. And we'll get complaints anyway.

(1), (2), (3) Initialization: (Dimitri) we need a process whereby a
standby can go from cloned to synched to being a sync rep standby, and
possibly from degraded to synced again and back.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 18:01:27
Message-ID: 4CACB977.3020608@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello Dimitri,

On 10/06/2010 05:41 PM, Dimitri Fontaine wrote:
> - when do you start considering the standby as a candidate to your sync
> rep requirements?

That question doesn't make much sense to me. There's no point in time I
ever mind if a standby is a "candidate" or not. Either I want to
synchronously replicate to X standbies, or not.

> Lots of the discussion we're having are taking as an implicit that the
> answer is "as soon as you know about its existence, that must be at the
> pg_start_backup() point".

This is an admin decision. Whether your standbies are up and
running or not, existing or just about to be bought, that doesn't have
any impact on your durability requirements. If you want your bank
account data to be saved in at least two different locations, I think
that's your requirement.

You'd be quite unhappy if your bank lost your last month's salary, but
stated: "hey, at least we didn't have any downtime".

> And you can offer an option to guarantee the wait-forever behavior only
> when it makes sense, rather than trying to catch your own tail as soon
> as a standby is added in the mix

Of course, it doesn't make sense to wait-forever on *every* standby that
ever gets added. Quorum commit is required, yes (and that's what this
thread is about, IIRC). But with quorum commit, adding a standby only
improves availability, but certainly doesn't block the master in any
way. (Quite the opposite: it can allow the master to continue, if it has
been blocked before because the quorum hasn't been reached).

Regards

Markus Wanner


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 19:04:11
Message-ID: m2eic3jg7o.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> There's no point in time I
> ever mind if a standby is a "candidate" or not. Either I want to
> synchronously replicate to X standbies, or not.

Ok, so I think we're agreeing here: what I said amounts to proposing that
the code work this way when the quorum is set up that way, and/or be
able to reject any non-read-only transaction (those that need a real
XID) until your standby is fully in sync.

I'm just saying that this should be an option, not the only choice.

And that by having a clear view of the system's state, it's possible to
have a clear error response policy set out.

> This is an admin decision. Whether your standbies are up and
> running or not, existing or just about to be bought, that doesn't have
> any impact on your durability requirements.

Depends, lots of things out there work quite well in best effort mode,
even if some projects need more careful thinking. That's again the idea
of waiting forever or just continuing, there's a middle-ground which is
starting the system before reaching the durability requirements or
downgrading it to read only, or even off, until you get them.

You can read my proposal as a way to allow our users to choose between
those two incompatible behaviours.

> Of course, it doesn't make sense to wait-forever on *every* standby that
> ever gets added. Quorum commit is required, yes (and that's what this
> thread is about, IIRC). But with quorum commit, adding a standby only
> improves availability, but certainly doesn't block the master in any
> way. (Quite the opposite: it can allow the master to continue, if it has
> been blocked before because the quorum hasn't been reached).

If you ask for a quorum larger than what the current standbys are able
to deliver, and you're set to wait forever until the quorum is reached,
you just blocked the master.

Good news is that the quorum is a per-transaction setting, so opening a
superuser connection to act on the currently waiting transaction is
still possible (pass/fail, but fail is what at this point? shutdown to
wait some more offline?).

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 19:35:44
Message-ID: 4CACCF90.6050107@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.10.2010 20:57, Josh Berkus wrote:
> While it's nice to dismiss case (1) as an edge-case, consider the
> likelihood of someone running PostgreSQL with fsync=off on cloud
> hosting. In that case, having k = N = 5 does not seem like an
> unreasonable arrangement if you want to ensure durability via
> replication. It's what the CAP databases do.

Seems reasonable, but what is a CAP database?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 19:51:36
Message-ID: 4CACD348.9060108@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/06/2010 09:04 PM, Dimitri Fontaine wrote:
> Ok, so I think we're agreeing here: what I said amounts to proposing that
> the code work this way when the quorum is set up that way, and/or be
> able to reject any non-read-only transaction (those that need a real
> XID) until your standby is fully in sync.
>
> I'm just saying that this should be an option, not the only choice.

I'm sorry, I just don't see the use case for a mode that drops
guarantees when they are most needed. People who don't need those
guarantees should definitely go for async replication instead.

What does a synchronous replication mode that falls back to async upon
failure give you, except for a severe degradation in performance during
normal operation? Why not use async right away in such a case?

> Depends, lots of things out there work quite well in best effort mode,
> even if some projects need more careful thinking. That's again the idea
> of waiting forever or just continuing, there's a middle-ground which is
> starting the system before reaching the durability requirements or
> downgrading it to read only, or even off, until you get them.

In such cases the admin should be free to reconfigure the quorum. And
yes, a read-only mode might be feasible. Please just don't fool the
admin with a "best effort" thing that guarantees nothing (but trouble).

> If you ask for a quorum larger than what the current standbys are able
> to deliver, and you're set to wait forever until the quorum is reached,
> you just blocked the master.

Correct. That's the intended behavior.

> Good news is that the quorum is a per-transaction setting

I definitely like the per-transaction thing.

> so opening a
> superuser connection to act on the currently waiting transaction is
> still possible (pass/fail, but fail is what at this point? shutdown to
> wait some more offline?).

Not sure I'm following here. The admin will be busy re-establishing
(connections to) standbies, killing transactions on the master doesn't
help anything - whether or not the master waits forever.

Regards

Markus Wanner


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 19:53:24
Message-ID: 4CACD3B4.60501@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Seems reasonable, but what is a CAP database?

Databases based around the CAP theorem [1]: Cassandra, Dynamo,
Hypertable, etc.

For us, the equation is CAD, as in Consistency, Availability,
Durability. Pick any two, at best. But it's a very similar bag of
issues to the ones CAP addresses.

[1]http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 20:01:13
Message-ID: 1286395273.2304.114.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-06 at 18:04 +0300, Heikki Linnakangas wrote:

> The key is whether you are guaranteed to have zero data loss or not.

We agree that is an important question.

You seem willing to trade anything for that guarantee. I seek a more
pragmatic approach that balances availability and risk.

Those views are different, but not inconsistent. Oracle manages to offer
multiple options and so can we.

If you desire that, go for it. But don't try to stop others having a
simple, pragmatic approach. The code to implement your desired option is
more complex and really should come later. I don't in any way wish to
block that option in this release, or any other, but please don't try to
persuade people it's the only sensible option 'cos it damn well isn't.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 07:30:56
Message-ID: 1286436656.2304.187.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote:

> I also strongly believe that we should get single-standby
> functionality committed and tested *first*, before working further on
> multi-standby.

Yes, let's get k = 1 first.

With k = 1 the number of standbys is not limited, so we can still have
very robust and highly available architectures. So we mean
"first-acknowledgement-releases-waiters".

> (1) Consistency: this is another DBA-false-confidence issue. DBAs who
> implement (1) are liable to do so thinking that they are not only
> guaranteeing the consistency of every standby with the master, but the
> consistency of every standby with every other standby -- a kind of
> dummy multi-master. They are not, so it will take multiple reminders
> and workarounds in the docs to explain this. And we'll get complaints
> anyway.

This puts the matter very clearly. Setting k = N is not as good an idea
as it sounds when first described.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 07:32:50
Message-ID: 1286436770.2304.189.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote:
> (2), (3) Degradation: (Jeff) these two cases make sense only if we
> give
> DBAs the tools they need to monitor which standbys are falling behind,
> and to drop and replace those standbys. Otherwise we risk giving DBAs
> false confidence that they have better-than-1-standby reliability when
> actually they don't. Current tools are not really adequate for this.

Current tools work just fine for identifying if a server is falling
behind. This improved in 9.0 to give fine-grained information. Nothing
more is needed here within the server.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 09:46:00
Message-ID: 4CAD96D8.50405@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/06/2010 10:01 PM, Simon Riggs wrote:
> The code to implement your desired option is
> more complex and really should come later.

I'm sorry, but I think of that exactly the opposite way. The timeout for
automatic continuation after waiting for a standby is the addition. The
wait state of the master is there anyway, whether or not it's bound by a
timeout. The timeout option should thus come later.

Regards

Markus Wanner


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 09:52:13
Message-ID: m2vd5egwj6.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
>> I'm just saying that this should be an option, not the only choice.
>
> I'm sorry, I just don't see the use case for a mode that drops
> guarantees when they are most needed. People who don't need those
> guarantees should definitely go for async replication instead.

We're still talking about freezing the master and all the applications
when the first standby still has to do a base backup and catch-up to
where the master currently is, right?

> What does a synchronous replication mode that falls back to async upon
> failure give you, except for a severe degradation in performance during
> normal operation? Why not use async right away in such a case?

It's all about the standard case you're building, sync rep, and how to
manage errors. In most cases I want flexibility. Alert says standby is
down, you lost your durability requirements, so now I'm building a new
standby. Does it mean my applications are all off and the master
refusing to work? I sure hope I can choose about that, if possible per
application.

Next step, the old standby has been able to boot again, thanks to the
sysadmins who repaired it, so it's online again, and my replacement
machine is doing a base-backup. Are all the applications still
unavailable? I sure hope I have a say in this decision.

>> so opening a
>> superuser connection to act on the currently waiting transaction is
>> still possible (pass/fail, but fail is what at this point? shutdown to
>> wait some more offline?).
>
> Not sure I'm following here. The admin will be busy re-establishing
> (connections to) standbies, killing transactions on the master doesn't
> help anything - whether or not the master waits forever.

The idea here would be to be able to manually ACK a transaction that's
waiting forever, because you know it won't have an answer and you'd
prefer the application to just continue. But I see that's not a valid
use case for you.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 10:07:48
Message-ID: 4CAD9BF4.90706@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07.10.2010 12:52, Dimitri Fontaine wrote:
> Markus Wanner<markus(at)bluegap(dot)ch> writes:
>>> I'm just saying that this should be an option, not the only choice.
>>
>> I'm sorry, I just don't see the use case for a mode that drops
>> guarantees when they are most needed. People who don't need those
>> guarantees should definitely go for async replication instead.
>
> We're still talking about freezing the master and all the applications
> when the first standby still has to do a base backup and catch-up to
> where the master currently is, right?

Either that, or you configure your system for asynchronous replication
first, and flip the switch to synchronous only after the standby has
caught up. Setting up the first standby happens only once when you
initially set up the system, or if you're recovering from a catastrophic
loss of the standby.

>> What does a synchronous replication mode that falls back to async upon
>> failure give you, except for a severe degradation in performance during
>> normal operation? Why not use async right away in such a case?
>
> It's all about the standard case you're building, sync rep, and how to
> manage errors. In most cases I want flexibility. Alert says standby is
> down, you lost your durability requirements, so now I'm building a new
> standby. Does it mean my applications are all off and the master
> refusing to work?

Yes. That's why you want to have at least two standbys if you care about
availability. Or if durability isn't that important to you after all,
use asynchronous replication.

Of course, if in the heat of the moment the admin is willing to forge
ahead without the standby, he can temporarily change the configuration
in the master. If you want the standby to be rebuilt automatically, you
can even incorporate that configuration change in the scripts too. The
important point is that you or your scripts are in control, and you know
at all times whether you can trust the standby or not. If the master
makes such decisions automatically, you don't know if the standby is
trustworthy (ie. guaranteed up-to-date) or not.

>>> so opening a
>>> superuser connection to act on the currently waiting transaction is
>>> still possible (pass/fail, but fail is what at this point? shutdown to
>>> wait some more offline?).
>>
>> Not sure I'm following here. The admin will be busy re-establishing
>> (connections to) standbies, killing transactions on the master doesn't
>> help anything - whether or not the master waits forever.
>
> The idea here would be to be able to manually ACK a transaction that's
> waiting forever, because you know it won't have an answer and you'd
> prefer the application to just continue. But I see that's not a valid
> use case for you.

I don't see anything wrong with having tools for admins to deal with the
unexpected. I'm not sure overriding individual transactions is very
useful though, more likely you'll want to take the whole server offline,
or you want to change the config to allow all transactions to continue
without the synchronous standby.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 10:32:41
Message-ID: m21v82gunq.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Either that, or you configure your system for asynchronous replication
> first, and flip the switch to synchronous only after the standby has caught
> up. Setting up the first standby happens only once when you initially set up
> the system, or if you're recovering from a catastrophic loss of the
> standby.

Or if the standby is lagging and the master wal_keep_segments is not
sized big enough. Is that a catastrophic loss of the standby too?

>> It's all about the standard case you're building, sync rep, and how to
>> manage errors. In most cases I want flexibility. Alert says standby is
>> down, you lost your durability requirements, so now I'm building a new
>> standby. Does it mean my applications are all off and the master
>> refusing to work?
>
> Yes. That's why you want to have at least two standbys if you care about
> availability. Or if durability isn't that important to you after all, use
> asynchronous replication.

Agreed, that's a nice simple use case.

Another one is to say that I want sync rep when the standby is
available, but I don't have the budget for more. So I prefer a good
alerting system and low-budget-no-guarantee when the standby is down,
that's my risk evaluation.

> Of course, if in the heat of the moment the admin is willing to forge ahead
> without the standby, he can temporarily change the configuration in the
> master. If you want the standby to be rebuilt automatically, you can even
> incorporate that configuration change in the scripts too. The important
> point is that you or your scripts are in control, and you know at all times
> whether you can trust the standby or not. If the master makes such decisions
> automatically, you don't know if the standby is trustworthy (ie. guaranteed
> up-to-date) or not.

My proposal is that the master has the information to make the decision,
and the behavior is something you set up. Default to security, so wait
forever and block the applications, but it could be set to ignore standbys
that have not at least reached this state.

I don't see that you can make everybody happy without a knob here, and I
don't see how we can deliver one without a clear state diagram of the
standby's possible states and transitions.

The other alternative is to just not care and accept the timeout as
an option alongside the quorum, so that you just don't wait for the
quorum if you so choose. It's much more dynamic and dangerous, but with a
good alerting system it'll be very popular, I guess.

> I don't see anything wrong with having tools for admins to deal with the
> unexpected. I'm not sure overriding individual transactions is very useful
> though, more likely you'll want to take the whole server offline, or you
> want to change the config to allow all transactions to continue without the
> synchronous standby.

The question then is, should the new configuration alter running
transactions? My implicit assumption was that it shouldn't, and then I need
another facility, such as

SELECT pg_cancel_quorum_wait(procpid)
FROM pg_stat_activity
WHERE waiting_quorum;

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 11:08:48
Message-ID: 1286449728.2304.339.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2010-10-07 at 11:46 +0200, Markus Wanner wrote:
> On 10/06/2010 10:01 PM, Simon Riggs wrote:
> > The code to implement your desired option is
> > more complex and really should come later.
>
> I'm sorry, but I think of that exactly the opposite way.

I see why you say that. Dimitri's suggestion is an enhancement on the
basic feature, just as Heikki's is. My reply was directed at Heikki, but
should apply to Dimitri's idea also.

> The timeout for
> automatic continuation after waiting for a standby is the addition. The
> wait state of the master is there anyway, whether or not it's bound by a
> timeout. The timeout option should thus come later.

Adding timeout is very little code. We can take that out of the patch if
that's an objection.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 11:48:49
Message-ID: 4CADB3A1.2010502@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/07/2010 01:08 PM, Simon Riggs wrote:
> Adding timeout is very little code. We can take that out of the patch if
> that's an objection.

Okay. If you take it out, we are at the wait-forever option, right?

If not, I definitely don't understand how you envision things to happen.
I've been asking [1] about that distinction before, but didn't get a
direct answer.

Regards

Markus Wanner

[1]: Re: Configuring synchronous replication, Markus Wanner:
http://archives.postgresql.org/message-id/4C9C5887.4040901@bluegap.ch


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 12:21:32
Message-ID: AANLkTikKOaCJg=NoYmg1GOXic-uzNAQ9YUY2vCptqQHH@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 3:30 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Yes, lets get k = 1 first.
>
> With k = 1 the number of standbys is not limited, so we can still have
> very robust and highly available architectures. So we mean
> "first-acknowledgement-releases-waiters".

+1. I like the design Greg Smith proposed yesterday (though there are
details to be worked out).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 12:54:58
Message-ID: 4CADC322.1000700@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Salut Dimitri,

On 10/07/2010 12:32 PM, Dimitri Fontaine wrote:
> Another one is to say that I want sync rep when the standby is
> available, but I don't have the budget for more. So I prefer a good
> alerting system and low-budget-no-guarantee when the standby is down,
> that's my risk evaluation.

I think that's a pretty special case, because the "good alerting system"
is at least as expensive as another server that just persistently stores
and ACKs incoming WAL.

Why does one ever want the guarantee that sync replication gives to only
hold true up to one failure, if a better guarantee doesn't cost anything
extra? (Note that a "good alerting system" is impossible to achieve with
only two servers. You need a third device anyway).

Or put another way: a "good alerting system" is one that understands
Postgres to some extent. It protects you from data loss in *every* case.
If you attach at least two database servers to it, you get availability
as long as any one of the two is up and running. No matter what happened
before, even a full cluster power outage is guaranteed to be recovered
from automatically, without any data loss.

[ Okay, the standby mode that only stores and ACKs WAL without a full
database behind it still needs to be written. However, pg_streamrecv
certainly goes in that direction already; see [1]. ]

Sync replication between really just two servers is asking for trouble
and certainly not worth the savings in hardware cost. Better invest in a
good UPS and redundant power supplies for a single server.

> The question then is, should the new configuration alter running
> transactions?

It should definitely affect all currently running and waiting
transactions. For anything beyond three servers, where quorum_commit
could be bigger than one, it absolutely makes sense to be able to just
lower the requirements temporarily, instead of having to cancel the
guarantee completely.
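
As a minimal sketch of that idea, assuming a quorum_commit parameter as
discussed in this thread (the name and reload behaviour are assumptions,
not an existing GUC):

-- Hypothetical: quorum_commit is only a name used in this discussion,
-- not a real parameter. Lower the quorum in postgresql.conf, e.g. from
--   quorum_commit = 2
-- to
--   quorum_commit = 1
-- and then ask the server to re-read its configuration:
SELECT pg_reload_conf();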

Regards

Markus Wanner

[1]: Using streaming replication as log archiving, Magnus Hagander
http://archives.postgresql.org/message-id/AANLkTi=_BzsYT8a1KjtpWZxNWyYgqNVp1NbJWRnsD_Nv@mail.gmail.com


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 13:19:49
Message-ID: m2tykyjg22.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> Why does one ever want the guarantee that sync replication gives to only
> hold true up to one failure, if a better guarantee doesn't cost anything
> extra? (Note that a "good alerting system" is impossible to achieve with
> only two servers. You need a third device anyway).

I think you're all into durability, and that's good. The extra cost is
service downtime if that's not what you're after: there's also
availability, and load-balancing read queries on a system with no lag (no
stale data served) when all is working right.

I still think your use case is a solid one, but we need to be ready
to answer some other ones, which you call relaxed and wrong because of the
data loss risks. My proposal is to make the risk window obvious and the
behavior when you enter it configurable.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 13:41:54
Message-ID: AANLkTinF4c0cjtJ-kMt=tSDsScmsLWaEzfSimbwGsCOX@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 6:32 AM, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr> wrote:

> Or if the standby is lagging and the master wal_keep_segments is not
> sized big enough. Is that a catastrophic loss of the standby too?

Sure, but that lagged standby is already asynchronous, not
synchronous. If it were synchronous, it would have slowed the master
down enough that it would not be lagged.

I'm really confused by all these k < N scenarios I see bandied
about, because all they really amount to is "I only want *one*
synchronous replication, and a bunch of asynchronous replications".
And a bit of chance thrown into the mix to hope the "synchronous" one is
pretty stable and the asynchronous ones aren't *too* far behind (define
"too" and "far" at your leisure).

And then I see a lot of posturing about how to "recover" when the
"asynchronous standbys" aren't "synchronous enough" at some point...

>
> Agreed, that's a nice simple use case.
>
> Another one is to say that I want sync rep when the standby is
> available, but I don't have the budget for more. So I prefer a good
> alerting system and low-budget-no-guarantee when the standby is down,
> that's my risk evaluation.

That screams wrong in my books:

"OK, I want durability, so I always want to have 2 copies of the data,
but if we lose one copy, I want to keep on trucking, because I don't
*really* want durability".

If you want most-of-the-time, mostly-2-copy durability, then really
good asynchronous replication is a really good solution.

Yes, I believe you need to have a way for an admin (or
process/control/config) to be able to "demote" a synchronous
replication scenario into async (or "standalone", which is just an
extension of really async). But it's no longer synchronous replication
at that point. And if the choice is made to "keep trucking" while a
new standby is being brought online and available and caught up,
that's fine too. But during that period, until the slave is caught
up and synchronously replicating, it's *not* synchronous replication.

So I'm not arguing that there shouldn't be a way to turn off
synchronous replication once it's on, hopefully without having to
take down the cluster (pg-instance-type cluster). But I am pleading
that there is a way to set up PG such that synchronous replication *is*
synchronously replicating, or things stop and back up until such a time
as it is.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 14:08:01
Message-ID: m2ocb6hz9a.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Aidan Van Dyk <aidan(at)highrise(dot)ca> writes:
> Sure, but that lagged standby is already asynchronous, not
> synchronous. If it were synchronous, it would have slowed the master
> down enough that it would not be lagged.

Agreed, except in the case of a joining standby. But you're saying it
better than I do:

> Yes, I believe you need to have a way for an admin (or
> process/control/config) to be able to "demote" a synchronous
> replication scenario into async (or "standalone", which is just an
> extension of really async). But it's no longer synchronous replication
> at that point. And if the choice is made to "keep trucking" while a
> new standby is being brought online and available and caught up,
> that's fine too. But during that period, until the slave is caught
> up and synchronously replicating, it's *not* synchronous replication.

That's exactly my point. I think we need to handle the case and make it
obvious that this window is a data-loss window where there's no sync rep
ongoing, then offer users a choice of behaviour.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 14:26:10
Message-ID: AANLkTim=mpXd-ZdoYN8yCsR80m+YiKMT0oiW3_gT9FcU@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 10:08 AM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Aidan Van Dyk <aidan(at)highrise(dot)ca> writes:
>> Sure, but that lagged standby is already asynchronous, not
>> synchronous.  If it were synchronous, it would have slowed the master
>> down enough that it would not be lagged.
>
> Agreed, except in the case of a joining standby.

*shrug* The joining standby is still asynchronous at this point.
It's not synchronous replication. It's just another ^k of the N
slaves serving stale data ;-)

> But you're saying it
> better than I do:
>
>> Yes, I believe you need to have a way for an admin (or
>> process/control/config) to be able to "demote" a synchronous
>> replication scenario into async (or "standalone", which is just an
>> extension of really async).  But it's no longer synchronous replication
>> at that point.  And if the choice is made to "keep trucking" while a
>> new standby is being brought online and available and caught up,
>> that's fine too.  But during that period, until the slave is caught
>> up and synchronously replicating, it's *not* synchronous replication.
>
> That's exactly my point. I think we need to handle the case and make it
> obvious that this window is a data-loss window where there's no sync rep
> ongoing, then offer users a choice of behaviour.

Again, I'm stating there is *no* choice in synchronous replication.
It's *got* to block, otherwise it's not synchronous replication. The
"choice" is if you want synchronous replication or not at that point.

And turning it off might be a good (best) choice for most people.
I just want to make sure that:
1) There's no way to *sensibly* think it's still "synchronously replicating"
2) There is a way to enforce that the commits happening *are*
synchronously replicating.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 15:43:03
Message-ID: m239sihuuw.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Aidan Van Dyk <aidan(at)highrise(dot)ca> writes:
> *shrug* The joining standby is still asynchronous at this point.
> It's not synchronous replication. It's just another ^k of the N
> slaves serving stale data ;-)

Agreed *here*, but if you read the threads again, you'll see that's not
at all what's been talked about before my proposal.

In particular, the questions about how to unlock a master's setup while
its synced standby is doing a base backup should not be allowed to
exist, and you seem to agree with my point.

>> That's exactly my point. I think we need to handle the case and make it
>> obvious that this window is a data-loss window where there's no sync rep
>> ongoing, then offer users a choice of behaviour.
>
> Again, I'm stating there is *no* choice in synchronous replication.
> It's *got* to block, otherwise it's not synchronous replication. The
> "choice" is if you want synchronous replication or not at that point.

Exactly, even if I didn't dare spell it this way.

What I want to propose is for the user to be able to configure things so
that he loses the sync aspect of the replication if it so happens that
the setup is not able to provide for it.

It may sound strange, but it's needed when all you want is a no-stale-data
reporting standby, for example. And it so happens that it's already in
Simon's code, AFAIUI (I have yet to read it).

> And turning it off might be a good (best) choice for most people.
> I just want to make sure that:
> 1) There's no way to *sensibly* think it's still "synchronously replicating"
> 2) There is a way to enforce that the commits happening *are*
> synchronously replicating.

We're on the same track. I don't know how to offer your options without
a clear listing of standby states and transitions, which must include
the synchronicity and whether you just lost it or whatnot.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 16:41:55
Message-ID: 4CADF853.30701@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner wrote:
> I think that's a pretty special case, because the "good alerting system"
> is at least as expensive as another server that just persistently stores
> and ACKs incoming WAL.
>

The cost of hardware capable of running a database server is a large
multiple of what you can build an alerting machine for. I have two
systems at my house that are approaching the trash heap, relative to
the main work I do, but that are fully capable of running an alerting
system. Building a production quality database server requires a more
significant investment: high quality disks, ECC RAM, battery-backed
RAID controller, etc. Relative to what the hardware in a database
server costs, what you need to build an alerting system is almost free.
Oh: and most businesses that are complicated enough to need a serious
database server already have them, so they actually cost nothing beyond
the software setup time to point them toward the databases, too.

> Why does one ever want the guarantee that sync replication gives to only
> hold true up to one failure, if a better guarantee doesn't cost anything
> extra? (Note that a "good alerting system" is impossible to achieve with
> only two servers. You need a third device anyway).
>

I do not disagree with your theory or reasoning. But as a practical
matter, I'm afraid the true cost of the better guarantee you're
suggesting here is additional code complexity that will likely cause
this feature to miss 9.1 altogether. As far as I'm concerned, this
whole diversion into the topic of quorum commit is only consuming
resources away from targeting something achievable in the time frame of
a single release.

> Sync replication between really just two servers is asking for trouble
> and certainly not worth the savings in hardware cost. Better invest in a
> good UPS and redundant power supplies for a single server.
>

I wish I could give you the long list of data recovery projects I've
worked on over the last few years, so you could really appreciate how
much what you're saying here is exactly the opposite of reality. You
cannot make a single server reliable enough to survive all of
the things that Murphy's Law will inflict upon it, at any price. For
most of the businesses I work with who want sync rep, data is not
considered safe until the second copy is on storage miles away from the
original, because they know this too.

Personal anecdote I can share: I used to have an important project
related to stock trading where I kept my backup system about 50 miles
away from me. I was aiming for constant availability, while still being
able to drive to the other server if needed for disaster recovery.
Guess what? Even those two turned out not to be nearly independent
enough; see http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003 for
details of how I lost both of those at the same time for days. Silly
me, I'd only spread them across two adjacent states with different power
providers! Not nearly good enough to avoid a correlated failure.

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 17:22:15
Message-ID: 4CAE01C7.7070505@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/7/10 6:41 AM, Aidan Van Dyk wrote:
> I'm really confused by all these k < N scenarios I see bandied
> about, because all they really amount to is "I only want *one*
> synchronous replication, and a bunch of asynchronous replications".
> And a bit of chance thrown into the mix to hope the "synchronous" one is
> pretty stable and the asynchronous ones aren't *too* far behind (define
> "too" and "far" at your leisure).

Effectively, yes. The difference between k of N synch rep and 1
synch standby + several async standbys is that in k of N, you have a
pool and aren't dependent on having a specific standby be very reliable,
just that any one of them is.

So if you have k = 3 and N = 10, then you can have 10 standbys and only
3 of them need to ack any specific commit for the master to proceed. As
long as (a) you retain at least one of the 3 which ack'd, and (b) you
have some way of determining which standby is the most "caught up", data
loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong
4, to lose data.
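
As a rough sketch of (b), assuming the pg_stat_replication view that 9.1
adds (the positions are reported as text there, so real tooling needs a
proper LSN-aware comparison rather than plain text ordering):

-- Sketch only: list the connected standbys with how far each one has
-- flushed and replayed WAL, to judge which of them is most caught up.
SELECT application_name, state, flush_location, replay_location
FROM pg_stat_replication;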

The advantage of this for availability over just having k = N = 3 comes
when one of the standbys is responding slowly (due to traffic) or goes
offline unexpectedly due to a hardware failure. In the k = N = 3 case,
the system halts. In the k = 3, N = 10 case, you can lose up to 7
standbys without the system going down.

It's notable that the massively scalable transactional databases
(Dynamo, Cassandra, various telecom databases, etc.) all operate this way.

However, I do consider this "advanced" functionality and not worth
pursuing until we have the k = 1 case implemented and well-tested. For
comparison, Cassandra, Hypertable and Riak have been working on their k
< N functionality for a couple years now and none of them has it stable
*and* fast.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 17:44:43
Message-ID: AANLkTi=B0d75Pf4W4GUgKVHhCJs_Rh=CMWNp5xfT40B_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 1:22 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> So if you have k = 3 and N = 10, then you can have 10 standbys and only
> 3 of them need to ack any specific commit for the master to proceed. As
> long as (a) you retain at least one of the 3 which ack'd, and (b) you
> have some way of determining which standby is the most "caught up", data
> loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong
> 4, to lose data.
>
> The advantage of this for availability over just having k = N = 3 comes
> when one of the standbys is responding slowly (due to traffic) or goes
> offline unexpectedly due to a hardware failure.  In the k = N = 3 case,
> the system halts.  In the k = 3, N = 10 case, you can lose up to 7
> standbys without the system going down.

Sure, but here is where I might not be following.

If you want "synchronous replication" because you want "query
availabilty" while making sure you're not getting "stale" queries from
all your slaves, than using your k < N (k = 3 and N - 10) situation is
screwing your self.

To get "non-stale" responses, you can only query those k=3 servers.
But you've shot yourself in the foot because you don't know which
3/10 those will be. The other 7 *are* stale (by definition). They
talk about picking the "caught up" slave when the master fails, but
you actually need to do that for *every query*.

If you say they are "pretty close so by the time you get the query to
them they will be caught up", well then, all you really want is good
async replication, you don't really *need* the synchronous part.

The only case where I see a "race to quorum" type of k < N being useful
is if you're just trying to duplicate data everywhere, but not actually
querying any of the replicas. I can see that "all queries go to the
master, but the chances are pretty high that multiple machines are
going to fail so I want >> multiple replicas" being useful, but I
*don't* think that's what most people are wanting in their "I want 3
of 10 servers to ack the commit".

The difference between good async and sync is only the *guarantee*.
If you don't need the guarantee, you don't need the synchronous part.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 17:48:21
Message-ID: 4CAE07E5.6090403@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> If you want "synchronous replication" because you want "query
> availabilty" while making sure you're not getting "stale" queries from
> all your slaves, than using your k < N (k = 3 and N - 10) situation is
> screwing your self.

Correct. If that is your reason for synch standby, then you should be
using a k = N configuration.

However, some people are willing to sacrifice consistency for durability
and availability. We should give them that option (eventually), since
among that triad you can never have more than two.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 17:50:40
Message-ID: 4CAE0870.7010307@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/07/2010 06:41 PM, Greg Smith wrote:
> The cost of hardware capable of running a database server is a large
> multiple of what you can build an alerting machine for.

You realize you don't need lots of disks or RAM for a box that only
ACKs? A box with two SAS disks and a BBU isn't that expensive anymore.

> I do not disagree with your theory or reasoning. But as a practical
> matter, I'm afraid the true cost of the better guarantee you're
> suggesting here is additional code complexity that will likely cause
> this feature to miss 9.1 altogether. As far as I'm concerned, this
> whole diversion into the topic of quorum commit is only consuming
> resources away from targeting something achievable in the time frame of
> a single release.

So far I've been under the impression that Simon already has the code
for quorum_commit k = 1.

What I'm opposed to is the timeout "feature", which I consider to be
additional code, unneeded complexity and a foot-gun.

> You cannot make a single server reliable enough to survive all of
> the things that Murphy's Law will inflict upon it, at any price.

That's exactly what I'm saying applies to two servers as well. And that's
why a timeout is a bad thing here: the chance that the second node fails
as well is there (and is higher than you think, according to Murphy).

> For
> most of the businesses I work with who want sync rep, data is not
> considered safe until the second copy is on storage miles away from the
> original, because they know this too.

Now, those are the people who really need sync rep, yes. How happy do you
think those businesses would be to find out that Postgres is
cheating on them in case of a network outage, for example? Do they
really value (write!) availability more than data safety?

> Silly
> me, I'd only spread them across two adjacent states with different power
> providers! Not nearly good enough to avoid a correlated failure.

Thanks for sharing this. I hope you didn't lose data.

Regards

Markus Wanner


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 17:59:06
Message-ID: 4CAE0A6A.9070301@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> But as a practical matter, I'm afraid the true cost of the better
> guarantee you're suggesting here is additional code complexity that will
> likely cause this feature to miss 9.1 altogether. As far as I'm
> concerned, this whole diversion into the topic of quorum commit is only
> consuming resources away from targeting something achievable in the time
> frame of a single release.

Yes. My purpose in starting this thread was to show that k > 1 "quorum
commit" is considerably more complex than the people who have been
bringing it up in other threads seem to think it is. It is not
achievable for 9.1, and maybe not even for 9.2.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Dimitri Fontaine" <dimitri(at)2ndquadrant(dot)fr>, "Markus Wanner" <markus(at)bluegap(dot)ch>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 18:10:19
Message-ID: 4CADC6BB0200002500036657@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:

> To get "non-stale" responses, you can only query those k=3
> servers. But you've shot yourself in the foot because you don't
> know which 3/10 those will be. The other 7 *are* stale (by
> definition). They talk about picking the "caught up" slave when
> the master fails, but you actually need to do that for *every
> query*.

With web applications, at least, you often don't care that the data
read is absolutely up-to-date, as long as the point in time doesn't
jump around from one request to the next. When we have used load
balancing between multiple database servers (which has actually
become unnecessary for us lately because PostgreSQL has gotten so
darned fast!), we have established affinity between a session and
one of the database servers, so that if they became slightly out of
sync, data would not pop in and out of existence arbitrarily. I
think a reasonable person could combine this technique with a "3 of
10" synchronous replication quorum to get both safe persistence of
data and reasonable performance.

I can also envision use cases where this would not be desirable.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 18:13:21
Message-ID: AANLkTinvgi3ZgzkpB3DLSrep43M_M4_8Cq6gXz3gyxaA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 2:10 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
>
>> To get "non-stale" responses, you can only query those k=3
>> servers.  But you've shot yourself in the foot because you don't
>> know which 3/10 those will be.  The other 7 *are* stale (by
>> definition). They talk about picking the "caught up" slave when
>> the master fails, but you actually need to do that for *every
>> query*.
>
> With web applications, at least, you often don't care that the data
> read is absolutely up-to-date, as long as the point in time doesn't
> jump around from one request to the next.  When we have used load
> balancing between multiple database servers (which has actually
> become unnecessary for us lately because PostgreSQL has gotten so
> darned fast!), we have established affinity between a session and
> one of the database servers, so that if they became slightly out of
> sync, data would not pop in and out of existence arbitrarily.  I
> think a reasonable person could combine this technique with a "3 of
> 10" synchronous replication quorum to get both safe persistence of
> data and reasonable performance.
>
> I can also envision use cases where this would not be desirable.

Well, keep in mind all updates have to be done on the single master.
That works pretty well for fine-grained replication, but I don't think
it's very good for full-cluster replication.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Dimitri Fontaine" <dimitri(at)2ndquadrant(dot)fr>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Markus Wanner" <markus(at)bluegap(dot)ch>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 18:31:53
Message-ID: 4CADCBC90200002500036669@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>> With web applications, at least, you often don't care that the
>> data read is absolutely up-to-date, as long as the point in time
>> doesn't jump around from one request to the next. When we have
>> used load balancing between multiple database servers (which has
>> actually become unnecessary for us lately because PostgreSQL has
>> gotten so darned fast!), we have established affinity between a
>> session and one of the database servers, so that if they became
>> slightly out of sync, data would not pop in and out of existence
>> arbitrarily. I think a reasonable person could combine this
>> technique with a "3 of 10" synchronous replication quorum to get
>> both safe persistence of data and reasonable performance.
>>
>> I can also envision use cases where this would not be desirable.
>
> Well, keep in mind all updates have to be done on the single
> master. That works pretty well for fine-grained replication, but
> I don't think it's very good for full-cluster replication.

I'm completely failing to understand your point here. Could you
restate another way?

-Kevin


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 18:38:17
Message-ID: 4CAE1399.6090106@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/07/2010 03:19 PM, Dimitri Fontaine wrote:
> I think you're all into durability, and that's good. The extra cost is
> service downtime

It's just *reduced* availability. That doesn't necessarily mean
downtime, if you combine it cleverly with async replication.

> if that's not what you're after: there's also
> availability and load balancing read queries on a system with no lag (no
> stale data servicing) when all is working right.

All I'm saying is that those use cases are much better served with async
replication. Maybe together with something that warns and takes action
in case the standby's lag gets too big.

Or what kind of customers do you think really need a no-lag solution for
read-only queries? In the LAN case, the lag of async rep is negligible
and in the WAN case the latencies of sync rep are prohibitive.

> My proposal is to make the risk window obvious and the
> behavior when you enter it configurable.

I don't buy that. The risk calculation gets a lot simpler and more obvious
with strict guarantees.

Regards

Markus Wanner


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 18:51:30
Message-ID: 4CAE16B2.50206@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/07/2010 07:44 PM, Aidan Van Dyk wrote:
> The only case where I see a "race to quorum" type of k < N being useful
> is if you're just trying to duplicate data everywhere, but not actually
> querying any of the replicas. I can see that "all queries go to the
> master, but the chances are pretty high that multiple machines are
> going to fail so I want >> multiple replicas" being useful, but I
> *don't* think that's what most people are wanting in their "I want 3
> of 10 servers to ack the commit".

What else do you think they want it for, if not for protection against
data loss?

(Note that the queries don't need to go to the master exclusively if you
can live with some lag - and I think the vast majority of people can.
The zero data loss guarantee holds true in any case, though).

> The difference between good async and sync is only the *guarantee*.
> If you don't need the guarantee, you don't need the synchronous part.

Here we are exactly on the same page again.

Regards

Markus Wanner


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 19:19:15
Message-ID: AANLkTin5OsOtETQpdWkD25Tw8QkRiuyyfQ5HCvE=FQfb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 2:31 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>>> With web applications, at least, you often don't care that the
>>> data read is absolutely up-to-date, as long as the point in time
>>> doesn't jump around from one request to the next.  When we have
>>> used load balancing between multiple database servers (which has
>>> actually become unnecessary for us lately because PostgreSQL has
>>> gotten so darned fast!), we have established affinity between a
>>> session and one of the database servers, so that if they became
>>> slightly out of sync, data would not pop in and out of existence
>>> arbitrarily.  I think a reasonable person could combine this
>>> technique with a "3 of 10" synchronous replication quorum to get
>>> both safe persistence of data and reasonable performance.
>>>
>>> I can also envision use cases where this would not be desirable.
>>
>> Well, keep in mind all updates have to be done on the single
>> master.  That works pretty well for fine-grained replication, but
>> I don't think it's very good for full-cluster replication.
>
> I'm completely failing to understand your point here.  Could you
> restate another way?

Establishing an affinity between a session and one of the database
servers will only help if the traffic is strictly read-only.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 19:26:35
Message-ID: m2zkupg5xw.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> I don't buy that. The risk calculation gets a lot simpler and obvious
> with strict guarantees.

Ok, I'm lost in the use cases and analysis.

I still don't understand why you want to consider the system already
synchronous when it's not, whatever guarantee you're asking for.

All I'm saying is that we should be able to know and show what the
current system is up to, and we should be able to offer sane reactions
in case of errors.

You're calling blocking the master entirely a sane reaction when the
standby isn't ready yet (it's still at the base-backup state), and I can
live with that. As an option.

I say that either we go the lax quorum route, or we have to care for the
details and summarize the failure cases with precision, and the possible
responses with care. I don't see that being possible without a clear state
for each element in the system, their transitions, and a way to derive the
global state of the distributed system out of that.

It might be that the simpler way to go here is what Greg Smith has been
proposing for a long time already, and again quite recently on this
thread: have all the information you need in a system table and offer to
run a user defined function to determine the state of the system.

I think we managed to show what Josh Berkus wanted to know: that it's a
quagmire here. Now, the problem I have is not Quorum Commit but the very
definition of synchronous replication and the system we're trying to
build. I'm not sure there are two of us wanting the same thing here.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Dimitri Fontaine" <dimitri(at)2ndquadrant(dot)fr>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Markus Wanner" <markus(at)bluegap(dot)ch>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "Jeff Davis" <pgsql(at)j-davis(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 19:28:50
Message-ID: 4CADD922020000250003667A@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Establishing an affinity between a session and one of the database
> servers will only help if the traffic is strictly read-only.

Thanks; I now see your point.

In our environment, that's pretty common. Our most heavily used web
app (the one for which we have, at times, needed load balancing)
connects to the database with a read-only login. Many of our web
apps do their writing by posting to queues which are handled at the
appropriate source database later. (I had the opportunity to use
one of these "for real" last night, to fill in a juror questionnaire
after receiving a summons from the jury clerk in the county where I
live.)

Like I said, there are sane cases for this usage, but it won't fit
everybody. I have no idea on percentages.

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 22:25:36
Message-ID: 1286490336.2304.400.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote:

> To get "non-stale" responses, you can only query those k=3 servers.
> But you've shot yourself in the foot because you don't know which
> 3/10 those will be. The other 7 *are* stale (by definition). They
> talk about picking the "caught up" slave when the master fails, but
> you actually need to do that for *every query*.

There is a big confusion around that point and I need to point out that
statement isn't accurate. It's taken me a long while to understand this.

Asking for k > 1 does *not* mean those servers are time synchronised.
All it means is that the master will stop waiting after 3
acknowledgements. There is no connection between the master receiving
acknowledgements and the standby applying changes received from master;
the standbys are all independent of one another.

In a bad case, those 3 acknowledgements might happen say 5 seconds apart
on the worst and best of the 3 servers. So the first standby to receive
the data could have applied the changes ~4.8 seconds prior to the 3rd
standby. There is still a chance of reading stale data on one standby,
but reading fresh data on another server. In most cases the time window
is small, but still exists.

The other 7 are stale with respect to the first 3. But then so are the
last 9 compared with the first one. The value of k has nothing
whatsoever to do with the time difference between the master and the
last standby to receive/apply the changes. The gap between first and
last standby (i.e. N, not k) is the time window during which a query
might/might not see a particular committed result.

So standbys are eventually consistent whether or not the master relies
on them to provide an acknowledgement. The only place where you can
guarantee non-stale data is on the master.

High values of k reduce the possibility of data loss, whereas expected
cluster availability is reduced as N - k gets smaller.
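
To make that window concrete, here is a sketch of measuring each standby's
replay gap from the master; it assumes pg_xlog_location_diff(), which only
shows up in 9.2, on top of the pg_stat_replication view:

-- Sketch: bytes of WAL each standby still has to replay. The spread
-- between the smallest and largest value is the window during which a
-- committed change may be visible on one standby but not yet on another.
SELECT application_name,
       pg_xlog_location_diff(pg_current_xlog_location(),
                             replay_location) AS replay_lag_bytes
FROM pg_stat_replication
ORDER BY replay_lag_bytes;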

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 22:30:36
Message-ID: 1286490636.2304.403.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2010-10-07 at 19:50 +0200, Markus Wanner wrote:

> So far I've been under the impression that Simon already has the code
> for quorum_commit k = 1.

I do, but it's not a parameter. The k = 1 behaviour is hardcoded and
considerably simplifies the design. Moving to k > 1 is additional work,
slows things down and seems likely to be fragile.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 22:31:25
Message-ID: 4CAE4A3D.6040306@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

All,

> Establishing an affinity between a session and one of the database
> servers will only help if the traffic is strictly read-only.

I think this thread has drifted very far away from anything we're going
to do for 9.1. And it seems to have little to do with synchronous replication.

Synch rep ensures durability. It is not, by itself, a method of
ensuring consistency, nor does it pretend to be one.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-07 23:44:27
Message-ID: 4CAE5B5B.2090600@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner wrote:
> So far I've been under the impression that Simon already has the code
> for quorum_commit k = 1.
>
> What I'm opposed to is the timeout "feature", which I consider to be
> additional code, unneeded complexity and a foot-gun.
>

Additional code? Yes. Foot-gun? Yes. Timeout should be disabled by
default so that you get wait forever unless you ask for something
different? Probably. Unneeded? This is where we don't agree anymore.
The example that Josh Berkus just sent to the list is a typical example
of what I expect people to do here. They'll use Sync Rep to maximize
the odds a system failure doesn't cause any transaction loss. They'll
use good quality hardware on the master so it's unlikely to fail. But
when the database finds the standby unreachable, and it's left with the
choice between either degrading into async rep or coming to a complete
halt, you must give people the option of choosing to degrade instead
after a timeout. Let them set off the red flashing lights, sound the
alarms, and pray the master doesn't go down until you can fix the
problem. But the choice to allow uptime concerns to win over the normal
sync rep preferences, that's a completely valid business decision people
will absolutely want to make in a way opposite of your personal
preference here.

I don't see this as needing an implementation any more complicated than
the usual way such timeouts are handled. Note how long you've been
trying to reach the standby. Default to -1 for forever. And if you hit
the timeout, mark the standby as degraded and force them to do a proper
resync when they disconnect. Once that's done, they can re-enter
sync rep mode again, via the same process a new node would use.

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance" Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 02:01:54
Message-ID: AANLkTimtB40UV=oy9sVK2dY36ypu-3vsoxHes8jn+O=V@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 6, 2010 at 6:11 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> Yeah, sounds more likely. Then I'm surprised that I didn't find any
> warning that the Protocol C definitely reduces availability (with the
> ko-count=0 default, that is).

Really? I don't think that ko-count=0 means "wait-forever". IIRC,
when I tried DRBD, I could write data to the master's DRBD disk without
a connected standby. So I think that by default the master waits for
the timeout and works alone when the standby goes down.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 02:24:35
Message-ID: AANLkTimTRDEjxPq1uYJ1JQR3F4--Vs5wJinunkePAaWu@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> In general, salvaging the WAL that was not sent to the standby yet is
> outright impossible. You can't achieve zero data loss with asynchronous
> replication at all.

No. That depends on the type of failure. Unless the disk in the master has
been corrupted, we might be able to salvage WAL.

>> If we want only no data loss, we have only to implement the wait-forever
>> option. But if we make consideration for the above-mentioned availability,
>> the return-immediately option also would be required.
>>
>> In some (many, I think) cases, I think that we need to consider
>> availability
>> and no data loss together, and consider the balance of them.
>
> If you need both, you need three servers as Simon pointed out earlier. There
> is no way around that.

No. That depends on how far you'd like to ensure no data loss.

People who use a shared-disk failover solution with one master and one standby
don't need such high durability. They can avoid data loss by using something
like RAID to a certain extent. So it's not a problem for them to run the master
alone after failover happens or the standby goes down. But something like RAID
cannot increase availability. Synchronous replication is a solution for that
purpose.

Of course, if we are worried about running the master alone, we can increase
the number of standbys. Furthermore, if we'd like to avoid data loss from a
disaster which destroys all the servers at the same time, we might need to
increase the number of standbys further and locate some of them at a remote site.

Please consider that "return immediately" (i.e., a small timeout) is useful
for some use cases.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 02:35:56
Message-ID: AANLkTik9Y8BVc2mgeQEECj5ybm1T3tuGigwjyqHYqEW0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 10:24 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> In general, salvaging the WAL that was not sent to the standby yet is
>> outright impossible. You can't achieve zero data loss with asynchronous
>> replication at all.
>
> No. That depends on the type of failure. Unless the disk in the master has
> been corrupted, we might be able to salvage WAL.

So I guess another way to say this is that zero data loss is
unachievable, period. Greg Smith made a flip comment about having
been so silly as to only put his redundant servers in adjacent states
on different power grids, and yet still having an outage due to the
Northeast blackouts. So what would he have had to do to completely
rule out a correlated failure?

Answer: It can't be done. If a massive asteroid comes zooming into
the inner solar system tomorrow and hits the earth, obliterating all
life, you're toast. Or likewise if nuclear war ensues. You could put
your redundant server on the moon or, better yet, on a moon of one of
the outer planets, but the hosting costs are pretty high and the ping
times suck.

So the point is that the question is not whether or not a correlated
failure can happen, but whether you can imagine a scenario where a
correlated failure has occurred yet you still wish you had your data.
Different people will, obviously, draw that line in different places.
Let's start by doing something simple that covers SOME of the cases
people want, get it committed, and then move on from there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 02:52:29
Message-ID: AANLkTimhj+aXXtrV-chhzqVocNSz1bQbRJptQHgNMkw_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 6, 2010 at 9:22 PM, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr> wrote:
> From my experience operating londiste, those states would be:
>
>  1. base-backup  — self explaining
>  2. catch-up     — getting the WAL to catch up after base backup
>  3. wanna-sync   — don't yet have all the WAL to get in sync
>  4. do-sync      — all WALs are there, coming soon
>  5. ok (async | recv | fsync | reply — feedback loop engaged)

I agree that we should manage these standby states, though from a different standpoint.

To avoid data loss, we must not promote a standby that is still only
halfway caught up with the master when we fail over. If the clusterware
can get the current standby state via SQL, it can check whether the
failover would cause data loss and give up before creating the trigger file.
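As a rough sketch, with the functions that already exist in 9.0 (the
comparison against the master's last known WAL location is assumed to be
done by the clusterware itself):

  -- run on the candidate standby before creating the trigger file
  SELECT pg_last_xlog_receive_location();  -- WAL received so far
  SELECT pg_last_xlog_replay_location();   -- WAL replayed so far
  -- promote only if these are not behind the last WAL location known to
  -- have been flushed on the master; otherwise give up the failover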

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Markus Wanner <markus(at)bluegap(dot)ch>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 03:30:23
Message-ID: AANLkTi=ysQ0SpofKc_=tX5mt6c8XXFcPm44z8qYM0F-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 5:01 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> You seem willing to trade anything for that guarantee. I seek a more
> pragmatic approach that balances availability and risk.
>
> Those views are different, but not inconsistent. Oracle manages to offer
> multiple options and so can we.

+1

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 03:41:29
Message-ID: AANLkTi=f7TtF2=sKeHg+=E1puZmV=brtuUwMinPp9iNS@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 7, 2010 at 3:01 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> Of course, it doesn't make sense to wait-forever on *every* standby that
> ever gets added. Quorum commit is required, yes (and that's what this
> thread is about, IIRC). But with quorum commit, adding a standby only
> improves availability, but certainly doesn't block the master in any
> way.

But even with quorum commit, if you choose the wait-forever option,
failover would decrease availability. Right after the failover, no
standby has connected to the new master, so if quorum >= 1, all
transactions must wait for a while.

Basically, we need to take a base backup from the new master to start
the standbys and make them connect to it. This might take a long time,
and since transaction commits cannot advance during that time,
availability goes down.

Or do you think the wait-forever option should be applied only when a
standby goes down?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 05:30:03
Message-ID: 1286515803.19723.59.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2010-10-07 at 19:44 -0400, Greg Smith wrote:

> I don't see this as needing any implementation any more complicated than
> the usual way such timeouts are handled. Note how long you've been
> trying to reach the standby. Default to -1 for forever. And if you hit
> the timeout, mark the standby as degraded and force them to do a proper
> resync when they disconnect. Once that's done, then they can re-enter
> sync rep mode again, via the same process a new node would have done so.

What I don't understand is why this isn't obvious to everyone. Greg, this
is very well put, and -hackers need to start thinking like people who
actually use the database.

JD
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 05:52:09
Message-ID: AANLkTik49weTKO6MBVjTPunnqYjP-EWJ7PsuXroEdYf2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 8, 2010 at 8:44 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Additional code?  Yes.  Foot-gun?  Yes.  Timeout should be disabled by
> default so that you get wait forever unless you ask for something different?
>  Probably.  Unneeded?  This is where we don't agree anymore.  The example
> that Josh Berkus just sent to the list is a typical example of what I expect
> people to do here.  They'll use Sync Rep to maximize the odds a system
> failure doesn't cause any transaction loss.  They'll use good quality
> hardware on the master so it's unlikely to fail.  But when the database
> finds the standby unreachable, and it's left with the choice between either
> degrading into async rep or coming to a complete halt, you must give people
> the option of choosing to degrade instead after a timeout.  Let them set off
> the red flashing lights, sound the alarms, and pray the master doesn't go
> down until you can fix the problem.  But the choice to allow uptime concerns
> to win over the normal sync rep preferences, that's a completely valid
> business decision people will absolutely want to make in a way opposite of
> your personal preference here.

Definitely agreed.

> I don't see this as needing any implementation any more complicated than the
> usual way such timeouts are handled.  Note how long you've been trying to
> reach the standby.  Default to -1 for forever.  And if you hit the timeout,
> mark the standby as degraded and force them to do a proper resync when they
> disconnect.  Once that's done, then they can re-enter sync rep mode again,
> via the same process a new node would have done so.

Fair enough.

One question is when this timeout applies. Obviously it should apply
when a standby goes down. But should it also apply when we initially
start the master, and when no standby has yet connected to the new master
after a failover?

I guess that people who want wait-forever would want "timeout = -1"
for all of those cases. Otherwise they cannot guarantee zero data loss.

OTOH, people who don't want wait-forever would not want to wait for the
timeout in the latter two cases. So ISTM that something like an
enable_wait_forever or reaction_after_timeout parameter is required
separately from the timeout.
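
Just to make that concrete, a hypothetical postgresql.conf sketch (neither
GUC exists today; the names merely follow the ones floated above):

  replication_timeout = '30s'            # -1 would mean wait forever
  reaction_after_timeout = standalone    # e.g. 'standalone' (degrade to
                                         # async) or 'wait'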

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 07:13:55
Message-ID: m2wrptdumk.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
[…]
> I don't see this as needing any implementation any more complicated than the
> usual way such timeouts are handled. Note how long you've been trying to
> reach the standby. Default to -1 for forever. And if you hit the timeout,
> mark the standby as degraded and force them to do a proper resync when they
> disconnect. Once that's done, then they can re-enter sync rep mode again,
> via the same process a new node would have done so.

Thank you for this post, which is so much better than anything I could
achieve.

Just wanted to add that it should be possible in lots of cases to have a
standby rejoin the party without going as far back as taking a new base
backup. It depends on wal_keep_segments and the standby's degraded state,
among other parameters (archives, etc.).

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 07:46:46
Message-ID: 4CAECC66.3080509@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 12:30 AM, Simon Riggs wrote:
> I do, but its not a parameter. The k = 1 behaviour is hardcoded and
> considerably simplifies the design. Moving to k > 1 is additional work,
> slows things down and seems likely to be fragile.

Perfect! So I'm all in favor of committing that, but leaving out the
timeout thing, which I think just adds unneeded complexity and
fragility.

Regards

Markus Wanner


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Josh Berkus <josh(at)agliodbs(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 07:52:33
Message-ID: 4CAECDC1.4060004@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon,

On 10/08/2010 12:25 AM, Simon Riggs wrote:
> Asking for k > 1 does *not* mean those servers are time synchronised.

Yes, it's technically impossible to create a fully synchronized cluster
(on the basis of the shared-nothing nodes we are aiming for, that is).
There is always some kind of "lag" on one side or the other.

Maybe the use case for a no-lag cluster doesn't exist, because it's
technically not feasible.

> In a bad case, those 3 acknowledgements might happen say 5 seconds apart
> on the worst and best of the 3 servers. So the first standby to receive
> the data could have applied the changes ~4.8 seconds prior to the 3rd
> standby. There is still a chance of reading stale data on one standby,
> but reading fresh data on another server. In most cases the time window
> is small, but still exists.

Well, the transaction isn't committed on the master, so one could argue
it shouldn't matter. The guarantee just needs to go one way: as soon as
the commit is confirmed to the client, all k standbys need to have it
committed, too. (At least for the "apply" replication level.)

> So standbys are eventually consistent whether or not the master relies
> on them to provide an acknowledgement. The only place where you can
> guarantee non-stale data is on the master.

That's formulated a bit too strongly. With the "apply" replication level,
you should be able to rely on the guarantee that a committed transaction
is visible on at least k standbys. Maybe in advance of the commit on the
master, but I wouldn't call that "stale" data.

Given the current proposals, the master is the one that's "lagging" the
most, compared to the k standbys.

> High values of k reduce the possibility of data loss, whereas expected
> cluster availability is reduced as N - k gets smaller.

Exactly. One addendum: a timeout increases availability at the cost of
increased danger of data loss and higher complexity. Don't use it, just
increase (N - k) instead.
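
(To make the arithmetic concrete: with, say, N = 5 standbys and k = 2,
every commit waits for two acknowledgements, and up to N - k = 3 standbys
can be lost before the master starts blocking.)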

Regards

Markus Wanner


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 07:56:15
Message-ID: 4CAECE9F.9080309@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07.10.2010 21:38, Markus Wanner wrote:
> On 10/07/2010 03:19 PM, Dimitri Fontaine wrote:
>> I think you're all into durability, and that's good. The extra cost is
>> service downtime
>
> It's just *reduced* availability. That doesn't necessarily mean
> downtime, if you combine cleverly with async replication.
>
>> if that's not what you're after: there's also
>> availability and load balancing read queries on a system with no lag (no
>> stale data servicing) when all is working right.
>
> All I'm saying is that those use cases are much better served with async
> replication. Maybe together with something that warns and takes action
> in case the standby's lag gets too big.
>
> Or what kind of customers do you think really need a no-lag solution for
> read-only queries? In the LAN case, the lag of async rep is negligible
> and in the WAN case the latencies of sync rep are prohibitive.

There is a very good use case for that particular set up, actually. If
your hot standby is guaranteed to be up-to-date with any transaction
that has been committed in the master, you can use the standby
interchangeably with the master for read-only queries. Very useful for
load balancing. Imagine a web application that's mostly read-only, but a
user can modify his own personal details like name and address, for
example. Imagine that the user changes his street address and clicks
'save', causing an UPDATE, and the next query fetches that information
again to display to the user. If you use load balancing, the query can
be routed to the hot standby server, and if it lags even 1-2 seconds
behind it's quite possible that it will still return the old address.
The user will go "WTF, I just changed that!".
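
A minimal illustration, with a made-up table:

  -- made-up schema, only to illustrate the anomaly
  UPDATE users SET street = '1 New Street' WHERE id = 42;  -- runs on the master
  SELECT street FROM users WHERE id = 42;  -- load-balanced to a lagging
                                           -- standby, may still return
                                           -- the old address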

That's the "load balancing" use case, which is quite different from the
"zero data loss on server failure" use case that most people here seem
to be interested in.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:07:23
Message-ID: 4CAED13B.9010800@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 04:01 AM, Fujii Masao wrote:
> Really? I don't think that ko-count=0 means "wait-forever".

Judging from the documentation, I'd also say it doesn't wait forever by
default. However, please note that there are different parameters for
the initial wait for connection during boot-up (wfc-timeout and
degr-wfc-timeout). So you might want to test what happens on a node
failure, not just the absence of a standby.
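
For reference, these knobs live roughly here in drbd.conf (values are only
illustrative and worth double-checking against the DRBD documentation):

  startup {
    wfc-timeout      0;    # initial wait-for-connection at boot; 0 = no limit
    degr-wfc-timeout 60;   # shorter wait if the cluster was already degraded
  }
  net {
    ko-count 0;            # 0 disables kicking out an unresponsive peer
  }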

Regards

Markus Wanner


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:10:49
Message-ID: 4CAED209.4010008@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08.10.2010 06:41, Fujii Masao wrote:
> On Thu, Oct 7, 2010 at 3:01 AM, Markus Wanner<markus(at)bluegap(dot)ch> wrote:
>> Of course, it doesn't make sense to wait-forever on *every* standby that
>> ever gets added. Quorum commit is required, yes (and that's what this
>> thread is about, IIRC). But with quorum commit, adding a standby only
>> improves availability, but certainly doesn't block the master in any
>> way.
>
> But, even with quorum commit, if you choose wait-forever option,
> failover would decrease availability. Right after the failover,
> no standby has connected to new master, so if quorum>= 1, all
> the transactions must wait for a while.

Sure, the new master can't proceed with commits until enough standbys
have connected to it.

> Basically we need to take a base backup from new master to start
> the standbys and make them connect to new master.

Do we really need that? I don't think that's acceptable; we'll need to
fix it if that's the case.

I think you're right, streaming replication doesn't work across timeline
changes. We left that out of 9.0, to keep things simple, but it seems
that we really should fix that for 9.1.

You can cross timelines with the archive, though. But IIRC there was
some issue with that too: you needed to restart the standbys, because the
standby scans what timelines exist at the beginning of recovery and
won't notice new timelines that appear after that?

We need to address that, apart from any of the other things discussed
wrt. synchronous replication. It will benefit asynchronous replication
too. IMHO *that* is the next thing we should do, the next patch we commit.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Josh Berkus <josh(at)agliodbs(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:11:23
Message-ID: 1286525483.2304.510.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-10-08 at 09:52 +0200, Markus Wanner wrote:

> One addendum: a timeout increases availability at the cost of
> increased danger of data loss and higher complexity. Don't use it,
> just increase (N - k) instead.

Completely agree.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:16:08
Message-ID: 4CAED348.2020304@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 05:41 AM, Fujii Masao wrote:
> But, even with quorum commit, if you choose wait-forever option,
> failover would decrease availability. Right after the failover,
> no standby has connected to new master, so if quorum >= 1, all
> the transactions must wait for a while.

That's a point, yes. But again, this is just write-availability; you can
happily read from all active standbys. And connection time is certainly
negligible compared to any kind of timeout (which certainly needs to be
way bigger than a couple of network round-trips).

> Basically we need to take a base backup from new master to start
> the standbys and make them connect to new master. This might take
> a long time. Since transaction commits cannot advance for that time,
> availability would goes down.

Just don't increase your quorum_commit to unreasonable values which your
hardware cannot possibly satisfy. It doesn't make sense to set a
quorum_commit of 1 or more if you don't already have a standby
attached.

Start with 0 (i.e. replication off), then add standbys, then increase
quorum_commit to match your new requirements.
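
Roughly like this (quorum_commit is the parameter proposed in this
thread, not something that exists today):

  quorum_commit = 0   # 1. the new master starts without waiting for anyone
                      # 2. take base backups, start the standbys, let them connect
  quorum_commit = 1   # 3. then raise the requirement (up to k, once k
                      #    standbys are attached)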

> Or you think that wait-forever option is applied only when the
> standby goes down?

That wouldn't work in case of a full-cluster crash, where the
wait-forever option is required again. Otherwise you risk a split-brain
situation.

Regards

Markus Wanner


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Josh Berkus <josh(at)agliodbs(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:18:23
Message-ID: 4CAED3CF.7090503@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08.10.2010 01:25, Simon Riggs wrote:
> On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote:
>
>> To get "non-stale" responses, you can only query those k=3 servers.
>> But you've shot your self in the foot because you don't know which
>> 3/10 those will be. The other 7 *are* stale (by definition). They
>> talk about picking the "caught up" slave when the master fails, but
>> you actually need to do that for *every query*.
>
> There is a big confusion around that point and I need to point out that
> statement isn't accurate. It's taken me a long while to understand this.
>
> Asking for k> 1 does *not* mean those servers are time synchronised.
> All it means is that the master will stop waiting after 3
> acknowledgements. There is no connection between the master receiving
> acknowledgements and the standby applying changes received from master;
> the standbys are all independent of one another.
>
> In a bad case, those 3 acknowledgements might happen say 5 seconds apart
> on the worst and best of the 3 servers. So the first standby to receive
> the data could have applied the changes ~4.8 seconds prior to the 3rd
> standby. There is still a chance of reading stale data on one standby,
> but reading fresh data on another server. In most cases the time window
> is small, but still exists.
>
> The other 7 are stale with respect to the first 3. But then so are the
> last 9 compared with the first one. The value of k has nothing
> whatsoever to do with the time difference between the master and the
> last standby to receive/apply the changes. The gap between first and
> last standby (i.e. N, not k) is the time window during which a query
> might/might not see a particular committed result.
>
> So standbys are eventually consistent whether or not the master relies
> on them to provide an acknowledgement. The only place where you can
> guarantee non-stale data is on the master.

Yes, that's a good point. Synchronous replication for load-balancing
purposes guarantees that when *you* perform a commit, after it finishes
it will be visible in all standbys. But if you run the same query across
different standbys, you're not guaranteed get same results. If you just
pick a random server for every query, you might even see time moving
backwards. Affinity is definitely a good idea for the load-balancing
scenario, but even then the anomaly is possible if you get re-routed to
a different server because the one you were bound to dies.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:25:01
Message-ID: 1286526301.2304.537.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote:
> >
> > Or what kind of customers do you think really need a no-lag solution for
> > read-only queries? In the LAN case, the lag of async rep is negligible
> > and in the WAN case the latencies of sync rep are prohibitive.
>
> There is a very good use case for that particular set up, actually. If
> your hot standby is guaranteed to be up-to-date with any transaction
> that has been committed in the master, you can use the standby
> interchangeably with the master for read-only queries.

This is an important point. It is desirable, but there is no such thing.
We must not take any project decisions based upon that false premise.

Hot Standby is never guaranteed to be up-to-date with the master. There is
no such thing as certainty that you have the same data as the master.

All sync rep gives you is a better durability guarantee that the changes
are safe. It doesn't guarantee those changes are transferred to all
nodes prior to making the data changes on any one standby.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:27:11
Message-ID: 4CAED5DF.4090000@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08.10.2010 11:25, Simon Riggs wrote:
> On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote:
>>>
>>> Or what kind of customers do you think really need a no-lag solution for
>>> read-only queries? In the LAN case, the lag of async rep is negligible
>>> and in the WAN case the latencies of sync rep are prohibitive.
>>
>> There is a very good use case for that particular set up, actually. If
>> your hot standby is guaranteed to be up-to-date with any transaction
>> that has been committed in the master, you can use the standby
>> interchangeably with the master for read-only queries.
>
> This is an important point. It is desirable, but there is no such thing.
> We must not take any project decisions based upon that false premise.
>
> Hot Standby is never guaranteed to be up-to-date with master. There is
> no such thing as certainty that you have the same data as the master.

Synchronous replication in the 'replay' mode is supposed to guarantee
exactly that, no?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:30:35
Message-ID: 4CAED6AB.3040600@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 10:27 AM, Heikki Linnakangas wrote:
> Synchronous replication in the 'replay' mode is supposed to guarantee
> exactly that, no?

The master may lag behind, so it's not strictly speaking the same data.

Regards

Markus Wanner


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 08:33:10
Message-ID: 4CAED746.20806@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 09:56 AM, Heikki Linnakangas wrote:
> Imagine a web application that's mostly read-only, but a
> user can modify his own personal details like name and address, for
> example. Imagine that the user changes his street address and clicks
> 'save', causing an UPDATE, and the next query fetches that information
> again to display to the user.

I don't think that use case justifies sync replication and the
additional network overhead it brings. Latency is low in that case,
okay, but so is the lag for async replication.

Why not tell the load balancer to read from the master for n seconds
after the last write? After that, it should be safe to query the
standbys again.

If the load on the master is the problem, and you want to reduce it by
moving the read-only transactions to the slave, sync replication pretty
certainly won't help you either, because the increased commit latency
actually *increases* the number of concurrent transactions on the master.

Regards

Markus Wanner


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 09:00:31
Message-ID: 1286528431.2304.586.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-10-08 at 11:27 +0300, Heikki Linnakangas wrote:
> On 08.10.2010 11:25, Simon Riggs wrote:
> > On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote:
> >>>
> >>> Or what kind of customers do you think really need a no-lag solution for
> >>> read-only queries? In the LAN case, the lag of async rep is negligible
> >>> and in the WAN case the latencies of sync rep are prohibitive.
> >>
> >> There is a very good use case for that particular set up, actually. If
> >> your hot standby is guaranteed to be up-to-date with any transaction
> >> that has been committed in the master, you can use the standby
> >> interchangeably with the master for read-only queries.
> >
> > This is an important point. It is desirable, but there is no such thing.
> > We must not take any project decisions based upon that false premise.
> >
> > Hot Standby is never guaranteed to be up-to-date with master. There is
> > no such thing as certainty that you have the same data as the master.
>
> Synchronous replication in the 'replay' mode is supposed to guarantee
> exactly that, no?

From the perspective of the person making the change on the master: yes.
If they make the change, wait for commit, then check the value on a
standby, yes it will be there (or a later version).

From the perspective of an observer, randomly selecting a standby for
load balancing purposes: No, they are not guaranteed to see the "latest"
answer, nor even can they find out whether what they are seeing is the
latest answer.

What sync rep does guarantee is that if the person making the change is
told it succeeded (commit) then that change is safe on at least k other
servers. Sync rep is about guarantees of safety, not observability.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 09:02:49
Message-ID: 4CAEDE39.2060803@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 01:44 AM, Greg Smith wrote:
> They'll use Sync Rep to maximize
> the odds a system failure doesn't cause any transaction loss. They'll
> use good quality hardware on the master so it's unlikely to fail.

.."unlikely to fail"?

Ehm.. is that you speaking, Greg? ;-)

> But
> when the database finds the standby unreachable, and it's left with the
> choice between either degrading into async rep or coming to a complete
> halt, you must give people the option of choosing to degrade instead
> after a timeout. Let them set off the red flashing lights, sound the
> alarms, and pray the master doesn't go down until you can fix the
> problem.

Okay, okay, fair enough - if there had been red flashing lights. And
alarms. And bells and whistles. But that's what I'm afraid the timeout
is removing.

> I don't see this as needing any implementation any more complicated than
> the usual way such timeouts are handled. Note how long you've been
> trying to reach the standby. Default to -1 for forever. And if you hit
> the timeout, mark the standby as degraded

..and how do you make sure you are not marking your second standby as
degraded just because it's currently lagging? Effectively degrading the
utterly needed one, because your first standby has just bitten the dust?

And how do you prevent the split brain situation in case the master dies
shortly after these events, but fails to come up again immediately?

Your list of data recovery projects will get larger and the projects
more complicated. Because there's a lot more to it than just the
implementation of a timeout.

Regards

Markus Wanner


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 09:07:36
Message-ID: 4CAEDF58.3020800@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 11:00 AM, Simon Riggs wrote:
> From the perspective of an observer, randomly selecting a standby for
> load balancing purposes: No, they are not guaranteed to see the "latest"
> answer, nor even can they find out whether what they are seeing is the
> latest answer.

I completely agree. The application (or at least the load balancer)
needs to be aware of that fact.

Regards

Markus


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 09:41:00
Message-ID: m2k4ltc98z.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> ..and how do you make sure you are not marking your second standby as
> degraded just because it's currently lagging?

Well, in sync rep, a standby that's not able to stay under the timeout
is degraded. Full stop. The presence of the timeout (or its value not
being -1) means that the admin has chosen this definition.

> Effectively degrading the
> utterly needed one, because your first standby has just bitten the
> dust?

Well, now you have a worst case scenario: first standby is dead and the
remaining one was not able to keep up. You have lost all your master's
failover replacements.

> And how do you prevent the split brain situation in case the master dies
> shortly after these events, but fails to come up again immediately?

Same old story. Either you're able to try to fix the master so that you
don't lose any data and don't even have to check for that, or you take a
risk and start from a non-synced standby. It's availability against
durability all over again.

What I really want us to be able to provide is clear facts, so that
whoever has to make the decision can do so. Meaning, here, that it
should be easy to see that neither of the standbys is in sync at this
point.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 09:57:27
Message-ID: 4CAEEB07.2050404@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 11:41 AM, Dimitri Fontaine wrote:
> Same old story. Either you're able to try and fix the master so that you
> don't lose any data and don't even have to check for that, or you take a
> risk and start from a non synced standby. It's all availability against
> durability again.

..and a whole lot of manual work that's prone to error, for something
that could easily be automated at certainly less than 2000 EUR of initial,
additional cost (if any at all, in case you already have three servers).
Sorry, I still fail to understand that use case.

It reminds me of the customer who wanted to save the cost of a BBU
and ran with fsync=off. Until his server went down due to a power outage.

But yeah, we provide that option as well, yes. Point taken.

Regards

Markus Wanner


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Greg Smith <greg(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 10:05:26
Message-ID: m2bp75c849.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> ..and a whole lot of manual work, that's prone to error for something
> that could easily be automated

So, the master just crashed, first standby is dead and second ain't in
sync. What's the easy and automated way out? Sorry, I need a hand here.

--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 13:46:49
Message-ID: AANLkTinucLJuY9XT65_AHW6OKu8oug6xVx4e8uaTvvkJ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 8, 2010 at 5:07 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> On 10/08/2010 04:01 AM, Fujii Masao wrote:
>> Really? I don't think that ko-count=0 means "wait-forever".
>
> Telling from the documentation, I'd also say it doesn't wait forever by
> default. However, please note that there are different parameters for
> the initial wait for connection during boot up (wfc-timeout and
> degr-wfc-timeout). So you might to test what happens on a node failure,
> not just absence of a standby.

Unfortunately I've already taken down my DRBD environment. As far as
I heard from my colleague who is familiar with DRBD, a standby node
failure doesn't prevent the master from writing data to the DRBD disk
by default. If a DRBD environment becomes available to me again, I'll
try the test.

Also, I'd like to know whether the master waits forever on standby
failure in other solutions such as Oracle DataGuard or MySQL
semi-synchronous replication.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:11:58
Message-ID: 21937.1286547118@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> I don't see this as needing any implementation any more complicated than
> the usual way such timeouts are handled. Note how long you've been
> trying to reach the standby. Default to -1 for forever. And if you hit
> the timeout, mark the standby as degraded and force them to do a proper
> resync when they disconnect. Once that's done, then they can re-enter
> sync rep mode again, via the same process a new node would have done so.

Well, actually, that's *considerably* more complicated than just a
timeout. How are you going to "mark the standby as degraded"? The
standby can't keep that information, because it's not even connected
when the master makes the decision. ISTM that this requires

1. a unique identifier for each standby (not just role names that
multiple standbys might share);

2. state on the master associated with each possible standby -- not just
the ones currently connected.

Both of those are perhaps possible, but the sense I have of the
discussion is that people want to avoid them.

Actually, #2 seems rather difficult even if you want it. Presumably
you'd like to keep that state in reliable storage, so it survives master
crashes. But how you gonna commit a change to that state, if you just
lost every standby (suppose master's ethernet cable got unplugged)?
Looks to me like it has to be reliable non-replicated storage. Leaving
aside the question of how reliable it can really be if not replicated,
it's still the case that we have noplace to put such information given
the WAL-is-across-the-whole-cluster design.

regards, tom lane


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:26:03
Message-ID: AANLkTinNw0omDyskutdLVAyo1UP7OHHa56ovVxG9HYJ3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Do we really need that?

Yes. But if there is no unsent WAL when the master goes down,
we can start a new standby without a new backup by copying the
timeline history file from the new master to the new standby and
setting recovery_target_timeline to 'latest'. In this case, the new
standby advances recovery to the latest timeline ID (the one the new
master uses) before connecting to the master.

This seems to have worked in my test environment, though I may be
missing something.
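
Roughly, the recovery.conf on such a standby would look like this
(connection details are placeholders):

  standby_mode = 'on'
  primary_conninfo = 'host=new-master port=5432'
  recovery_target_timeline = 'latest'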

> I don't think that's acceptable, we'll need to fix
> that if that's the case.

Agreed.

> You can cross timelines with the archive, though. But IIRC there was some
> issue with that too, you needed to restart the standbys because the standby
> scans what timelines exist at the beginning of recovery, and won't notice
> new timelines that appear after that?

Yes.

> We need to address that, apart from any of the other things discussed wrt.
> synchronous replication. It will benefit asynchronous replication too. IMHO
> *that* is the next thing we should do, the next patch we commit.

You mean committing that capability before synchronous replication? If so,
I disagree. That problem is not easy to address, so I'm worried that
implementing that capability first would mean missing sync rep in 9.1.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:26:47
Message-ID: 4CAF2A27.6070904@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 04:11 PM, Tom Lane wrote:
> Actually, #2 seems rather difficult even if you want it. Presumably
> you'd like to keep that state in reliable storage, so it survives master
> crashes. But how you gonna commit a change to that state, if you just
> lost every standby (suppose master's ethernet cable got unplugged)?

IIUC you seem to assume that the master node keeps its master role. But
users who value availability a lot certainly want automatic fail-over,
so any node can potentially be the new master.

After recovery from a full-cluster outage, the first question is which
node was the most recent master (or which former standby is up to date
and could take over).

Regards

Markus Wanner


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:35:17
Message-ID: m2hbgw92hm.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Well, actually, that's *considerably* more complicated than just a
> timeout. How are you going to "mark the standby as degraded"? The
> standby can't keep that information, because it's not even connected
> when the master makes the decision. ISTM that this requires
>
> 1. a unique identifier for each standby (not just role names that
> multiple standbys might share);
>
> 2. state on the master associated with each possible standby -- not just
> the ones currently connected.
>
> Both of those are perhaps possible, but the sense I have of the
> discussion is that people want to avoid them.

What we'd like to avoid is for the users to have to cope with such
needs. Now, if that's internal to the code and automatic, that's not the
same thing at all.

What I'd have in mind is a "Database standby system identifier" that
would be part of the initial handshake in the replication protocol, plus
a system function to be able to "unregister" a standby.
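
For instance (purely hypothetical, no such function exists today):

  SELECT pg_unregister_standby('standby system identifier');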

> Actually, #2 seems rather difficult even if you want it. Presumably
> you'd like to keep that state in reliable storage, so it survives master
> crashes. But how you gonna commit a change to that state, if you just
> lost every standby (suppose master's ethernet cable got unplugged)?

I don't see that as a huge problem myself, because I'm already well sold
on the per-transaction synchronous-replication behaviour. So any change
to that state made by the master would be hard-coded as async. What am I
missing?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:37:12
Message-ID: 4CAF2C98.9020009@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 12:05 PM, Dimitri Fontaine wrote:
> Markus Wanner <markus(at)bluegap(dot)ch> writes:
>> ..and a whole lot of manual work, that's prone to error for something
>> that could easily be automated
>
> So, the master just crashed, first standby is dead and second ain't in
> sync. What's the easy and automated way out? Sorry, I need a hand here.

Thinking this through, I'm realizing that this can potentially work
automatically with three nodes in both cases. Each node needs to keep
track of whether or not it is (or became) the master - and when (a Lamport
timestamp, maybe; not necessarily wall clock). A new master might
continue to commit new transactions after a fail-over, without the old
master being able to record that fact (because it's down).

This means there's a different requirement after a full-cluster crash
(i.e. master failure and no up-to-date standby is available). With the
timeout, you absolutely need the former master to come back up again for
zero data loss, no matter what your quorum_commit setting was. To be
able to automatically tell who was the most recent master, you need to
query the state of all other nodes, because they could be a more recent
master. If that's not possible (or not feasible, because the replacement
part isn't currently available), you are at risk of data loss.

With the given three node scenario, the zero data loss guarantee only
holds true as long as either at least one node (that is in sync) is
running or if you can recover the former master after a full cluster crash.

When waiting forever, you only need one of the k nodes to come back up
again. You also need to query other nodes to find out which k of the N
nodes those are, but being able to recover (N - k + 1) nodes is sufficient
to figure that out. So any (k - 1) nodes may fail, even permanently, at any
point in time, and you are still not at risk of losing data. (Nor at
risk of losing availability, BTW.) I'm still of the opinion that that's
the much easier and clearer guarantee.

Also note that with higher values of N this gets more and more
important, because the chance of being able to recover all N nodes after a
full crash shrinks with increasing N (while the time required to do so
increases). But maybe the current sync rep feature doesn't need to
target setups with that many nodes.

I certainly agree that either way is complicated to implement. With
Postgres-R, I'm clearly going the way that's able to satisfy large
numbers of nodes.

Thanks for an interesting discussion. And for respectful disagreement.

Regards

Markus Wanner


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:38:15
Message-ID: 22421.1286548695@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> On 10/08/2010 04:11 PM, Tom Lane wrote:
>> Actually, #2 seems rather difficult even if you want it. Presumably
>> you'd like to keep that state in reliable storage, so it survives master
>> crashes. But how you gonna commit a change to that state, if you just
>> lost every standby (suppose master's ethernet cable got unplugged)?

> IIUC you seem to assume that the master node keeps its master role. But
> users who value availability a lot certainly want automatic fail-over,

Huh? Surely loss of the slaves shouldn't force a failover. Maybe the
slaves really are all dead.

regards, tom lane


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:43:49
Message-ID: 4CAF2E25.1020302@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 04:38 PM, Tom Lane wrote:
> Markus Wanner <markus(at)bluegap(dot)ch> writes:
>> IIUC you seem to assume that the master node keeps its master role. But
>> users who value availability a lot certainly want automatic fail-over,
>
> Huh? Surely loss of the slaves shouldn't force a failover. Maybe the
> slaves really are all dead.

I think we are talking across each other. I'm speaking about the need to
be able to fail-over to a standby in case the master fails.

In case of a full-cluster crash after such a fail-over, you need to take
care you don't enter split brain. Some kind of STONITH, lamport clock,
or what not. Figuring out which node has been the most recent (and thus
most up to date) master is far from trivial.

(See also my mail in answer to Dimitri a few minutes ago).

Regards

Markus Wanner


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:47:11
Message-ID: 1286549231.2304.947.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-10-08 at 10:11 -0400, Tom Lane wrote:

> 1. a unique identifier for each standby (not just role names that
> multiple standbys might share);

That is difficult because each standby is identical. If a standby goes
down, people can regenerate a new standby by taking a copy from another
standby. What number do we give this new standby?...

> 2. state on the master associated with each possible standby -- not just
> the ones currently connected.
>
> Both of those are perhaps possible, but the sense I have of the
> discussion is that people want to avoid them.

Yes, I really want to avoid such issues and the complexities we would
likely get into trying to solve them. In reality these cases should not be
common, because they only happen if the sysadmin has not configured a
sufficient number of redundant standbys.

My proposed design is that the timeout does not cause the standby to be
"marked as degraded". It is up to the user to decide whether they wait,
or whether they progress without sync rep. Or the sysadmin can release the
waiters via a function call.
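
For instance (the function is hypothetical, named only to illustrate the
idea):

  SELECT pg_release_sync_rep_waiters();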

If the cluster does become degraded, the sysadmin just generates a new
standby, plugs it back into the cluster, and away we go. Simple, no
state to be recorded and no state to get screwed up either. I don't
think we should spend too much time trying to help people who say
they want additional durability guarantees but do not match that with
sufficient hardware resources to make it happen smoothly.

If we do try to tackle those problems, who will be able to validate that
our code actually works?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:48:54
Message-ID: AANLkTinApo+opo35EUQSUQDZB2UhDki4YH8C1MBNnGmm@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 8, 2010 at 5:16 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> On 10/08/2010 05:41 AM, Fujii Masao wrote:
>> But, even with quorum commit, if you choose wait-forever option,
>> failover would decrease availability. Right after the failover,
>> no standby has connected to new master, so if quorum >= 1, all
>> the transactions must wait for a while.
>
> That's a point, yes. But again, this is just write-availability; you can
> happily read from all active standbys.

I believe many systems require write-availability.

>> Basically we need to take a base backup from new master to start
>> the standbys and make them connect to new master. This might take
>> a long time. Since transaction commits cannot advance for that time,
>> availability would go down.
>
> Just don't increase your quorum_commit to unreasonable values which your
> hardware cannot possibly satisfy. It doesn't make sense to set a
> quorum_commit of 1 or even bigger, if you don't already have a standby
> attached.
>
> Start with 0 (i.e. replication off), then add standbys, then increase
> quorum_commit to your new requirements.

No. This only makes the procedure of failover more complex.

>> Or you think that wait-forever option is applied only when the
>> standby goes down?
>
> That wouldn't work in case of a full-cluster crash, where the
> wait-forever option is required again. Otherwise you risk a split-brain
> situation.

What is a full-cluster crash? Why does it cause a split-brain?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:55:49
Message-ID: AANLkTinqN45iL+a9Rm7O0PDK3Zu_1vxnjP5qqWXXaex3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> From the perspective of an observer, randomly selecting a standby for
> load balancing purposes: No, they are not guaranteed to see the "latest"
> answer, nor even can they find out whether what they are seeing is the
> latest answer.

To guarantee that each standby returns the same result, we would need to
use the cluster-wide snapshot to run queries. IIRC, Postgres-XC provides
that feature. Though I'm not sure if it can be applied in HS/SR.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 14:57:21
Message-ID: 1286549841.2304.948.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-10-08 at 23:55 +0900, Fujii Masao wrote:
> On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > From the perspective of an observer, randomly selecting a standby for
> > load balancing purposes: No, they are not guaranteed to see the "latest"
> > answer, nor even can they find out whether what they are seeing is the
> > latest answer.
>
> To guarantee that each standby returns the same result, we would need to
> use the cluster-wide snapshot to run queries. IIRC, Postgres-XC provides
> that feature. Though I'm not sure if it can be applied in HS/SR.

That is my understanding.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 15:06:26
Message-ID: 4CAF3372.8040508@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 04:47 PM, Simon Riggs wrote:
> Yes, I really want to avoid such issues and the complexities we would
> likely get into trying to solve them. In reality they should not be
> common, because they only happen if the sysadmin has not configured a
> sufficient number of redundant standbys.

Well, full cluster outages are infrequent, but sadly cannot be avoided
entirely. (Murphy's laughing). IMO we should be prepared to deal with
those. Or am I understanding you wrongly here?

> I don't
> think we should be spending too much time trying to help people that say
> they want additional durability guarantees but do not match that with
> sufficient hardware resources to make it happen smoothly.

I fully agree with that statement.

Regards

Markus Wanner


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 15:12:49
Message-ID: 4CAF34F1.8000502@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/08/2010 04:48 PM, Fujii Masao wrote:
> I believe many systems require write-availability.

Sure. Make sure you have enough standbys to fail over to.

(I think there are even more situations where read-availability is much
more important, though).

>> Start with 0 (i.e. replication off), then add standbys, then increase
>> quorum_commit to your new requirements.
>
> No. This only makes the procedure of failover more complex.

Huh? This doesn't affect fail-over at all. Quite the opposite, the
guarantees and requirements remain the same even after a fail-over.

> What is a full-cluster crash?

The event that all of your cluster nodes are down (most probably due to
power failure, but fires or other catastrophic events can be other
causes). Chances for that to happen can certainly be reduced by
distributing to distant locations, but that equally certainly increases
latency, which isn't always an option.

> Why does it cause a split-brain?

First master node A fails, a standby B takes over, but then fails as
well. Let node C take over. Then the power aggregate catches fire, the
infamous full-cluster crash (where "lights out management" gets a
completely new meaning ;-) ).

Split brain would be the situation that arises if all three nodes (A, B
and C) start up again and think they have been the former master, so
they can now continue to apply new transactions. Their data diverges,
leading to what could be seen as a split-brain from the outside.

Obviously, you must disallow A and B to take the role of the master
after recovery. Ideally, C would continue as the master. However, if the
fire destroyed node C, let's hope you had another (sync!) standby that
can act as the new master. Otherwise you've lost data.

Hope that explains it. Wikipedia certainly provides a better (and less
Postgres colored) explanation.

Regards

Markus Wanner


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 16:41:08
Message-ID: 4CAF49A4.8070204@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> And, I'd like to know whether the master waits forever because of the
> standby failure in other solutions such as Oracle DataGuard, MySQL
> semi-synchronous replication.

MySQL used to be fond of simply failing silently. Not sure what 5.4
does, or Oracle. In any case MySQL's replication has always really been
async (except Cluster, which is a very different database), so it's not
really a comparison.

Here are the comparables:
Oracle DataGuard
DRBD
SQL Server
DB2

If anyone knows what the above do by default, please speak up!

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Rob Wultsch <wultsch(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 18:33:43
Message-ID: AANLkTikshMusKT0HCYuVyaXFHipRKLR+7TF3Aw0MFYH1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

*

On 10/8/10, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Do we really need that?
>
> Yes. But if there is no unsent WAL when the master goes down,
> we can start new standby without new backup by copying the
> timeline history file from new master to new standby and
> setting recovery_target_timeline to 'latest'. In this case,
> new standby advances the recovery to the latest timeline ID
> which new master uses before connecting to the master.
>
> This seems to have been successful in my test environment.
> Though I might be missing something.
>
>> I don't think that's acceptable, we'll need to fix
>> that if that's the case.
>
> Agreed.
>
>> You can cross timelines with the archive, though. But IIRC there was some
>> issue with that too, you needed to restart the standbys because the
>> standby
>> scans what timelines exist at the beginning of recovery, and won't notice
>> new timelines that appear after that?
>
> Yes.
>
>> We need to address that, apart from any of the other things discussed wrt.
>> synchronous replication. It will benefit asynchronous replication too.
>> IMHO
>> *that* is the next thing we should do, the next patch we commit.
>
> You mean to commit that capability before synchronous replication? If so,
> I disagree with you. I think that it's not easy to address that problem.
> So I'm worried that implementing that capability first would mean missing
> sync rep in 9.1.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>

--
Rob Wultsch
wultsch(at)gmail(dot)com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 19:31:58
Message-ID: 4CAF71AE.5040006@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08.10.2010 17:26, Fujii Masao wrote:
> On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Do we really need that?
>
> Yes. But if there is no unsent WAL when the master goes down,
> we can start new standby without new backup by copying the
> timeline history file from new master to new standby and
> setting recovery_target_timeline to 'latest'.

.. and restart the standby.

> In this case,
> new standby advances the recovery to the latest timeline ID
> which new master uses before connecting to the master.
>
> This seems to have been successful in my test environment.
> Though I might be missing something.

Yeah, that should work, but it's awfully complicated.

>> I don't think that's acceptable, we'll need to fix
>> that if that's the case.
>
> Agreed.
>
>> You can cross timelines with the archive, though. But IIRC there was some
>> issue with that too, you needed to restart the standbys because the standby
>> scans what timelines exist at the beginning of recovery, and won't notice
>> new timelines that appear after that?
>
> Yes.
>
>> We need to address that, apart from any of the other things discussed wrt.
>> synchronous replication. It will benefit asynchronous replication too. IMHO
>> *that* is the next thing we should do, the next patch we commit.
>
> You mean to commit that capability before synchronous replication? If so,
> I disagree with you. I think that it's not easy to address that problem.
> So I'm worried that implementing that capability first would mean missing
> sync rep in 9.1.

It's a pretty severe shortcoming at the moment. For starters, it means
that you need a shared archive, even if you set wal_keep_segments to a
high number. Secondly, it's a lot of scripting to get it working, I
don't like the thought of testing failovers in synchronous replication
if I have to do all that. Frankly, this seems more important to me than
synchronous replication.

It shouldn't be too hard to fix. Walsender needs to be able to read WAL
from preceding timelines, like recovery does, and walreceiver needs to
write the incoming WAL to the right file.
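
Just to illustrate the shape of the walsender-side problem, here is a
minimal sketch (all types and names invented here, nothing from the actual
backend) of deciding which timeline's WAL a requested location falls on,
given a history of (timeline, switch point) pairs:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* byte position in the WAL stream */

/* One entry of a timeline history: this timeline was switched away from
 * at 'end'; the newest timeline has end = UINT64_MAX. */
typedef struct
{
    uint32_t    tli;
    XLogRecPtr  end;
} TimeLineEntry;

/* Return the timeline whose WAL contains 'ptr'.  The history array is
 * ordered from oldest to newest timeline. */
static uint32_t
tli_for_location(const TimeLineEntry *history, int n, XLogRecPtr ptr)
{
    for (int i = 0; i < n; i++)
    {
        if (ptr < history[i].end)
            return history[i].tli;
    }
    return history[n - 1].tli;  /* newest timeline covers the rest */
}

int
main(void)
{
    TimeLineEntry history[] = {
        { 1, 0x10000000 },      /* timeline 1 ended here */
        { 2, 0x18000000 },      /* timeline 2 ended here */
        { 3, UINT64_MAX },      /* current timeline */
    };
    XLogRecPtr ask = 0x12345678;

    printf("location %llX lives on timeline %u\n",
           (unsigned long long) ask,
           tli_for_location(history, 3, ask));
    return 0;
}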

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 20:04:05
Message-ID: 4CAF7935.7050001@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner wrote:
> ..and how do you make sure you are not marking your second standby as
> degraded just because it's currently lagging? Effectively degrading the
> utterly needed one, because your first standby has just bitten the dust?
>

People are going to monitor the standby lag. If it grows to the point of
approaching the known timeout, the flashing yellow lights should go off
then, before things get that bad. And if you've set a reasonable,
business-oriented timeout on how long you can stand for the master to be
held up waiting for a lagging standby, the right thing to do may very well
be to cut that standby off. At some point people will want to stop waiting
for a standby if it's taking so long to commit that it's interfering with
the ability of the master to operate normally. Such a master is already
degraded, if your performance metrics for availability include processing
transactions in a timely manner.

> And how do you prevent the split brain situation in case the master dies
> shortly after these events, but fails to come up again immediately?
>

How is that a new problem? It's already possible to end up with a
standby pair that has suffered through some bizarre failure chain such
that it's not necessarily obvious which of the two systems has the most
recent set of data on it. And that's not this project's problem to
solve. Useful answers to the split-brain problem involve fencing
implementations that normally drop to the hardware level, and clustering
solutions that include those features are already available for PostgreSQL
to integrate with. Assuming you have to solve this in order to deliver a
useful database replication component is excessively ambitious.

You seem to be under the assumption that a more complicated replication
implementation here will make reaching a bad state impossible. I think
that's optimistic, both in theory and with regard to how successful code
gets built. Here's the thing: the difficulty of testing to prove your
code actually works is also proportional to that complexity. This
project can choose to commit and potentially ship a simple solution that
has known limitations, and expect that people will fill in the gap with
existing add-on software to handle the clustering parts it doesn't:
fencing, virtual IP address assignment, etc. All while getting useful
testing feedback on the simple bottom layer, whose main purpose in life
is to transport WAL data synchronously. Or, we can argue in favor of
adding additional complexity on top first instead, so we end up with
layers and layers of untested code. That path leads to situations where
you're lucky to ship at all, and when you do the result is difficult to
support.

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 20:34:47
Message-ID: 4CAF8067.9090406@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> How are you going to "mark the standby as degraded"? The
> standby can't keep that information, because it's not even connected
> when the master makes the decision.

From a high level, I'm assuming only that the master has a list in
memory of the standby system(s) it believes are up to date, and that it
is supposed to commit to synchronously. When I say mark as degraded, I
mean that the master merely closes whatever communications channel it
had open with that system and removes the standby from that list.

If that standby now reconnects again, I don't see how resolving what
happens at that point is any different from when a standby is first
started after both systems were turned off. If the standby is current
with the data available on the master when it has an initial
conversation, great; it's now available for synchronous commit too
then. If it's not, it goes into a catchup mode first instead. When the
master sees you're back to current again, if you're on the list of sync
servers too you go back onto the list of active sync systems.

There shouldn't be any state information to save here. If the master
and standby can't figure out if they are in or out of sync with one
another based on the conversation they have when they first connect to
one another, that suggests to me there needs to be improvements made in
the communications protocol they use to exchange messages.
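
As a toy sketch of the bookkeeping described above (every name here is
invented for illustration, not taken from any patch), the master-side list
could be as dumb as this:

#include <stdio.h>

typedef enum { SLOT_EMPTY, SLOT_CATCHUP, SLOT_SYNC } StandbyState;

typedef struct
{
    StandbyState state;
    char         name[64];
} StandbySlot;

#define MAX_STANDBYS 8
static StandbySlot slots[MAX_STANDBYS];

/* A standby connects: it starts in catch-up, not yet eligible for sync. */
static int
standby_connected(const char *name)
{
    for (int i = 0; i < MAX_STANDBYS; i++)
        if (slots[i].state == SLOT_EMPTY)
        {
            slots[i].state = SLOT_CATCHUP;
            snprintf(slots[i].name, sizeof(slots[i].name), "%s", name);
            return i;
        }
    return -1;
}

/* Standby reports it has replayed up to the master's current position. */
static void
standby_caught_up(int i)
{
    slots[i].state = SLOT_SYNC;     /* now counts towards sync commits */
}

/* "Mark as degraded": just drop it from the list; no durable state. */
static void
standby_disconnected(int i)
{
    slots[i].state = SLOT_EMPTY;
}

static int
sync_standby_count(void)
{
    int n = 0;
    for (int i = 0; i < MAX_STANDBYS; i++)
        if (slots[i].state == SLOT_SYNC)
            n++;
    return n;
}

int
main(void)
{
    int a = standby_connected("standby1");
    standby_caught_up(a);
    printf("sync standbys: %d\n", sync_standby_count());
    standby_disconnected(a);            /* keepalive/timeout fired */
    printf("sync standbys: %d\n", sync_standby_count());
    return 0;
}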

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 20:47:27
Message-ID: 1286570847.2304.1015.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-10-08 at 17:06 +0200, Markus Wanner wrote:
> Well, full cluster outages are infrequent, but sadly cannot be avoided
> entirely. (Murphy's laughing). IMO we should be prepared to deal with
> those.

I've described how I propose to deal with those. I'm not waving away
these issues, just proposing that we consciously choose simplicity and
therefore robustness.

Let me say it again for clarity. (This is written for the general case,
though my patch uses only k=1 i.e. one acknowledgement):

If we want robustness, we have multiple standbys. So if you lose one,
you continue as normal without interruption. That is the first and most
important line of defence - not software.

At the point where a commit would start to wait, if there aren't sufficient
active standbys to acknowledge it, the commit won't wait at all. This behaviour helps
us avoid situations where we are hours or days away from having a
working standby to acknowledge the commit. We've had a long debate about
servers that "ought to be there" but aren't; I suggest we treat standbys
that aren't there as having a strong possibility they won't come back,
and hence not worth waiting for. Heikki disagrees; I have no problem
with adding server registration so that we can add additional waits, but
I doubt that the majority of users prefer waiting over availability. It
can be an option.

Once we are waiting, if insufficient standbys acknowledge the commit we
will wait until the timeout expires, after which we commit and continue
working. If you don't like timeouts, set the timeout to 0 to wait
forever. This behaviour is designed to emphasise availability. (I
acknowledge that some people are so worried by data loss that they would
choose to stop changes altogether, and accept unavailability; I regard
that as a minority use case, but one which I would not argue against
including as an option at some point in the future.)
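
To make that concrete, here is a toy sketch of the wait using a POSIX
condition variable: the committing backend waits for an acknowledgement, a
timeout of 0 means wait forever, and a separate call can release the
waiters early. The names and the threading model are purely illustrative,
not how the backend actually works:

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ack_arrived = PTHREAD_COND_INITIALIZER;
static bool            acked;      /* set when enough standbys have acked */
static bool            released;   /* set by the "release the waiters" call
                                     * (left sticky here for simplicity) */

/* Wait for an ack; timeout_ms = 0 means wait forever.  Returns true if
 * the commit was acknowledged, false if we gave up and continued. */
bool
wait_for_sync_ack(long timeout_ms)
{
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&lock);
    while (!acked && !released)
    {
        int rc;

        if (timeout_ms == 0)
            rc = pthread_cond_wait(&ack_arrived, &lock);
        else
            rc = pthread_cond_timedwait(&ack_arrived, &lock, &deadline);
        if (rc != 0)
            break;                  /* timed out: commit and move on */
    }
    bool ok = acked;
    pthread_mutex_unlock(&lock);
    return ok;
}

/* What a "release the waiters" administrator function might boil down to. */
void
release_sync_waiters(void)
{
    pthread_mutex_lock(&lock);
    released = true;
    pthread_cond_broadcast(&ack_arrived);
    pthread_mutex_unlock(&lock);
}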

To cover Dimitri's observation that when a streaming standby first
connects it might take some time before it can sensibly acknowledge, we
don't activate the standby until it has caught up. Once caught up, it
will advertise its capability to offer a sync rep service. Standbys
that don't wish to be failover targets can set
synchronous_replication_service = off.

The paths between servers aren't defined explicitly, so the parameters
all still work even after failover.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-08 21:05:09
Message-ID: 1286571909.2304.1026.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-10-08 at 16:34 -0400, Greg Smith wrote:
> Tom Lane wrote:
> > How are you going to "mark the standby as degraded"? The
> > standby can't keep that information, because it's not even connected
> > when the master makes the decision.
>
> From a high level, I'm assuming only that the master has a list in
> memory of the standby system(s) it believes are up to date, and that it
> is supposed to commit to synchronously. When I say mark as degraded, I
> mean that the master merely closes whatever communications channel it
> had open with that system and removes the standby from that list.

My current coding works with two sets of parameters:

The "master marks standby as degraded" is handled by the TCP keepalives.
When it notices no response, it kicks out the standby. We already had
this, so I never mentioned it before as being part of the solution.

The second part is synchronous_replication_timeout, a user-settable
parameter defining how long the app is prepared to wait, which could be
more or less time than the keepalives.
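
For reference, noticing a vanished standby via keepalives is plain
socket-level configuration; a rough sketch (the TCP_KEEP* option names are
Linux-specific, numbers are up to the admin):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive probing on an already-connected socket so that a
 * vanished peer is noticed after roughly idle_s + count * interval_s. */
int
enable_keepalives(int sock, int idle_s, int interval_s, int count)
{
    int on = 1;

    if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}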

> If that standby now reconnects again, I don't see how resolving what
> happens at that point is any different from when a standby is first
> started after both systems were turned off. If the standby is current
> with the data available on the master when it has an initial
> conversation, great; it's now available for synchronous commit too
> then. If it's not, it goes into a catchup mode first instead. When the
> master sees you're back to current again, if you're on the list of sync
> servers too you go back onto the list of active sync systems.
>
> There shouldn't be any state information to save here. If the master
> and standby can't figure out if they are in or out of sync with one
> another based on the conversation they have when they first connect to
> one another, that suggests to me there needs to be improvements made in
> the communications protocol they use to exchange messages.

Agreed.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-11 11:47:30
Message-ID: 4CB2F952.8050908@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg,

to me it looks like we have very similar goals, but start from different
preconditions. I absolutely agree with you given the preconditions you
named.

On 10/08/2010 10:04 PM, Greg Smith wrote:
> How is that a new problem? It's already possible to end up with a
> standby pair that has suffered through some bizarre failure chain such
> that it's not necessarily obvious which of the two systems has the most
> recent set of data on it. And that's not this project's problem to
> solve.

Thanks for pointing that out. I think that might not have been clear to
me. This limitation of scope certainly makes sense for the Postgres
project in general.

Regards

Markus Wanner


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 04:43:57
Message-ID: AANLkTikb3xu9pQwHrm6gxjcrZXbpxK5PhZWPnyZj6yHE@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 9, 2010 at 12:12 AM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
> On 10/08/2010 04:48 PM, Fujii Masao wrote:
>> I believe many systems require write-availability.
>
> Sure. Make sure you have enough standbys to fail over to.

Unfortunately even enough standbys don't increase write-availability
unless you choose wait-forever. Because, after promoting one of
standbys to new master, you must keep all the transactions waiting
until at least one standby has connected to and caught up with new
master. Currently this wait time is not short.

> (I think there are even more situations where read-availability is much
> more important, though).

Even so, we should not ignore the write-availability aspect.

>>> Start with 0 (i.e. replication off), then add standbys, then increase
>>> quorum_commit to your new requirements.
>>
>> No. This only makes the procedure of failover more complex.
>
> Huh? This doesn't affect fail-over at all. Quite the opposite, the
> guarantees and requirements remain the same even after a fail-over.

Hmm.. that increases the number of procedures which the users must
perform at the failover. At least, the users seem to have to wait
until the standby has caught up with new master, increase quorum_commit
and then reload the configuration file.

>> What is a full-cluster crash?
>
> The event that all of your cluster nodes are down (most probably due to
> power failure, but fires or other catastrophic events can be other
> causes). Chances for that to happen can certainly be reduced by
> distributing to distant locations, but that equally certainly increases
> latency, which isn't always an option.

Yep.

>> Why does it cause a split-brain?
>
> First master node A fails, a standby B takes over, but then fails as
> well. Let node C take over. Then the power aggregate catches fire, the
> infamous full-cluster crash (where "lights out management" gets a
> completely new meaning ;-) ).
>
> Split brain would be the situation that arises if all three nodes (A, B
> and C) start up again and think they have been the former master, so
> they can now continue to apply new transactions. Their data diverges,
> leading to what could be seen as a split-brain from the outside.
>
> Obviously, you must disallow A and B to take the role of the master
> after recovery.

Yep. Something like STONITH would be required.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 04:47:31
Message-ID: AANLkTi=t5ZmYC-fabXQKVxm=BBWGTy2smF9=UnAvghrE@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 9, 2010 at 1:41 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> And, I'd like to know whether the master waits forever because of the
>> standby failure in other solutions such as Oracle DataGuard, MySQL
>> semi-synchronous replication.
>
> MySQL used to be fond of simply failing silently.  Not sure what 5.4 does,
> or Oracle.  In any case MySQL's replication has always really been async
> (except Cluster, which is a very different database), so it's not really a
> comparison.

IIRC, MySQL *semi-synchronous* replication is not async, so it can be a
comparison. Of course, MySQL's default replication is async, though.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 05:21:32
Message-ID: AANLkTimaqh=zcjiKkRH2mZD2xA9h9QEOv0ec2mh=BODD@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Yes. But if there is no unsent WAL when the master goes down,
>> we can start new standby without new backup by copying the
>> timeline history file from new master to new standby and
>> setting recovery_target_timeline to 'latest'.
>
> .. and restart the standby.

Yes.

> It's a pretty severe shortcoming at the moment. For starters, it means that
> you need a shared archive, even if you set wal_keep_segments to a high
> number. Secondly, it's a lot of scripting to get it working, I don't like
> the thought of testing failovers in synchronous replication if I have to do
> all that. Frankly, this seems more important to me than synchronous
> replication.

There seems to be a difference in outlook between us. I prefer sync rep,
but I'm OK with addressing that first if it's not hard.

> It shouldn't be too hard to fix. Walsender needs to be able to read WAL from
> preceding timelines, like recovery does, and walreceiver needs to write the
> incoming WAL to the right file.

And walsender seems to need to transfer the current timeline history to
the standby. Otherwise, the standby cannot recover the WAL file with new
timeline. And the standby might need to create the timeline history file
in order to recover the WAL file with new timeline even after it's restarted.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 06:43:36
Message-ID: 4CB55518.9060107@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13.10.2010 08:21, Fujii Masao wrote:
> On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> It shouldn't be too hard to fix. Walsender needs to be able to read WAL from
>> preceding timelines, like recovery does, and walreceiver needs to write the
>> incoming WAL to the right file.
>
> And walsender seems to need to transfer the current timeline history to
> the standby. Otherwise, the standby cannot recover the WAL file with new
> timeline. And the standby might need to create the timeline history file
> in order to recover the WAL file with new timeline even after it's restarted.

Yes, true, you need that too.

It might be good to divide this work into two phases, teaching archive
recovery to notice new timelines appearing in the archive first, and
doing the walsender/walreceiver changes after that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 06:50:06
Message-ID: AANLkTikfUgHhs=sBP2N1xPQZ4TECgC6wE3CU1JZBrpJj@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 13, 2010 at 2:43 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 13.10.2010 08:21, Fujii Masao wrote:
>>
>> On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>>
>>> It shouldn't be too hard to fix. Walsender needs to be able to read WAL
>>> from
>>> preceding timelines, like recovery does, and walreceiver needs to write
>>> the
>>> incoming WAL to the right file.
>>
>> And walsender seems to need to transfer the current timeline history to
>> the standby. Otherwise, the standby cannot recover the WAL file with new
>> timeline. And the standby might need to create the timeline history file
>> in order to recover the WAL file with new timeline even after it's
>> restarted.
>
> Yes, true, you need that too.
>
> It might be good to divide this work into two phases, teaching archive
> recovery to notice new timelines appearing in the archive first, and doing
> the walsender/walreceiver changes after that.

There's another problem here we should think about, too. Suppose you
have a master and two standbys. The master dies. You promote one of
the standbys, which turns out to be behind the other. You then
repoint the other standby at the one you promoted. Congratulations,
your database is now very possibly corrupt, and you may very well get
no warning of that fact. It seems to me that we would be well-advised
to install some kind of bullet-proof safeguard against this kind of
problem, so that you will KNOW that the standby needs to be re-synced.
I mention this because I have a vague feeling that timelines are
supposed to prevent you from getting different WAL histories confused
with each other, but they don't actually cover all the cases that can
happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 08:05:14
Message-ID: 4CB5683A.1000300@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/13/2010 06:43 AM, Fujii Masao wrote:
> Unfortunately even enough standbys don't increase write-availability
> unless you choose wait-forever. Because, after promoting one of
> standbys to new master, you must keep all the transactions waiting
> until at least one standby has connected to and caught up with new
> master. Currently this wait time is not short.

Why is that? Don't the standbys just have to switch from one walsender
to another? If there's any significant delay in switching, this either
hurts availability or robustness, yes.

> Hmm.. that increases the number of procedures which the users must
> perform at the failover.

I only consider fully automated failover. However, you seem to be
worried about the initial setup of sync rep.

> At least, the users seem to have to wait
> until the standby has caught up with new master, increase quorum_commit
> and then reload the configuration file.

For switching from a single node to a sync replication setup with one or
more standbys, that seems reasonable. There are way more components you
need to set up or adjust in such a case (network, load balancer, alerting
system and maybe even the application itself).

There's really no other option, if you want the kind of robustness
guarantee that sync rep with wait forever provides. OTOH, if you just
replicate to whatever standby is there and don't care much if it isn't,
the admin doesn't need to worry much about quorum_commit - it doesn't
have much of an effect anyway.

Regards

Markus Wanner


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 09:04:44
Message-ID: AANLkTinJmX6qWyzunahChkdrmYSK_C-S+WFw91gWu+FA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 13, 2010 at 3:43 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 13.10.2010 08:21, Fujii Masao wrote:
>>
>> On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>>
>>> It shouldn't be too hard to fix. Walsender needs to be able to read WAL
>>> from
>>> preceding timelines, like recovery does, and walreceiver needs to write
>>> the
>>> incoming WAL to the right file.
>>
>> And walsender seems to need to transfer the current timeline history to
>> the standby. Otherwise, the standby cannot recover the WAL file with new
>> timeline. And the standby might need to create the timeline history file
>> in order to recover the WAL file with new timeline even after it's
>> restarted.
>
> Yes, true, you need that too.
>
> It might be good to divide this work into two phases, teaching archive
> recovery to notice new timelines appearing in the archive first, and doing
> the walsender/walreceiver changes after that.

OK. In detail,

1. After failover, when the standby connects to new master, walsender transfers
the current timeline history in the handshake processing.

2. If the timeline history in the master is inconsistent with that in the
standby, walreceiver terminates the replication connection.

3. Walreceiver creates the timeline history file.

4. Walreceiver signals the change of timeline history to startup process and
makes it read the timeline history file. After this, startup process tries
to recover the WAL files with even new timeline ID.

5. After the handshake, walsender sends the WAL from preceding timelines,
like recovery does, and walreceiver writes the incoming WAL to the right
file.

Am I missing something?
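
As a toy illustration of steps 2 and 3 on the walreceiver side (structures
and file format invented here, not the real timeline history format or any
actual backend code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t tli;       /* timeline id */
    uint64_t switchpt;  /* WAL location where this timeline ended */
} TliHistEntry;

/* Step 2: the master's history must contain our own history as a prefix;
 * otherwise the two servers have diverged and we must disconnect. */
static bool
history_is_consistent(const TliHistEntry *local, int nlocal,
                      const TliHistEntry *master, int nmaster)
{
    if (nlocal > nmaster)
        return false;
    for (int i = 0; i < nlocal; i++)
        if (local[i].tli != master[i].tli ||
            local[i].switchpt != master[i].switchpt)
            return false;
    return true;
}

/* Step 3: persist the received history so recovery can cross timelines
 * even after a restart (the format here is purely illustrative). */
static int
write_history_file(const char *path, const TliHistEntry *hist, int n)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        fprintf(f, "%u\t%016llX\n",
                hist[i].tli, (unsigned long long) hist[i].switchpt);
    return fclose(f);
}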

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 09:22:41
Message-ID: AANLkTimGC3i2=dge3EA3cZiNv_yHf3GHz7VtUqPOB7T_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> There's another problem here we should think about, too.  Suppose you
> have a master and two standbys.  The master dies.  You promote one of
> the standbys, which turns out to be behind the other.  You then
> repoint the other standby at the one you promoted.  Congratulations,
> your database is now very possibly corrupt, and you may very well get
> no warning of that fact.  It seems to me that we would be well-advised
> to install some kind of bullet-proof safeguard against this kind of
> problem, so that you will KNOW that the standby needs to be re-synced.

Yep. This is why I said it's not easy to implement that.

To start the standby without taking a base backup from new master after
failover, the user basically has to promote the standby which is ahead
of the other standbys (e.g., by comparing pg_last_xlog_replay_location
on each standby).

As a safeguard, we seem to need to compare the location of the timeline
switch on the master with the last replay location on the standby.
If the latter location is ahead AND the timeline ID of the standby is not
the same as that of the master, we should emit a warning and terminate the
replication connection.
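
Expressed as a check, the idea is roughly the following (a sketch only;
all names are made up):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Reject the standby if it has already replayed WAL beyond the point at
 * which the new master switched timelines, while still being on a
 * different (older) timeline: its history has diverged from the master's. */
static bool
standby_must_resync(XLogRecPtr master_switch_lsn, uint32_t master_tli,
                    XLogRecPtr standby_replay_lsn, uint32_t standby_tli)
{
    return standby_replay_lsn > master_switch_lsn &&
           standby_tli != master_tli;
}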

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-13 21:44:00
Message-ID: AANLkTikccYZmZCBs6U912_r0XKYD16Ynx1F-k8VZRoBW@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 13, 2010 at 5:22 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> There's another problem here we should think about, too.  Suppose you
>> have a master and two standbys.  The master dies.  You promote one of
>> the standbys, which turns out to be behind the other.  You then
>> repoint the other standby at the one you promoted.  Congratulations,
>> your database is now very possibly corrupt, and you may very well get
>> no warning of that fact.  It seems to me that we would be well-advised
>> to install some kind of bullet-proof safeguard against this kind of
>> problem, so that you will KNOW that the standby needs to be re-synced.
>
> Yep. This is why I said it's not easy to implement that.
>
> To start the standby without taking a base backup from new master after
> failover, the user basically has to promote the standby which is ahead
> of the other standbys (e.g., by comparing pg_last_xlog_replay_location
> on each standby).
>
> As the safeguard, we seem to need to compare the location at the switch
> of the timeline on the master with the last replay location on the standby.
> If the latter location is ahead AND the timeline ID of the standby is not
> the same as that of the master, we should emit warning and terminate the
> replication connection.

That doesn't seem very bullet-proof. You can accidentally corrupt a
standby even when only one time-line is involved. AFAIK, stopping a
standby, removing recovery.conf, and starting it up again does not
change time lines. You can even shut down the standby, bring it up as
a master, generate a little WAL, shut it back down, and bring it back
up as a standby pointing to the same master. It would be nice to
embed in each checkpoint record an identifier that changes randomly on
each transition to normal running, so that if you do something like
this we can notice and complain loudly.
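
Sketching the idea, not proposing an actual WAL format change: stamp a
random identifier at each transition to normal running and carry it in
checkpoint records, so a standby can notice when its WAL source has
changed incarnation. All names below are invented for illustration:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical extra field carried in every checkpoint record. */
typedef struct
{
    uint64_t redo_lsn;      /* existing checkpoint contents, elided */
    uint64_t run_id;        /* new: changes on each startup into normal running */
} CheckPointSketch;

static uint64_t current_run_id;

/* Called once when the server leaves recovery and starts normal running. */
static void
assign_new_run_id(void)
{
    srand((unsigned) time(NULL));
    current_run_id = ((uint64_t) rand() << 32) ^ (uint64_t) rand();
}

/* A standby compares the run_id in incoming checkpoints with the one it
 * has seen so far; a silent change means the WAL source was brought up as
 * a master at some point, and the standby should complain loudly. */
static int
check_run_id(uint64_t seen_before, const CheckPointSketch *ckpt)
{
    if (seen_before != 0 && seen_before != ckpt->run_id)
    {
        fprintf(stderr, "WAL source changed identity; standby needs a re-sync\n");
        return -1;
    }
    return 0;
}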

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-14 02:18:44
Message-ID: AANLkTikECrMuZ1nXs+byQbch=Xj-fxD7yuhOW5Zbgata@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> There's another problem here we should think about, too.  Suppose you
> have a master and two standbys.  The master dies.  You promote one of
> the standbys, which turns out to be behind the other.  You then
> repoint the other standby at the one you promoted.  Congratulations,
> your database is now very possibly corrupt, and you may very well get
> no warning of that fact.  It seems to me that we would be well-advised
> to install some kind of bullet-proof safeguard against this kind of
> problem, so that you will KNOW that the standby needs to be re-synced.
>  I mention this because I have a vague feeling that timelines are
> supposed to prevent you from getting different WAL histories confused
> with each other, but they don't actually cover all the cases that can
> happen.
>

Why don't the usual protections kick in here? The new record read from
the location the xlog reader is expecting to find it has to have a
valid CRC and a correct back pointer to the previous record. If the
new wal sender is behind the old one then the new record it's sent
won't match up at all.
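
Those protections amount to something like the following sketch (a grossly
simplified record header, with zlib's crc32() standing in for the real WAL
CRC):

#include <stdbool.h>
#include <stdint.h>
#include <zlib.h>           /* crc32(); link with -lz */

typedef uint64_t XLogRecPtr;

/* Grossly simplified record header, for illustration only. */
typedef struct
{
    uint32_t    len;        /* length of the payload that follows */
    XLogRecPtr  prev;       /* start of the previous record */
    uint32_t    crc;        /* CRC over the payload */
} RecHeaderSketch;

static bool
record_looks_valid(const RecHeaderSketch *hdr, const unsigned char *payload,
                   XLogRecPtr expected_prev)
{
    /* Back-pointer check: the record must chain to the record just read. */
    if (hdr->prev != expected_prev)
        return false;

    /* CRC check: the payload must not be torn or from a different history. */
    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, payload, hdr->len);
    return (uint32_t) crc == hdr->crc;
}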

--
greg


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-14 11:47:35
Message-ID: AANLkTi=dkPjr7LVnzpDgKC9=fY3YVFwy6ehHd9H9hAzN@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 14, 2010 at 11:18 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Why don't the usual protections kick in here? The new record read from
> the location the xlog reader is expecting to find it has to have a
> valid CRC and a correct back pointer to the previous record.

Yep. In most cases, those protections seem to be able to make the standby
notice the inconsistency of the WAL and then give up continuing replication.
But not in all cases. Can we regard those protections as a bullet-proof
safeguard?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-14 12:19:07
Message-ID: AANLkTimPgK=HOg6U+=bkRNKQ_EWkS7zVK2srV100sOFb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 13, 2010 at 10:18 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> There's another problem here we should think about, too.  Suppose you
>> have a master and two standbys.  The master dies.  You promote one of
>> the standbys, which turns out to be behind the other.  You then
>> repoint the other standby at the one you promoted.  Congratulations,
>> your database is now very possibly corrupt, and you may very well get
>> no warning of that fact.  It seems to me that we would be well-advised
>> to install some kind of bullet-proof safeguard against this kind of
>> problem, so that you will KNOW that the standby needs to be re-synced.
>>  I mention this because I have a vague feeling that timelines are
>> supposed to prevent you from getting different WAL histories confused
>> with each other, but they don't actually cover all the cases that can
>> happen.
>>
>
> Why don't the usual protections kick in here? The new record read from
> the location the xlog reader is expecting to find it has to have a
> valid CRC and a correct back pointer to the previous record. If the
> new wal sender is behind the old one then the new record it's sent
> won't match up at all.

There's some kind of logic that rewinds to the beginning of the WAL
segment and tries to replay from there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-21 00:49:06
Message-ID: 201010210049.o9L0n6114296@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> > I don't see this as needing any implementation any more complicated than
> > the usual way such timeouts are handled. Note how long you've been
> > trying to reach the standby. Default to -1 for forever. And if you hit
> > the timeout, mark the standby as degraded and force them to do a proper
> > resync when they disconnect. Once that's done, then they can re-enter
> > sync rep mode again, via the same process a new node would have done so.
>
> Well, actually, that's *considerably* more complicated than just a
> timeout. How are you going to "mark the standby as degraded"? The
> standby can't keep that information, because it's not even connected
> when the master makes the decision. ISTM that this requires
>
> 1. a unique identifier for each standby (not just role names that
> multiple standbys might share);
>
> 2. state on the master associated with each possible standby -- not just
> the ones currently connected.
>
> Both of those are perhaps possible, but the sense I have of the
> discussion is that people want to avoid them.
>
> Actually, #2 seems rather difficult even if you want it. Presumably
> you'd like to keep that state in reliable storage, so it survives master
> crashes. But how you gonna commit a change to that state, if you just
> lost every standby (suppose master's ethernet cable got unplugged)?
> Looks to me like it has to be reliable non-replicated storage. Leaving
> aside the question of how reliable it can really be if not replicated,
> it's still the case that we have noplace to put such information given
> the WAL-is-across-the-whole-cluster design.

I assumed we would have a parameter called "sync_rep_failure" that would
take a command, and that command would be called when communication to the
slave was lost. If you restart, it tries again and might call the
function again.
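
A rough sketch of what such a hook could boil down to (the parameter name
is just the suggestion above, nothing implemented; the command path is
purely illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical setting: shell command to run when we lose the sync slave. */
static const char *sync_rep_failure_command = "/usr/local/bin/page-the-dba.sh";

static void
report_sync_rep_failure(void)
{
    if (sync_rep_failure_command == NULL || sync_rep_failure_command[0] == '\0')
        return;

    int rc = system(sync_rep_failure_command);
    if (rc != 0)
        fprintf(stderr, "sync_rep_failure command exited with status %d\n", rc);
}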

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +