Re: Sync Rep v19

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Sync Rep v19
Date: 2011-03-03 10:53:08
Message-ID: 1299149588.1974.9651.camel@ebony


Latest version of Sync Rep, which includes substantial internal changes
and simplifications from the previous version (25-30 changes).

Includes all outstanding technical comments, typos and docs. I will
continue to work on self review and test myself, though actively
encourage others to test and report issues.

Interesting changes

* docs updated

* names listed in synchronous_standby_names are now in priority order

* synchronous_standby_names = "*" matches all standby names

* pg_stat_replication now shows standby priority - this is an ordinal
number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
standby".

The only *currently* outstanding point of discussion is the "when to
wait" debate, on which we aren't moving quickly towards consensus at
this stage. I see that as a "How should it work?" debate and something
we can chew over during Alpha/Beta, not as an immediate blocker to
commit.

Please comment on the patch and also watch changes to the repo
git://github.com/simon2ndQuadrant/postgres.git

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
sync_rep.v19.context.patch text/x-patch 62.2 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 15:02:43
Message-ID: AANLkTimrATFERqbNZij2mb7uFBYZsVnqiFXGDJtdDqqA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 3, 2011 at 7:53 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Latest version of Sync Rep, which includes substantial internal changes
> and simplifications from the previous version (25-30 changes).
>
> Includes all outstanding technical comments, typos and docs. I will
> continue to work on self review and test myself, though actively
> encourage others to test and report issues.

Thanks for the patch!

> * synchronous_standby_names = "*" matches all standby names

Using '*' as the default seems likely to lead to performance degradation
when an unexpected synchronous standby connects.

> * pg_stat_replication now shows standby priority - this is an ordinal
> number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
> standby".

monitoring.sgml should be updated.

Though I've not read the whole patch yet, here are my current comments:

Using MyProc->lwWaiting and lwWaitLink for backends to wait for replication
looks fragile. Since they are also used by lwlocks, their values can
change unexpectedly. Instead, how about defining dedicated variables for
replication?

+    else if (WaitingForSyncRep)
+    {
+        /*
+         * This must NOT be a FATAL message. We want the state of the
+         * transaction being aborted to be indeterminate to ensure that
+         * the transaction completion guarantee is never broken.
+         */

The backend can reach this code path after returning the commit to the client.
Instead, how about doing this in EndCommand, to close the connection before
returning the commit?

+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_priority = walsnd->sync_standby_priority;
+ LWLockRelease(SyncRepLock);

LW_SHARED can be used here, instead.

+    /*
+     * Wait no longer if we have already reached our LSN
+     */
+    if (XLByteLE(XactCommitLSN, queue->lsn))
+    {
+        /* No need to wait */
+        LWLockRelease(SyncRepLock);
+        return;
+    }

It might take a long time to acquire SyncRepLock, so how about comparing
our LSN with WalSnd->flush before this point?

replication_timeout_client depends on GetCurrentTransactionStopTimestamp().
That's OK in the COMMIT case, but in the PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED cases it seems problematic, because they don't call
SetCurrentTransactionStopTimestamp().

In SyncRepWaitOnQueue, the backend can theoretically call WaitLatch() again
after being woken by the latch. In that case, the "timeout" should be
recalculated; otherwise it could take unexpectedly long for the timeout
to fire.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 16:09:03
Message-ID: 1299168543.10703.38.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
> > * synchronous_standby_names = "*" matches all standby names
>
> Using '*' as the default seems likely to lead to performance degradation
> when an unexpected synchronous standby connects.

You can configure it however you wish. It seemed better to have an out
of the box setting that was useful.

> > * pg_stat_replication now shows standby priority - this is an ordinal
> > number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
> > standby".
>
> monitoring.sgml should be updated.

Didn't think it needed to be, but I've added a few lines to explain.

> Though I've not read the whole patch yet, here are my current comments:
>
> Using MyProc->lwWaiting and lwWaitLink for backends to wait for replication
> looks fragile. Since they are also used by lwlocks, their values can
> change unexpectedly. Instead, how about defining dedicated variables for
> replication?

Yes, I think the queue stuff needs a rewrite now.

> +    else if (WaitingForSyncRep)
> +    {
> +        /*
> +         * This must NOT be a FATAL message. We want the state of the
> +         * transaction being aborted to be indeterminate to ensure that
> +         * the transaction completion guarantee is never broken.
> +         */
>
> The backend can reach this code path after returning the commit to the client.
> Instead, how about doing this in EndCommand, to close the connection before
> returning the commit?

OK, will look.

> + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> + sync_priority = walsnd->sync_standby_priority;
> + LWLockRelease(SyncRepLock);
>
> LW_SHARED can be used here, instead.

Seemed easier to keep it simple and have all lockers use LW_EXCLUSIVE.
But I've changed it for you.

> +    /*
> +     * Wait no longer if we have already reached our LSN
> +     */
> +    if (XLByteLE(XactCommitLSN, queue->lsn))
> +    {
> +        /* No need to wait */
> +        LWLockRelease(SyncRepLock);
> +        return;
> +    }
>
> It might take a long time to acquire SyncRepLock, so how about comparing
> our LSN with WalSnd->flush before this point?

If we're not the sync standby and we need to take over the role of sync
standby, we may need to issue a wakeup even though our standby reached
that LSN some time before. So we need to check each time.

> replication_timeout_client depends on GetCurrentTransactionStopTimestamp().
> That's OK in the COMMIT case, but in the PREPARE TRANSACTION, COMMIT PREPARED
> and ROLLBACK PREPARED cases it seems problematic, because they don't call
> SetCurrentTransactionStopTimestamp().

Shame on them!

Seems reasonable that they should call
SetCurrentTransactionStopTimestamp().

I don't want to make a special case there for prepared transactions.

> In SyncRepWaitOnQueue, the backend can theoretically call WaitLatch() again
> after being woken by the latch. In that case, the "timeout" should be
> recalculated; otherwise it could take unexpectedly long for the timeout
> to fire.

That was originally modelled on the way the statement_timeout timer
works. If it gets nudged and wakes up too early it puts itself back to
sleep to wake up at the same time again.

I've renamed the variables to make that clearer and edited slightly.
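To make the fixed-deadline behaviour concrete, here is a minimal standalone sketch (plain milliseconds, hypothetical names, not the patch code): the deadline is computed once from the start time, and every wakeup asks only for the remainder, so an early wakeup never extends the total wait.

```c
#include <assert.h>

/*
 * Hypothetical sketch of a fixed-deadline wait. The deadline is derived
 * once from the start time; each (possibly spurious) wakeup computes only
 * the time still remaining. Passing the full timeout to every sleep
 * instead would let early wakeups push the expiry further and further out.
 */
static long
remaining_ms(long start_ms, long timeout_ms, long now_ms)
{
    long deadline_ms = start_ms + timeout_ms;

    return (now_ms >= deadline_ms) ? 0 : deadline_ms - now_ms;
}
```

A wait loop would sleep for remaining_ms(...) each iteration and treat a result of 0 as timeout expiry, no matter how many times the latch was nudged early.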

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 17:51:11
Message-ID: m21v2oytww.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
>> > * synchronous_standby_names = "*" matches all standby names
>>
>> Using '*' as the default seems likely to lead to performance degradation
>> when an unexpected synchronous standby connects.
>
> You can configure it however you wish. It seemed better to have an out
> of the box setting that was useful.

Well the HBA still needs some opening before anyone can claim to be a
standby. I guess the default line would be commented out and no standby
would be accepted as synchronous by default, assuming this GUC is sighup.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 18:14:06
Message-ID: 1299176046.10703.392.camel@ebony
Lists: pgsql-hackers

On Thu, 2011-03-03 at 18:51 +0100, Dimitri Fontaine wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
> >> > * synchronous_standby_names = "*" matches all standby names
> >>
> >> Using '*' as the default seems likely to lead to performance degradation
> >> when an unexpected synchronous standby connects.
> >
> > You can configure it however you wish. It seemed better to have an out
> > of the box setting that was useful.
>
> Well the HBA still needs some opening before anyone can claim to be a
> standby. I guess the default line would be commented out and no standby
> would be accepted as synchronous by default, assuming this GUC is sighup.

The patch sets "*" as the default, so all standbys are synchronous by
default.

Would you prefer it if it was blank, meaning no standbys are
synchronous, by default?

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 18:23:48
Message-ID: AANLkTim+UCAS_cyGT_0SFhcGS61cAM3BcvroOR8t8WfM@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 3, 2011 at 1:14 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, 2011-03-03 at 18:51 +0100, Dimitri Fontaine wrote:
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> > On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
>> >> > * synchronous_standby_names = "*" matches all standby names
>> >>
>> >> Using '*' as the default seems likely to lead to performance degradation
>> >> when an unexpected synchronous standby connects.
>> >
>> > You can configure it however you wish. It seemed better to have an out
>> > of the box setting that was useful.
>>
>> Well the HBA still needs some opening before anyone can claim to be a
>> standby.  I guess the default line would be commented out and no standby
>> would be accepted as synchronous by default, assuming this GUC is sighup.
>
> The patch sets "*" as the default, so all standbys are synchronous by
> default.
>
> Would you prefer it if it was blank, meaning no standbys are
> synchronous, by default?

I think * is a reasonable default.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 19:17:53
Message-ID: 1299179873.10703.545.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
> +    else if (WaitingForSyncRep)
> +    {
> +        /*
> +         * This must NOT be a FATAL message. We want the state of the
> +         * transaction being aborted to be indeterminate to ensure that
> +         * the transaction completion guarantee is never broken.
> +         */
>
> The backend can reach this code path after returning the commit to the client.
> Instead, how about doing this in EndCommand, to close the connection before
> returning the commit?

I don't really understand this comment.

You can't get there after returning the COMMIT message. Once we have
finished waiting we set WaitingForSyncRep = false, before we return to
RecordTransactionCommit() and continue from there.

Anyway, this is code in the interrupt handler and only gets executed
when we receive SIGTERM for a fast shutdown.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 21:27:35
Message-ID: 4D7007C7.4070001@gmail.com
Lists: pgsql-hackers

On 2011-03-03 11:53, Simon Riggs wrote:
> Latest version of Sync Rep, which includes substantial internal changes
> and simplifications from the previous version (25-30 changes).
>
> Includes all outstanding technical comments, typos and docs. I will
> continue to work on self review and test myself, though actively
> encourage others to test and report issues.
>
> Interesting changes
>
> * docs updated
>
> * names listed in synchronous_standby_names are now in priority order
>
> * synchronous_standby_names = "*" matches all standby names
>
> * pg_stat_replication now shows standby priority - this is an ordinal
> number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
> standby".
Some initial remarks:

1) this works nicely:
application_name not in synchronous_standby_names -> sync_priority = 0 (OK)
change synchronous_standby_names to default *, reload conf ->
sync_priority = 1 (OK)

message in log file
LOG: 00000: standby "walreceiver" is now the synchronous standby with
priority 1

2) priorities
I have to get used to mapping the integers to their synchronous-replication
meanings.
0 -> asynchronous
1 -> the synchronous standby that is waited for
2 and higher -> potential syncs

Could it be hidden from the user? I liked asynchronous / synchronous /
potential synchronous

then the log message could be
LOG: 00000: standby "walreceiver" is now the synchronous standby

3) walreceiver is the default application name - could there be problems
when a second standby with that name connects (of course the same
question holds for two standbys with the same non-default application_name)?

regards
Yeb Havinga


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 21:49:07
Message-ID: 1299188947.10703.1162.camel@ebony
Lists: pgsql-hackers

On Thu, 2011-03-03 at 22:27 +0100, Yeb Havinga wrote:
> On 2011-03-03 11:53, Simon Riggs wrote:
> > Latest version of Sync Rep, which includes substantial internal changes
> > and simplifications from the previous version (25-30 changes).
> >
> > Includes all outstanding technical comments, typos and docs. I will
> > continue to work on self review and test myself, though actively
> > encourage others to test and report issues.
> >
> > Interesting changes
> >
> > * docs updated
> >
> > * names listed in synchronous_standby_names are now in priority order
> >
> > * synchronous_standby_names = "*" matches all standby names
> >
> > * pg_stat_replication now shows standby priority - this is an ordinal
> > number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
> > standby".
> Some initial remarks:
>
> 1) this works nicely:
> application_name not in synchronous_standby_names -> sync_priority = 0 (OK)
> change synchronous_standby_names to default *, reload conf ->
> sync_priority = 1 (OK)
>
> message in log file
> LOG: 00000: standby "walreceiver" is now the synchronous standby with
> priority 1
>
> 2) priorities
> I have to get used to mapping the integers to their synchronous-replication
> meanings.
> 0 -> asynchronous
> 1 -> the synchronous standby that is waited for
> 2 and higher -> potential syncs
>
> Could it be hidden from the user? I liked asynchronous / synchronous /
> potential synchronous

Yes, that sounds good. I will leave it as it is now to gain other
comments since this need not delay commit.

> then the log message could be
> LOG: 00000: standby "walreceiver" is now the synchronous standby

The priority is mentioned in the LOG message, so you can understand what
happens when multiple standbys connect.

e.g.

if you have synchronous_standby_names = 'a, b, c'

and the standbys then connect in the order b, c, a, you will see these log
messages:

LOG: standby "b" is now the synchronous standby with priority 2
LOG: standby "a" is now the synchronous standby with priority 1

It's designed so that no matter which order the standbys arrive in, the
highest-priority standby makes it to the front in the end.
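The ordering rule can be sketched as a standalone function (hypothetical names, not the patch code): a standby's priority is the 1-based position of its application_name in the list, "*" matches anything, and 0 means it is not a sync candidate.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch of the priority rule. The names array stands in
 * for the parsed synchronous_standby_names value; priority 1 is the most
 * preferred, and 0 means "not a sync standby".
 */
static int
standby_priority(const char *standby_name, const char *names[], int nnames)
{
    int i;

    for (i = 0; i < nnames; i++)
    {
        if (strcmp(names[i], "*") == 0 ||
            strcmp(names[i], standby_name) == 0)
            return i + 1;
    }
    return 0;
}
```

With {"a", "b", "c"}, standby "b" gets priority 2 and "a" priority 1, matching the two log lines above: whichever connected standby holds the smallest priority number ends up as the sync standby.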

> 3) walreceiver is the default application name - could there be problems
> when a second standby with that name connects (of course the same
> question holds for two standbys with the same non-default application_name)?

That's documented: in that case which standby is sync is indeterminate.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-03 22:01:07
Message-ID: 8349.1299189667@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Anyway, this is code in the interrupt handler and only gets executed
> when we receive SIGTERM for a fast shutdown.

I trust it's not getting *directly* executed from the interrupt handler,
at least not without ImmediateInterruptOK.

regards, tom lane


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 04:27:50
Message-ID: AANLkTikfKNvSwunOf2Nyk+vdTS4H3qFWz=BM5pTAC1kj@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> Anyway, this is code in the interrupt handler and only gets executed
>> when we receive SIGTERM for a fast shutdown.
>
> I trust it's not getting *directly* executed from the interrupt handler,
> at least not without ImmediateInterruptOK.

Yes, the backend waits for replication while cancel/die interrupts are
blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't lead the
waiting backend there directly; the backend reaches that code path
after returning the result.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 04:35:17
Message-ID: AANLkTinDU=fHfN6mcw=8Q2KCeb9_Z1rhsku4F93fbjVk@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 1:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>> Anyway, this is code in the interrupt handler and only gets executed
>>> when we receive SIGTERM for a fast shutdown.
>>
>> I trust it's not getting *directly* executed from the interrupt handler,
>> at least not without ImmediateInterruptOK.
>
> Yes, the backend waits for replication while cancel/die interrupts are
> blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't lead the
> waiting backend there directly; the backend reaches that code path
> after returning the result.

BTW, this is true in COMMIT and PREPARE cases, and false in
COMMIT PREPARED and ROLLBACK PREPARED cases. In the
latter cases, HOLD_INTERRUPTS() is not called before waiting for
replication.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 06:16:08
Message-ID: 1299219368.10703.2838.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-03-04 at 13:35 +0900, Fujii Masao wrote:
> On Fri, Mar 4, 2011 at 1:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> >>> Anyway, this is code in the interrupt handler and only gets executed
> >>> when we receive SIGTERM for a fast shutdown.
> >>
> >> I trust it's not getting *directly* executed from the interrupt handler,
> >> at least not without ImmediateInterruptOK.
> >
> > Yes, the backend waits for replication while cancel/die interrupts are
> > blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't lead the
> > waiting backend there directly; the backend reaches that code path
> > after returning the result.
>
> BTW, this is true in COMMIT and PREPARE cases,

CommitTransaction() calls HOLD_INTERRUPTS() and then RESUME_INTERRUPTS(),
which was reasonable before we started waiting for syncrep. The
interrupt does occur *before* we send the message back, but doesn't work
effectively at interrupting the wait in the way you would like.

If we RESUME_INTERRUPTS() prior to waiting and then HOLD again, that
would allow all signals, not just SIGTERM. We would need to selectively
reject everything except SIGTERM.

Ideas?

Alter ProcessInterrupts() to accept an interrupt if ProcDiePending &&
WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little
scary, but looks like it will work.
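The acceptance test being proposed can be sketched in isolation (hypothetical names; the real check would live inside ProcessInterrupts()): a die request gets through even while interrupts are held off, but only when the backend is blocked waiting for sync rep.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the proposed check. Normally a holdoff count
 * above zero suppresses interrupt processing; the special case lets a
 * pending die request through while the backend waits for sync rep.
 */
static bool
die_interrupt_accepted(bool proc_die_pending, bool waiting_for_sync_rep,
                       int interrupt_holdoff_count)
{
    if (interrupt_holdoff_count == 0)
        return proc_die_pending;    /* normal path: nothing held off */
    return proc_die_pending && waiting_for_sync_rep;
}
```

Written this way, the scary part is visible: the holdoff count is deliberately ignored for exactly one combination of flags, and every other interrupt remains blocked.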

> and false in
> COMMIT PREPARED and ROLLBACK PREPARED cases. In the
> latter cases, HOLD_INTERRUPTS() is not called before waiting for
> replication.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
signal_filter.patch text/x-patch 868 bytes

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 07:42:11
Message-ID: AANLkTi=xVKEvVvp6cyMWsr4SgwNAWOuesKZVUtd7dEH7@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 12:02 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Though I've not read the whole patch yet, here are my current comments:

Here are some more comments:

+#replication_timeout_client = 120 # 0 means wait forever

Typo: s/replication_timeout_client/sync_replication_timeout

+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ wait_start, timeout))

If SetCurrentTransactionStopTimestamp() has already been called (i.e., the
COMMIT case), the return value of GetCurrentTransactionStopTimestamp() is
the same as "wait_start". So in that case, the timeout never expires.

+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);

How about changing the message to something like "waiting for %X/%X"
(%X/%X indicates the LSN which the backend is waiting for)?

Please initialize MyProc->procWaitLink to NULL in InitProcess(), as is
done for MyProc->lwWaitLink.

+ /*
+ * We're a potential sync standby. Release waiters if we are the
+ * highest priority standby. We do this even if the standby is not yet
+ * caught up, in case this is a restart situation and
+ * there are backends waiting for us. That allows backends to exit the
+ * wait state even if new backends cannot yet enter the wait state.
+ */

I don't think it's a good idea to switch a high-priority standby that has
not caught up to being the sync standby, especially when there is already
another sync standby. That degrades replication from sync to async for
a while, even though there is a caught-up sync standby.

+    if (walsnd->pid != 0 &&
+        walsnd->sync_standby_priority > 0 &&
+        (priority == 0 ||
+         priority < walsnd->sync_standby_priority))
+    {
+        priority = walsnd->sync_standby_priority;
+        syncWalSnd = walsnd;
+    }

According to the code, the last-named standby has the highest priority,
but the documentation says the opposite.

ISTM the waiting backends can be sent the wake-up signal by the
walsender multiple times, since the walsender doesn't remove any
entry from the queue. Isn't this unsafe, or a waste of cycles?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 08:34:08
Message-ID: AANLkTik2e04oyaozgZ8hXSeVTwhsRmZPiqcJ5+1-3WEK@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 3:16 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> CommitTransaction() calls HOLD_INTERRUPT() and then RESUME_INTERRUPTS(),
> which was reasonable before we started waiting for syncrep. The
> interrupt does occur *before* we send the message back, but doesn't work
> effectively at interrupting the wait in the way you would like.
>
> If we RESUME_INTERRUPTS() prior to waiting and then HOLD again that
> would allow all signals not just SIGTERM. We would need to selectively
> reject everything except SIGTERM messages.
>
> Ideas?
>
> Alter ProcessInterrupts() to accept an interrupt if ProcDiePending &&
> WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little
> scary, but looks like it will work.

If shutdown is requested before WaitingForSyncRep is set to TRUE and
after HOLD_INTERRUPTS() is called, the waiting backends cannot be
interrupted.

SIGTERM can be sent by pg_terminate_backend(). So we should check
whether shutdown has been requested before emitting the WARNING and
closing the connection. If it hasn't been requested yet, I think it's
safe to return the success indication to the client.

I think that it's safer to close the connection and terminate the backend
after releasing all of its resources. So, as I suggested before, how about
doing that in EndCommand()?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 10:21:37
Message-ID: 1299234097.10703.3615.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-03-04 at 17:34 +0900, Fujii Masao wrote:
> On Fri, Mar 4, 2011 at 3:16 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > CommitTransaction() calls HOLD_INTERRUPTS() and then RESUME_INTERRUPTS(),
> > which was reasonable before we started waiting for syncrep. The
> > interrupt does occur *before* we send the message back, but doesn't work
> > effectively at interrupting the wait in the way you would like.
> >
> > If we RESUME_INTERRUPTS() prior to waiting and then HOLD again, that
> > would allow all signals, not just SIGTERM. We would need to selectively
> > reject everything except SIGTERM.
> >
> > Ideas?
> >
> > Alter ProcessInterrupts() to accept an interrupt if ProcDiePending &&
> > WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little
> > scary, but looks like it will work.
>
> If shutdown is requested before WaitingForSyncRep is set to TRUE and
> after HOLD_INTERRUPTS() is called, the waiting backends cannot be
> interrupted.
>
> SIGTERM can be sent by pg_terminate_backend(). So we should check
> whether shutdown has been requested before emitting the WARNING and
> closing the connection. If it hasn't been requested yet, I think it's
> safe to return the success indication to the client.

I'm not sure if that matters. Nobody apart from the postmaster knows
about a shutdown. All the other processes know is that they received
SIGTERM, which as you say could have been a specific user action aimed
at an individual process.

We need a way to end the wait state explicitly, so it seems easier to
make SIGTERM the initiating action, no matter how it is received.

The alternative is to handle it this way:
1) set something in shared memory
2) set latch of all backends
3) have the backends read shared memory and then end the wait

Who would do (1) and (2)? Not the backend, since it's sleeping; not the
postmaster, since it's shared memory; nor a WALSender, because it might
not be there.

Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
why we need that? Might it introduce other issues?

> I think that it's safer to close the connection and terminate the backend
> after releasing all of its resources. So, as I suggested before, how about
> doing that in EndCommand()?

Yes, if we don't use SIGTERM then we would use EndCommand().

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 10:51:03
Message-ID: 1299235863.10703.3715.camel@ebony
Lists: pgsql-hackers

On Fri, 2011-03-04 at 16:42 +0900, Fujii Masao wrote:
> On Fri, Mar 4, 2011 at 12:02 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > Though I've not read the whole patch yet, here are my current comments:
>
> Here are some more comments:
>
> +#replication_timeout_client = 120 # 0 means wait forever
>
> Typo: s/replication_timeout_client/sync_replication_timeout

Done

> + else if (timeout > 0 &&
> + TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
> + wait_start, timeout))
>
> If SetCurrentTransactionStopTimestamp() has already been called (i.e., the
> COMMIT case), the return value of GetCurrentTransactionStopTimestamp() is
> the same as "wait_start". So in that case, the timeout never expires.

Don't understand (still)

> + strcpy(new_status + len, " waiting for sync rep");
> + set_ps_display(new_status, false);
>
> How about changing the message to something like "waiting for %X/%X"
> (%X/%X indicates the LSN which the backend is waiting for)?

Done

> Please initialize MyProc->procWaitLink to NULL in InitProcess(), as is
> done for MyProc->lwWaitLink.

I'm rewriting that aspect now.

> + /*
> + * We're a potential sync standby. Release waiters if we are the
> + * highest priority standby. We do this even if the standby is not yet
> + * caught up, in case this is a restart situation and
> + * there are backends waiting for us. That allows backends to exit the
> + * wait state even if new backends cannot yet enter the wait state.
> + */
>
> I don't think that it's good idea to switch the high priority standby which has
> not caught up, to the sync one, especially when there is already another
> sync standby. Because that degrades replication from sync to async for
> a while, even though there is sync standby which has caught up.

OK, that wasn't really my intention. Changed.

> + if (walsnd->pid != 0 &&
> + walsnd->sync_standby_priority > 0 &&
> + (priority == 0 ||
> + priority < walsnd->sync_standby_priority))
> + {
> + priority = walsnd->sync_standby_priority;
> + syncWalSnd = walsnd;
> + }
>
> According to the code, the last named standby has highest priority. But the
> document says the opposite.

Priority is a difficult word here since "1" is the highest priority. I
deliberately avoided using the word "highest" in the code for that
reason.

The code above finds the standby with the lowest non-zero priority,
which is correct as documented.

> ISTM the waiting backends can be sent the wake-up signal by the
> walsender multiple times since the walsender doesn't remove any
> entry from the queue. Isn't this unsafe? waste of the cycle?

It's OK to set a latch that is already set. It's unlikely to wake someone
twice before they can remove themselves.
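The reason duplicate wake-ups are harmless can be shown with a toy latch. This is a hypothetical one-flag sketch, not PostgreSQL's real latch (which also handles races with sleeping processes): setting is idempotent, so two signals collapse into one observed wake-up.

```c
#include <stdbool.h>

typedef struct { volatile bool is_set; } Latch;

/* Setting a latch is idempotent: setting one that is already set just
 * leaves it set, so a walsender signalling the same backend twice does
 * no harm -- the waiter simply finds the latch set once. */
void SetLatch(Latch *l)   { l->is_set = true; }
void ResetLatch(Latch *l) { l->is_set = false; }
```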

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 11:08:17
Message-ID: 1299236897.10703.3783.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2011-03-04 at 10:51 +0000, Simon Riggs wrote:

> > + else if (timeout > 0 &&
> > + TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
> > + wait_start, timeout))
> >
> > If SetCurrentTransactionStopTimestamp() is called before (i.e., COMMIT case),
> > the return value of GetCurrentTransactionStopTimestamp() is the same as
> > "wait_start". So, in this case, the timeout never expires.
>
> Don't understand (still)

OK, coffee has seeped into brain now, thanks.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 11:24:06
Message-ID: 4D70CBD6.2020004@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2011-03-03 11:53, Simon Riggs wrote:
> Latest version of Sync Rep, which includes substantial internal changes
> and simplifications from previous version. (25-30 changes).
Testing more with the post v19 version from github with HEAD

commit 009875662e1b47012e1f4b7d30eb9e238d1937f6
Author: Simon Riggs <simon(at)2ndquadrant(dot)com>
Date: Fri Mar 4 06:13:43 2011 +0000

Allow SIGTERM messages in ProcessInterrupts() even when interrupts are
held, if WaitingForSyncRep

1) unexpected behaviour
- master has synchronous_standby_names = 'standby1,standby2,standby3'
- standby with 'standby2' connects first.
- LOG: 00000: standby "standby2" is now the synchronous standby with
priority 2

I'm still confused by the priority numbers. At first I thought that
priority 1 meant: this is the one that is currently waited for. Now I'm
not sure if this is the first potential standby that is not used, or
whether it is actually the one waited for.
What I expected was that it would connect with priority 1, and then
if standby1 connects, it would become the one with prio 1 and standby2
the one with prio 2.

2) unexpected behaviour
- continued from above
- standby with 'asyncone' name connects next
- no log message on master

I expected a log message along the lines 'standby "asyncone" is now an
asynchronous standby'

3) more about log messages
- didn't get a log message that the asyncone standby stopped
- didn't get a log message that standby1 connected with priority 1
- after stop / start of the master, again only got a log that standby2
connected with priority 2
- pg_stat_replication showed both standby1 and standby2 with the correct prio#

4) More about the priority stuff. At this point I figured out prio 2 can
also be 'the real sync'. Still, I'd prefer in pg_stat_replication a
boolean that clearly shows 'this is the one', with a source that is
intimately connected to the syncrep implementation, instead of a
different implementation of 'if lowest connected priority and > 0, then
sync is true'. If there are two different implementations, there is room
for differences, which doesn't feel right.

5) performance.
Seems to have dropped by a few dozen percent. With v17 I earlier got ~650 tps
and after some more tuning over 900 tps. Now with roughly the same setup
I get ~ 550 tps. Both versions on the same hardware, both compiled
without debugging, and I used the same postgresql.conf start config.

I'm currently thinking about a failure test that would check if a commit
has really waited for the standby. What's the worst thing to do to a
master server? Ideas are welcome :-)

#!/bin/sh
psql -c "create a big table with generate_series"
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger

regards,
Yeb Havinga


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 11:45:41
Message-ID: 1299239141.10703.3954.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2011-03-04 at 12:24 +0100, Yeb Havinga wrote:
> On 2011-03-03 11:53, Simon Riggs wrote:
> > Latest version of Sync Rep, which includes substantial internal changes
> > and simplifications from previous version. (25-30 changes).
> Testing more with the post v19 version from github with HEAD

Thanks

> commit 009875662e1b47012e1f4b7d30eb9e238d1937f6
> Author: Simon Riggs <simon(at)2ndquadrant(dot)com>
> Date: Fri Mar 4 06:13:43 2011 +0000
>
> Allow SIGTERM messages in ProcessInterrupts() even when interrupts are
> held, if WaitingForSyncRep
>
>
> 1) unexpected behaviour
> - master has synchronous_standby_names = 'standby1,standby2,standby3'
> - standby with 'standby2' connects first.
> - LOG: 00000: standby "standby2" is now the synchronous standby with
> priority 2
>
> I'm still confused by the priority numbers. At first I thought that
> priority 1 meant: this is the one that is currently waited for. Now I'm
> not sure if this is the first potential standby that is not used, or
> that it is actually the one waited for.
> What I expected was that it would be connected with priority 1. And then
> if the standby1 connect, it would become the one with prio1 and standby2
> with prio2.

The priority refers to the order in which that standby is listed in
synchronous_standby_names. That is not dependent upon who is currently
connected. It doesn't mean the order in which the currently connected
standbys will become the sync standby.

So the log message allows you to work out that "standby2" is connected
and will operate as sync standby until something mentioned earlier in
synchronous_standby_names, in this case standby1, connects.

> 2) unexpected behaviour
> - continued from above
> - standby with 'asyncone' name connects next
> - no log message on master
>
> I expected a log message along the lines 'standby "asyncone" is now an
> asynchronous standby'

That would introduce messages where there currently aren't any, so I
left that out. I'll put it in for clarity.

> 3) more about log messages
> - didn't get a log message that the asyncone standby stopped
OK
> - didn't get a log message that standby1 connected with priority 1
Bad
> - after stop / start master, again only got a log that standby2
> connectied with priority 2
Bad
> - pg_stat_replication showed both standb1 and standby2 with correct prio#
Good

Please send me log output at DEBUG3 offline.

> 4) More about the priority stuff. At this point I figured out prio 2 can
> also be 'the real sync'. Still I'd prefer in pg_stat_replication a
> boolean that clearly shows 'this is the one', with a source that is
> intimately connected to the syncrep implemenation, instead of a
> different implementation of 'if lowest connected priority and > 0, then
> sync is true. If there are two different implementations, there is room
> for differences, which doesn't feel right.

OK

> 5) performance.
> Seems to have dropped a a few dozen %. With v17 I earlier got ~650 tps
> and after some more tuning over 900 tps. Now with roughly the same setup
> I get ~ 550 tps. Both versions on the same hardware, both compiled
> without debugging, and I used the same postgresql.conf start config.

Will need to re-look at performance after commit

> I'm currently thinking about a failure test that would check if a commit
> has really waited for the standby. What's the worst thing to do to a
> master server? Ideas are welcome :-)
>
> #!/bin/sh
> psql -c "create a big table with generate_series"
> echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
>
> regards,
> Yeb Havinga
>

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 13:57:46
Message-ID: 4D70EFDA.7050903@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2011-03-04 12:24, Yeb Havinga wrote:
> I'm currently thinking about a failure test that would check if a
> commit has really waited for the standby. What's the worst thing to do
> to a master server? Ideas are welcome :-)
>
> #!/bin/sh
> psql -c "create a big table with generate_series"
> echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
Did that with both a sync and async standby server, then promoted both
replicas.

Both replicas had the complete big table. Maybe the async server was
somehow 'saved' by the master waiting for the sync server? Test repeated
with only the async one connected.

The master then shows this at restart
LOG: 00000: record with zero length at 4/B2CD3598
LOG: 00000: redo done at 4/B2CD3558
LOG: 00000: last completed transaction was at log time 2011-03-04
14:43:31.02041+01

The async promoted server
LOG: 00000: record with zero length at 4/B2CC9260
LOG: 00000: redo done at 4/B2CC9220
LOG: 00000: last completed transaction was at log time 2011-03-04
14:43:31.018444+01

Even though the async server had the complete relation I created,
something was apparently done just before the reboot.

Test repeated with only 1 sync standby

Then on master at recovery
LOG: 00000: record with zero length at 4/D1051C88
LOG: 00000: redo done at 4/D1051C48
LOG: 00000: last completed transaction was at log time 2011-03-04
14:52:11.035188+01

on the sync promoted server
LOG: 00000: redo done at 4/D1051C48
LOG: 00000: last completed transaction was at log time 2011-03-04
14:52:11.035188+01

Nice!

regards,
Yeb Havinga


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 14:15:50
Message-ID: AANLkTinW_vj24e=e7-aPXT+qwUmtruVN1sZHj-WMVgtP@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 7:51 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> +             if (walsnd->pid != 0 &&
>> +                     walsnd->sync_standby_priority > 0 &&
>> +                     (priority == 0 ||
>> +                      priority < walsnd->sync_standby_priority))
>> +             {
>> +                      priority = walsnd->sync_standby_priority;
>> +                      syncWalSnd = walsnd;
>> +             }
>>
>> According to the code, the last named standby has highest priority. But the
>> document says the opposite.
>
> Priority is a difficult word here since "1" is the highest priority. I
> deliberately avoided using the word "highest" in the code for that
> reason.
>
> The code above finds the lowest non-zero standby, which is correct as
> documented.

Hmm... that seems to find the standby with the highest priority number.
And I could confirm that on my box. Please see the following: the
priority (= 2) of the synchronous standby (its sync_state is SYNC) is
higher than that (= 1) of the potential one (its sync_state is POTENTIAL).

postgres=# SHOW synchronous_standby_names ;
synchronous_standby_names
---------------------------
one, two
(1 row)

postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | STREAMING | 2 | SYNC
(2 rows)
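The behaviour Fujii observes can be reproduced with a minimal sketch of the selection loop. The WalSnd struct here is a hypothetical two-field stand-in, not the real one; the flag just switches between the submitted comparison (`<`, which keeps replacing the candidate with larger priority numbers) and the lowest-number-wins comparison the documentation describes.

```c
#include <stddef.h>

typedef struct { int pid; int sync_standby_priority; } WalSnd;

/* Mimics the patch's selection loop over the walsender array.
 * use_buggy_cmp != 0 reproduces "priority < walsnd->sync_standby_priority"
 * (the submitted code), which ends up picking the numerically largest
 * priority; with ">" instead, priority 1 beats priority 2 as documented. */
const WalSnd *pick_sync_standby(const WalSnd *walsnds, int n, int use_buggy_cmp)
{
    int priority = 0;
    const WalSnd *sync = NULL;

    for (int i = 0; i < n; i++)
    {
        const WalSnd *w = &walsnds[i];
        if (w->pid != 0 && w->sync_standby_priority > 0)
        {
            int better = use_buggy_cmp
                ? (priority == 0 || priority < w->sync_standby_priority)
                : (priority == 0 || priority > w->sync_standby_priority);
            if (better)
            {
                priority = w->sync_standby_priority;
                sync = w;
            }
        }
    }
    return sync;
}
```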

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 15:18:38
Message-ID: 1299251918.10703.4842.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:

> postgres=# SELECT application_name, state, sync_priority, sync_state
> FROM pg_stat_replication;
> application_name | state | sync_priority | sync_state
> ------------------+-----------+---------------+------------
> one | STREAMING | 1 | POTENTIAL
> two | STREAMING | 2 | SYNC
> (2 rows)

Bug! Thanks.

Fixed

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 20:04:48
Message-ID: AANLkTi=b16crDtFNU+iaQ1TGw1dPB1c9c79fFuEJ09AQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> SIGTERM can be sent by pg_terminate_backend(). So we should check
>> whether shutdown is requested before emitting WARNING and closing
>> the connection. If it's not requested yet, I think that it's safe to return the
>> success indication to the client.
>
> I'm not sure if that matters. Nobody apart from the postmaster knows
> about a shutdown. All the other processes know is that they received
> SIGTERM, which as you say could have been a specific user action aimed
> at an individual process.
>
> We need a way to end the wait state explicitly, so it seems easier to
> make SIGTERM the initiating action, no matter how it is received.
>
> The alternative is to handle it this way
> 1) set something in shared memory
> 2) set latch of all backends
> 3) have the backends read shared memory and then end the wait
>
> Who would do (1) and (2)? Not the backend, its sleeping, not the
> postmaster its shm, nor a WALSender cos it might not be there.
>
> Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
> why we need that? Might it introduce other issues?

On second thought...

I was totally wrong. Preventing the backend from returning the commit
when shutdown is requested doesn't help to avoid data loss at all.
Even without a shutdown, the following simple scenario can cause data loss.

1. The replication connection is closed because of a network outage.
2. Though replication has not been completed, the waiting backend is
released since the timeout expires. Then it returns success to
the client.
3. The primary crashes, and then the clusterware promotes the standby,
which doesn't have the latest change from the primary, to new primary.
Data loss happens!

In the first place, there are two kinds of data loss:

(A) Physical data loss
This is the case where we can never retrieve the committed data
physically. For example, if the storage of the standalone server gets
corrupted, we would lose some data forever. To avoid this type of
data loss, we would have to choose the "wait-forever" behavior. But
as I said in upthread, we can decrease the risk of this data loss to
a certain extent by spending much money on the storage. So, if that
cost is less than the cost which we have to pay when down-time
happens, we don't need to choose the "wait-forever" option.

(B) Logical data loss
This is the case where we think wrongly that the committed data
has been lost while we can actually retrieve it physically. For example,
in the above three-steps scenario, we can read all the committed data
from two servers physically even after failover. But since the client
attempts to read data only from new primary, some data looks lost to
the client. The "wait-forever" behavior can help also to avoid this type
of data loss. And, another way is to STONITH the standby before the
timeout releases any waiting backend. If so, we can completely prevent
the outdated standby from being brought up, and can avoid logical data
loss. According to my quick research, in DRBD, the "dopd (DRBD
outdate-peer daemon)" plays that role.

What I'd like to avoid is (B). Though (A) is more serious problem than (B),
we already have some techniques to decrease the risk of (A). But not
(B), I think.

The "wait-forever" might be a straightforward approach against (B). But
this option prevents transactions from running not only when the
synchronous standby goes away, but also when the primary is invoked
first or when the standby is promoted at failover. Since the availability
of the database service decreases very much, I don't want to use that.

Keeping transactions waiting in the latter two cases would be required
to avoid (A), but not (B). So I think that we can relax the "wait-forever"
option so that it allows not-replicated transactions to complete only in
those cases. IOW, when we initially start the primary, the backends
don't wait at all for new standby to connect. And, while new primary is
running alone after failover, the backends don't wait either. Only
when replication connection is closed while streaming WAL to sync
standby, the backends wait until new sync standby has connected and
replication has been completed. Even in this case, if we want to
improve the service availability, we only have to make something like
dopd STONITH the outdated standby, and then request the primary
to release the waiting backends. So I think that the interface to
request that release should be implemented.
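The relaxed "wait-forever" policy described above can be summed up in a small decision table. This is a hedged sketch with hypothetical state names, not an interface from the patch; it only encodes the three cases Fujii distinguishes, plus the explicit-release escape hatch.

```c
#include <stdbool.h>

typedef enum {
    PRIMARY_INITIAL_START,     /* primary started; no standby has connected yet */
    PRIMARY_AFTER_PROMOTION,   /* standby just promoted, running alone */
    SYNC_STANDBY_STREAMING,    /* sync standby connected and streaming */
    SYNC_STANDBY_LOST          /* connection dropped while streaming to sync standby */
} ReplState;

/* Relaxed "wait-forever": commits block only while a sync standby is
 * streaming, or after it was lost mid-stream -- and in the latter case
 * only until an operator explicitly releases the waiters (e.g. after
 * STONITHing the outdated standby). */
bool commit_must_wait(ReplState state, bool release_requested)
{
    switch (state)
    {
        case PRIMARY_INITIAL_START:
        case PRIMARY_AFTER_PROMOTION:
            return false;                 /* don't wait in these cases */
        case SYNC_STANDBY_STREAMING:
            return true;                  /* normal sync rep wait */
        case SYNC_STANDBY_LOST:
            return !release_requested;    /* wait until explicitly released */
    }
    return false;
}
```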

Fortunately, that partial "wait-forever" behavior has already been
implemented in Simon's patch with the client timeout = 0 (disable).
If he implements the interface to release the waiting backends,
I'm OK with his design about when to release the backends for 9.1
(unless I'm missing something).

Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 20:27:07
Message-ID: AANLkTiniec0XKJdNDKqu8R52BsOUOeupiRXE6vEDjtVK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 3:04 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> The "wait-forever" might be a straightforward approach against (B). But
> this option prevents transactions from running not only when the
> synchronous standby goes away, but also when the primary is invoked
> first or when the standby is promoted at failover. Since the availability
> of the database service decreases very much, I don't want to use that.

I continue to think that wait-forever is the most sensible option. If
you want all of your data on the disks of two machines before the
commit is ack'd, I think you probably want that all the time. The
second scenario you mentioned ("when the standby is promoted at
failover") is quite easy to handle. If you don't want synchronous
replication after a standby promotion, then configure the master to do
synchronous replication and the slave not to do synchronous
replication. Similarly, if you've got an existing machine that is not
doing synchronous replication and you want to start, fire up the
standby in asynchronous mode and switch to synchronous replication
after it has fully caught up. It seems to me that we're bent on
providing a service that does synchronous replication except when it
first starts up or when the timeout expires or when the phase of the
moon is waxing gibbous, and I don't get the point of that. If I ask
for synchronous replication, I want it to be synchronous until I
explicitly turn it off. Otherwise, when I fail over, how do I know if
I've got all my transactions, or not?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 21:18:39
Message-ID: AANLkTimoQ1PTk3p1R89EPyzGMiMLj+KFb=sHB8cJ1m+Y@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> The patch sets "*" as the default, so all standbys are synchronous by
>> default.
>>
>> Would you prefer it if it was blank, meaning no standbys are
>> synchronous, by default?
>
> I think * is a reasonable default.
>

Actually I would prefer to have standbys asynchronous by default...
though it is true that there will be no waits until I set
synchronous_replication to on... 1) it could be confusing to see a
SYNC standby in pg_stat_replication by default when I wanted all of
them to be async, 2) also * will give priority 1 to all standbys, so it
doesn't seem like a very useful out-of-the-box configuration; better
to make the DBA write the standby names in the order they want
--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 21:53:59
Message-ID: AANLkTimM-NP6iLTJG0FjyZFtnwJ1woPC6BnfUUa+2wYd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 4, 2011 at 4:18 PM, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
> On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>
>>> The patch sets "*" as the default, so all standbys are synchronous by
>>> default.
>>>
>>> Would you prefer it if it was blank, meaning no standbys are
>>> synchronous, by default?
>>
>> I think * is a reasonable default.
>>
>
> Actually i would prefer to have standbys asynchronous by default...
> though is true that there will be no waits until i set
> synchronous_replication to on... 1) it could be confusing to see a
> SYNC standby in pg_stat_replication by default when i wanted all of
> them to be async, 2) also * will give priority 1 to all standbys so it
> doesn't seem like a very useful out-of-the-box configuration, better
> to make the dba to write the standby names in the order they want

Mmm, good points.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 22:20:14
Message-ID: 4D71659E.906@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2011-03-04 22:18, Jaime Casanova wrote:
> On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>>> The patch sets "*" as the default, so all standbys are synchronous by
>>> default.
>>>
>>> Would you prefer it if it was blank, meaning no standbys are
>>> synchronous, by default?
>> I think * is a reasonable default.
>>
> Actually i would prefer to have standbys asynchronous by default...
> though is true that there will be no waits until i set
> synchronous_replication to on... 1) it could be confusing to see a
> SYNC standby in pg_stat_replication by default when i wanted all of
> them to be async,
I see no problem with * for synchronous_standby_names, such that *if*
synchronous_replication = on, then all standbys are sync. Also, for the
beginning experimenter with sync rep: what would you expect after only
turning synchronous_replication = on? ISTM that is better than having to
change two parameters from their defaults to get a replica in sync mode.
> 2) also * will give priority 1 to all standbys so it
> doesn't seem like a very useful out-of-the-box configuration, better
> to make the dba to write the standby names in the order they want
>
As somebody with a use case for two hardware-wise equal sync replicas for
the same master (and a single async replica), the whole ordering of sync
standbys is too much of a feature anyway, since it will cause unnecessary
'which one is the sync replica' switching. Besides that, letting all syncs
have the same priority sounds like the only thing the server can do if
the DBA has not specified it explicitly. I would see it as an improvement
if order in standby_names didn't imply priority, and if priority could
be specified with another parameter (default: all sync priorities
the same)

regards,
Yeb Havinga


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-04 22:28:42
Message-ID: 1299277722.10703.7120.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 2011-03-05 at 05:04 +0900, Fujii Masao wrote:
> On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> SIGTERM can be sent by pg_terminate_backend(). So we should check
> >> whether shutdown is requested before emitting WARNING and closing
> >> the connection. If it's not requested yet, I think that it's safe to return the
> >> success indication to the client.
> >
> > I'm not sure if that matters. Nobody apart from the postmaster knows
> > about a shutdown. All the other processes know is that they received
> > SIGTERM, which as you say could have been a specific user action aimed
> > at an individual process.
> >
> > We need a way to end the wait state explicitly, so it seems easier to
> > make SIGTERM the initiating action, no matter how it is received.
> >
> > The alternative is to handle it this way
> > 1) set something in shared memory
> > 2) set latch of all backends
> > 3) have the backends read shared memory and then end the wait
> >
> > Who would do (1) and (2)? Not the backend, its sleeping, not the
> > postmaster its shm, nor a WALSender cos it might not be there.
> >
> > Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
> > why we need that? Might it introduce other issues?
>
> On the second thought...
>
> I was totally wrong. Preventing the backend from returning the commit
> when shutdown is requested doesn't help to avoid the data loss at all.
> Without shutdown, the following simple scenario can cause data loss.
>
> 1. Replication connection is closed because of network outage.
> 2. Though replication has not been completed, the waiting backend is
> released since the timeout expires. Then it returns the success to
> the client.
> 3. The primary crashes, and then the clusterware promotes the standby
> which doesn't have the latest change on the primary to new primary.
> Data lost happens!

Yes, that can happen. As people will no doubt observe, this seems to be
an argument for wait-forever. What we actually need is a wait that lasts
longer than it takes for us to decide to fail over, if the standby is
actually up and this is some kind of split brain situation. That way the
clients are still waiting when failover occurs. WAL is missing, but
since we didn't acknowledge the client we are OK to treat that situation
as if it were an abort.

> In the first place, there are two kinds of data loss:
>
> (A) Pysical data loss
> This is the case where we can never retrieve the committed data
> physically. For example, if the storage of the standalone server gets
> corrupted, we would lose some data forever. To avoid this type of
> data loss, we would have to choose the "wait-forever" behavior. But
> as I said in upthread, we can decrease the risk of this data loss to
> a certain extent by spending much money on the storage. So, if that
> cost is less than the cost which we have to pay when down-time
> happens, we don't need to choose the "wait-forever" option.
>
> (B) Logical data loss
> This is the case where we think wrongly that the committed data
> has been lost while we can actually retrieve it physically. For example,
> in the above three-steps scenario, we can read all the committed data
> from two servers physically even after failover. But since the client
> attempts to read data only from new primary, some data looks lost to
> the client. The "wait-forever" behavior can help also to avoid this type
> of data loss. And, another way is to STONITH the standby before the
> timeout releases any waiting backend. If so, we can completely prevent
> the outdated standby from being brought up, and can avoid logical data
> loss. According to my quick research, in DRBD, the "dopd (DRBD
> outdate-peer daemon)" plays that role.
>
> What I'd like to avoid is (B). Though (A) is more serious problem than (B),
> we already have some techniques to decrease the risk of (A). But not
> (B), I think.
>
> The "wait-forever" might be a straightforward approach against (B). But
> this option prevents transactions from running not only when the
> synchronous standby goes away, but also when the primary is invoked
> first or when the standby is promoted at failover. Since the availability
> of the database service decreases very much, I don't want to use that.
>
> Keeping transactions waiting in the latter two cases would be required
> to avoid (A), but not (B). So I think that we can relax the "wait-forever"
> option so that it allows non-replicated transactions to complete only in
> those cases. IOW, when we initially start the primary, the backends
> don't wait at all for a new standby to connect. And, while the new primary
> is running alone after failover, the backends don't wait either. Only
> when the replication connection is closed while streaming WAL to the sync
> standby do the backends wait until a new sync standby has connected and
> replication has caught up. Even in this case, if we want to
> improve the service availability, we have only to make something like
> dopd to STONITH the outdated standby, and then request the primary
> to release the waiting backends. So I think that the interface to
> request that release should be implemented.
>
> Fortunately, that partial "wait-forever" behavior has already been
> implemented in Simon's patch with the client timeout = 0 (disable).
> If he implements the interface to release the waiting backends,
> I'm OK with his design about when to release the backends for 9.1
> (unless I'm missing something).

Almost-working patch attached for the above feature. Time to stop for
the day. Patch against current repo version.

Current repo version attached here also (v20), which includes all fixes
to all known technical issues, major polishing etc..

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
shutdown_without_completion.v1.patch text/x-patch 8.3 KB
sync_rep.v20.patch text/x-patch 56.6 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 07:13:11
Message-ID: AANLkTini5Y0xB24WrP_tBvCp5NyPfLtxsVWWE=xtbFp5@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Almost-working patch attached for the above feature. Time to stop for
> the day. Patch against current repo version.
>
> Current repo version attached here also (v20), which includes all fixes
> to all known technical issues, major polishing etc..

Thanks for the patch. The code around the wait list now looks much
simpler than before! Here are the comments:

@@ -1337,6 +1352,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
<snip>
+ if (walsnd->pid != 0 && walsnd->state == WALSNDSTATE_STREAMING)
+ {
+ sync_priority[i] = walsnd->sync_standby_priority;

This always reports the priority of a walsender in CATCHUP state as 0.
I don't think that priority should be reported as 0.

When a new standby with the same priority as the current sync standby
connects, the new standby can take over as the sync one even though the
current one is still running. This happens when the index of the WalSnd
slot that the new standby uses is ahead of the one the current standby
uses. People don't expect such an unexpected switchover, I think.

+ /*
+ * Assume the queue is ordered by LSN
+ */
+ if (XLByteLT(walsndctl->lsn, proc->waitLSN))
+ return numprocs;

The code to ensure the assumption needs to be added.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 11:04:45
Message-ID: 1299323085.10703.10996.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 16:13 +0900, Fujii Masao wrote:
> On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Almost-working patch attached for the above feature. Time to stop for
> > the day. Patch against current repo version.
> >
> > Current repo version attached here also (v20), which includes all fixes
> > to all known technical issues, major polishing etc..
>
> Thanks for the patch. Now the code about the wait list looks very
> simpler than before! Here are the comments:
>
> @@ -1337,6 +1352,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
> <snip>
> + if (walsnd->pid != 0 && walsnd->state == WALSNDSTATE_STREAMING)
> + {
> + sync_priority[i] = walsnd->sync_standby_priority;
>
> This always reports the priority of walsender in CATCHUP state as 0.
> I don't think that priority needs to be reported as 0.

Cosmetic change. We can do this, yes.

> When new standby which has the same priority as current sync standby
> connects, that new standby can switch to new sync one even though
> current one is still running. This happens when the index of WalSnd slot
> which new standby uses is ahead of that which current one uses. People
> don't expect such an unexpected switchover, I think.

It is documented that the selection of standby from a set of similar
priorities is indeterminate. Users don't like it, they can change it.

> + /*
> + * Assume the queue is ordered by LSN
> + */
> + if (XLByteLT(walsndctl->lsn, proc->waitLSN))
> + return numprocs;
>
> The code to ensure the assumption needs to be added.

Yes, just need to add the code for traversing list backwards.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 11:08:42
Message-ID: AANLkTikZwWkRxf2tYMc213Gk8U1xvtivvwN2xVmOw-mx@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Yes, that can happen. As people will no doubt observe, this seems to be
> an argument for wait-forever. What we actually need is a wait that lasts
> longer than it takes for us to decide to failover, if the standby is
> actually up and this is some kind of split brain situation. That way the
> clients are still waiting when failover occurs. WAL is missing, but
> since we didn't acknowledge the client we are OK to treat that situation
> as if it were an abort.

Does Oracle Data Guard in maximum availability mode behave that way?

I'm sure that you are implementing something like the maximum availability
mode rather than the maximum protection one. So I'd like to know how
the data-loss situation I described can be avoided in maximum availability
mode.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 12:21:26
Message-ID: 1299327686.10703.11428.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 11:04 +0000, Simon Riggs wrote:
> > + /*
> > + * Assume the queue is ordered by LSN
> > + */
> > + if (XLByteLT(walsndctl->lsn, proc->waitLSN))
> > + return numprocs;
> >
> > The code to ensure the assumption needs to be added.
>
> Yes, just need to add the code for traversing list backwards.

I've added code to shmqueue.c to allow this.

New version pushed.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 12:24:46
Message-ID: AANLkTikU8bBEb+k4GBKxhrqJYtHJ_bLHYZepgTrSyVaw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> It is documented that the selection of standby from a set of similar
> priorities is indeterminate. Users don't like it, they can change it.

That doesn't seem like a good argument to *change* the synchronous
standby once it's already set.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 12:49:46
Message-ID: 1299329386.10703.11610.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 07:24 -0500, Robert Haas wrote:
> On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > It is documented that the selection of standby from a set of similar
> > priorities is indeterminate. Users don't like it, they can change it.
>
> That doesn't seem like a good argument to *change* the synchronous
> standby once it's already set.

If the order is arbitrary, why does it matter if it changes?

The user has the power to specify a sequence, yet they have not done so.
They are told the results are indeterminate, which is accurate. I can
add the words "and may change as new standbys connect" if that helps.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 13:05:31
Message-ID: AANLkTi=AuKLJCSD7TcTHiO-JqseR1ybJM4KZ8H_LNFxg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, 2011-03-05 at 07:24 -0500, Robert Haas wrote:
>> On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> > It is documented that the selection of standby from a set of similar
>> > priorities is indeterminate. Users don't like it, they can change it.
>>
>> That doesn't seem like a good argument to *change* the synchronous
>> standby once it's already set.
>
> If the order is arbitrary, why does it matter if it changes?
>
> The user has the power to specify a sequence, yet they have not done so.
> They are told the results are indeterminate, which is accurate. I can
> add the words "and may change as new standbys connect" if that helps.

I just don't think that's very useful behavior. Suppose I have a
master and two standbys. Both are local (or both are remote with
equally good connectivity). When one of the standbys goes down, there
will be a hiccup (i.e. transactions will block trying to commit) until
that guy falls off and the other one takes over. Now, when he comes
back up again, I don't want the synchronous standby to change again;
that seems like a recipe for another hiccup. I think "who the current
synchronous standby is" should act as a tiebreak.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 13:44:26
Message-ID: AANLkTi=krMcai6B7q4bTGssW5uWX1br9mz7CANDENMSR@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 2:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > If the order is arbitrary, why does it matter if it changes?
> >
> > The user has the power to specify a sequence, yet they have not done so.
> > They are told the results are indeterminate, which is accurate. I can
> > add the words "and may change as new standbys connect" if that helps.
>
> I just don't think that's very useful behavior. Suppose I have a
> master and two standbys. Both are local (or both are remote with
> equally good connectivity). When one of the standbys goes down, there
> will be a hiccup (i.e. transactions will block trying to commit) until
> that guy falls off and the other one takes over. Now, when he comes
> back up again, I don't want the synchronous standby to change again;
> that seems like a recipe for another hiccup. I think "who the current
> synchronous standby is" should act as a tiebreak.
>

+1

TLDR part:

The first switch might be noticed by users because it takes tens of seconds
before the sync switch happens. The second hiccup is hardly noticeable.
However, limiting the number of sync-standby switches to the absolute
minimum is also good if, e.g. (if there were a hook for it), cluster
middleware is notified of the sync replica change. That might introduce a
race condition, or even be completely unreliable if the notification is
sent asynchronously; or it might introduce a longer lag if the master waits
for confirmation of the sync-replica-change message. At that point sync
replica changes become more expensive than they are currently.

regards,
Yeb Havinga


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 15:07:45
Message-ID: 1299337665.10703.12575.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 14:44 +0100, Yeb Havinga wrote:
> On Sat, Mar 5, 2011 at 2:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs
> <simon(at)2ndquadrant(dot)com> wrote:
> > If the order is arbitrary, why does it matter if it changes?
> >
> > The user has the power to specify a sequence, yet they have
> not done so.
> > They are told the results are indeterminate, which is
> accurate. I can
> > add the words "and may change as new standbys connect" if
> that helps.
>
>
> I just don't think that's very useful behavior. Suppose I
> have a
> master and two standbys. Both are local (or both are remote
> with
> equally good connectivity). When one of the standbys goes
> down, there
> will be a hiccup (i.e. transactions will block trying to
> commit) until
> that guy falls off and the other one takes over. Now, when he
> comes
> back up again, I don't want the synchronous standby to change
> again;
> that seems like a recipe for another hiccup. I think "who the
> current
> synchronous standby is" should act as a tiebreak.
>
>
> +1
>
> TLDR part:
>
> The first one might be noticed by users because it takes tens of
> seconds before the sync switch. The second hiccup is hardly noticable.
> However limiting the # switches of sync standby to the absolute
> minimum is also good if e.g. (if there was a hook for it) cluster
> middleware is notified of the sync replica change. That might either
> introduce a race condition or be even completely unreliable if the
> notify is sent asynchronous, or it might introduce a longer lag if the
> master waits for confirmation of the sync replica change message. At
> that point sync replica changes become more expensive than they are
> currently.

I'm not in favour.

If the user has a preferred order, they can specify it. If there is no
preferred order, how will we maintain that order?

What are the rules for maintaining this arbitrary order?

No, this is not something we need prior to commit, if ever.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 15:26:35
Message-ID: 1299338795.10703.12702.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote:
> On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Yes, that can happen. As people will no doubt observe, this seems to be
> > an argument for wait-forever. What we actually need is a wait that lasts
> > longer than it takes for us to decide to failover, if the standby is
> > actually up and this is some kind of split brain situation. That way the
> > clients are still waiting when failover occurs. WAL is missing, but
> > since we didn't acknowledge the client we are OK to treat that situation
> > as if it were an abort.
>
> Oracle Data Guard in the maximum availability mode behaves that way?
>
> I'm sure that you are implementing something like the maximum availability
> mode rather than the maximum protection one. So I'd like to know how
> the data loss situation I described can be avoided in the maximum availability
> mode.

This is important, so I am taking time to formulate a full reply.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 15:42:20
Message-ID: AANLkTinpg4MW6wdsKU2FkQ=rN8K2T_92s8ec-i5Ry=+b@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 9:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I've added code to shmqueue.c to allow this.
>
> New version pushed.

New comments;

It looks odd to report the sync_state of a walsender in BACKUP
state as ASYNC.

+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+ if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ }
+
+ if (MyProc != NULL)
+ DisownLatch(&MyProc->waitLatch);

Can MyProc really be NULL here? If yes, "MyProc != NULL" should be
checked before touching MyProc->syncrep_links.

Even if the postmaster dies, the waiting backend keeps waiting until
the timeout expires. Instead, shouldn't the backends periodically check
whether the postmaster is alive, and exit immediately if it's not,
as other processes do? If the timeout is disabled, such backends
would get stuck forever.

Though I commented about the issue related to shutdown, that was
pointless. So the change to ProcessInterrupts is not required unless we
find the need again. Sorry for the noise.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:17:29
Message-ID: AANLkTimCtke=G0YKakCDJoNMCnX6gc0LwB1_QrdXgT+x@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I'm not in favour.
>
> If the user has a preferred order, they can specify it. If there is no
> preferred order, how will we maintain that order?
>
> What are the rules for maintaining this arbitrary order?

Probably what Robert, Yeb and I have in mind is to leave the current
sync standby in sync mode until either its connection is closed
or a higher-priority standby connects. No complicated rule is
required.

To do that, how about tracking which standby is currently in
sync mode? Each walsender checks whether its priority is
higher than that of current sync one, and if yes, it takes over.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:31:12
Message-ID: 4033E806-F85B-4C86-BC8B-FDE15B7A8C16@gmail.com
Lists: pgsql-hackers

On Mar 5, 2011, at 11:17 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I'm not in favour.
>>
>> If the user has a preferred order, they can specify it. If there is no
>> preferred order, how will we maintain that order?
>>
>> What are the rules for maintaining this arbitrary order?
>
> Probably what Robert, Yeb and I think is to leave the current
> sync standby in sync mode until either its connection is closed
> or higher priority standby connects. No complicated rule is
> required.
>
> To do that, how about tracking which standby is currently in
> sync mode? Each walsender checks whether its priority is
> higher than that of current sync one, and if yes, it takes over.

That is precisely what I would expect to happen, and IMHO quite useful.

...Robert


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:42:13
Message-ID: AANLkTikvJD9bWxm7wSYSeu-pA1yHdF6Pw+ERB8jKbj--@mail.gmail.com
Lists: pgsql-hackers

El 05/03/2011 11:18, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com> escribió:
>
> On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com>
wrote:
> > I'm not in favour.
> >
> > If the user has a preferred order, they can specify it. If there is no
> > preferred order, how will we maintain that order?
> >
> > What are the rules for maintaining this arbitrary order?
>
> Probably what Robert, Yeb and I think is to leave the current
> sync standby in sync mode until either its connection is closed
> or higher priority standby connects. No complicated rule is
> required.
>

Wouldn't it be better to remove the code that manages * in
synchronous_standby_names? Once we do that, there is no chance of having
two standbys with the same priority.

After all, most of the time the DBA will need to change the * to a real
list of names anyway. At least IMHO

--
Jaime Casanova www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:49:43
Message-ID: 1299343783.10703.13295.camel@ebony
Lists: pgsql-hackers

On Sun, 2011-03-06 at 01:17 +0900, Fujii Masao wrote:
> On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > I'm not in favour.
> >
> > If the user has a preferred order, they can specify it. If there is no
> > preferred order, how will we maintain that order?
> >
> > What are the rules for maintaining this arbitrary order?
>
> Probably what Robert, Yeb and I think is to leave the current
> sync standby in sync mode until either its connection is closed
> or higher priority standby connects. No complicated rule is
> required.

No, it is complex. The code is intentionally stateless, so unless you
have a rule you cannot work out which one is the sync standby.

It is much more important to have robust takeover behaviour.

Changing this will require rethinking how that takeover works. And I'm
not doing that for something that is documented as "indeterminate".

If you care about the sequence then set the supplied parameter, which I
have gone to some trouble to provide.

> To do that, how about tracking which standby is currently in
> sync mode? Each walsender checks whether its priority is
> higher than that of current sync one, and if yes, it takes over.
>
> Regards,
>

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:51:18
Message-ID: 1299343878.10703.13304.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 11:42 -0500, Jaime Casanova wrote:
> El 05/03/2011 11:18, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com> escribió:
> >
> > On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> wrote:
> > > I'm not in favour.
> > >
> > > If the user has a preferred order, they can specify it. If there
> is no
> > > preferred order, how will we maintain that order?
> > >
> > > What are the rules for maintaining this arbitrary order?
> >
> > Probably what Robert, Yeb and I think is to leave the current
> > sync standby in sync mode until either its connection is closed
> > or higher priority standby connects. No complicated rule is
> > required.
> >
>
> It's not better to remove the code to manage * in
> synchronous_standby_names? Once we do that there is no chance of
> having 2 standbys with the same priority.

Yes, there is, because we don't do duplicate name checking.

I've changed the default so it is no longer "*" by default, to avoid
complaints.

> After all, most of the times the dba will need to change the * for a
> real list of names anyway. At least in IMHO
>
> --
> Jaime Casanova www.2ndQuadrant.com
>

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:53:20
Message-ID: 1299344000.10703.13319.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote:
> On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Yes, that can happen. As people will no doubt observe, this seems to be
> > an argument for wait-forever. What we actually need is a wait that lasts
> > longer than it takes for us to decide to failover, if the standby is
> > actually up and this is some kind of split brain situation. That way the
> > clients are still waiting when failover occurs. WAL is missing, but
> > since we didn't acknowledge the client we are OK to treat that situation
> > as if it were an abort.
>
> Oracle Data Guard in the maximum availability mode behaves that way?
>
> I'm sure that you are implementing something like the maximum availability
> mode rather than the maximum protection one. So I'd like to know how
> the data loss situation I described can be avoided in the maximum availability
> mode.

It can't. (Oracle or otherwise...)

Once we begin waiting for sync rep, if the transaction or backend ends
then other backends will be able to see the changed data. The only way
to prevent that is to shutdown the database to ensure that no readers or
writers have access to that.

Oracle's protection mechanism is to shutdown the primary if there is no
sync standby available. Maximum Protection. Any other mode must
therefore be less than maximum protection, according to Oracle, and me.
"Available" here means one that has not timed out, via parameter.

Shutting down the main server is cool, as long as you failover to one of
the standbys. If there aren't any standbys, or you don't have a
mechanism for switching quickly, you have availability problems.

What shutting down the server doesn't do is keep the data safe for
transactions that were in their commit-wait phase when the disconnect
occurs. That data exists, yet will not have been transferred to the
standby.

From now on, I also say we should wait forever. It is the safest mode and I
want no argument about whether sync rep is safe or not. We can introduce
a more relaxed mode later with high availability for the primary. That
is possible and in some cases desirable.

Now, when we lose last sync standby we have three choices:

1. reconnect the standby, or wait for a potential standby to catchup

2. immediate shutdown of master and failover to one of the standbys

3. reclassify an async standby as a sync standby

More than likely we would attempt to do (1) for a while, then do (2)

This means that when we startup the primary will freeze for a while
until the sync standbys connect. Which is OK, since they try to
reconnect every 5 seconds and we don't plan on shutting down the primary
much anyway.

I've removed the timeout parameter; if we begin waiting, we wait
until released, forever if that's how long it takes.

The recommendation to use more than one standby remains.

Fast shutdown will wake backends from their latch, and there is no longer
any changed interrupt behaviour.

synchronous_standby_names = '*' is no longer the default

On a positive note this is one less parameter and will improve
performance as well.

All above changes made.

Ready to commit, barring concrete objections to important behaviour.

I will do one final check tomorrow evening then commit.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
sync_rep.v21.context.patch text/x-patch 58.8 KB
sync_rep.v21.patch text/x-patch 52.6 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:56:19
Message-ID: 1299344179.10703.13340.camel@ebony
Lists: pgsql-hackers

On Sun, 2011-03-06 at 00:42 +0900, Fujii Masao wrote:
> On Sat, Mar 5, 2011 at 9:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > I've added code to shmqueue.c to allow this.
> >
> > New version pushed.
>
> New comments;

None of the requested changes are in v21, as yet.

> It looks odd to report the sync_state of walsender in BACKUP
> state as ASYNC.

Cool.

> +SyncRepCleanupAtProcExit(int code, Datum arg)
> +{
> + if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links)))
> + {
> + LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> + SHMQueueDelete(&(MyProc->syncrep_links));
> + LWLockRelease(SyncRepLock);
> + }
> +
> + if (MyProc != NULL)
> + DisownLatch(&MyProc->waitLatch);
>
> Can MyProc really be NULL here? If yes, "MyProc != NULL" should be
> checked before seeing MyProc->syncrep_links.

OK

> Even though postmaster dies, the waiting backend keeps waiting until
> the timeout expires. Instead, the backends should periodically check
> whether postmaster is alive, and then they should exit immediately
> if it's not alive, as well as other process does? If the timeout is
> disabled, such backends would get stuck infinitely.

Will wake them every 60 seconds

> Though I commented about the issue related to shutdown, that was
> pointless. So change of ProcessInterrupts is not required unless we
> find the need again. Sorry for the noise..

Yep, all gone now.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 16:58:31
Message-ID: AANLkTimGRV+czVDo2B_pLZ-gUprJ3VvMvyFyK1LQGCiW@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 6, 2011 at 12:42 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> New comments;

Another one;

+	long	timeout = SyncRepGetWaitTimeout();
<snip>
+	else if (timeout > 0 &&
+			 TimestampDifferenceExceeds(wait_start, now, timeout))
+	{

The third argument of TimestampDifferenceExceeds is expressed in
milliseconds, but you are wrongly passing it microseconds.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 17:25:31
Message-ID: AANLkTimkN6Eu1_6COLp2yrU2JmjSTB9vLi1BBFzvhgas@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 5:53 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>
> On a positive note this is one less parameter and will improve
> performance as well.
>
> All above changes made.
>
> Ready to commit, barring concrete objections to important behaviour.
>
> I will do one final check tomorrow evening then commit.
>

Will retest with the new version this evening. Also curious about the
performance improvement, since v17 seems to be the top scorer in that
department.

regards,
Yeb Havinga


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 17:59:50
Message-ID: AANLkTikfk9FfzHas1P0A1rK06gfSmseQbfHmkkzCth7d@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Even though postmaster dies, the waiting backend keeps waiting until
>> the timeout expires. Instead, the backends should periodically check
>> whether postmaster is alive, and then they should exit immediately
>> if it's not alive, as other processes do? If the timeout is
>> disabled, such backends would get stuck infinitely.
>
> Will wake them every 60 seconds

I don't really see why sync rep should be responsible for solving this
problem, which is an issue in many other situations as well, only for
itself. In fact I think I'd prefer that it didn't, and that we wait
for a more general solution that will actually fix this problem for
real.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 20:11:36
Message-ID: 4D7298F8.9070905@gmail.com
Lists: pgsql-hackers

On 2011-03-05 18:25, Yeb Havinga wrote:
> On Sat, Mar 5, 2011 at 5:53 PM, Simon Riggs <simon(at)2ndquadrant(dot)com
> <mailto:simon(at)2ndquadrant(dot)com>> wrote:
>
>
> On a positive note this is one less parameter and will improve
> performance as well.
>
> All above changes made.
>
> Ready to commit, barring concrete objections to important behaviour.
>
> I will do one final check tomorrow evening then commit.
>
>
> Will retest with new version this evening. Also curious to performance
> improvement, since v17 seems to be topscorer in that department.
Summary of preliminary testing:
1) it is confusing to show messages/ contents of stat_replication that
hints at syncrep, when synchronous_replication is on.
2) should guc settings for synchronous_replication be changed so it can
only be set in the config file, and hence only change with reload_conf()?
3) speed is comparable to v17 :-)

regards,
Yeb Havinga

So the biggest change is perhaps that you cannot start 'working'
immediately after an initdb with synchronous_replication=on, without a
connected standby; I needed to create a role for the repuser to make a
backup, but the master halted. The initial bootstrapping has to be done
with synchronous_replication = off. So I did that, made a backup, and
started the standbys while still in non-synchronous mode. What followed
was confusing:

LOG: 00000: standby "standby2" is now the synchronous standby with
priority 2

postgres=# show synchronous_replication ; show
synchronous_standby_names; select application_name,state,sync_state from
pg_stat_replication;
synchronous_replication
-------------------------
off
(1 row)
synchronous_standby_names
----------------------------
standby1,standby2,standby3
(1 row)
application_name | state | sync_state
------------------+-----------+------------
standby2 | STREAMING | SYNC
asyncone | STREAMING | ASYNC
(2 rows)

Is it really sync?
pgbench test got 1464 tps... seems a bit high.

psql postgres, set synchronous_replication = on;
- no errors, and show after disconnect showed this parameter was still
on. My guess: we have syncrep! A restart or reload config was not necessary.
pgbench test got 1460 tps.

pg_reload_conf(); with syncrep = on in postgresql.conf
pgbench test got 819 tps

So now this is synchronous.
Disabled the asynchronous standby
pgbench test got 920 tps.

I also got my first > 1000 tps score :-) (yeah, you have to believe me,
there really was a sync standby server)

$ pgbench -c 10 -M prepared -T 30 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 50
query mode: prepared
number of clients: 10
number of threads: 1
duration: 30 s
number of transactions actually processed: 30863
tps = 1027.493807 (including connections establishing)
tps = 1028.183618 (excluding connections establishing)


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 20:17:56
Message-ID: 4D729A74.5090704@gmail.com
Lists: pgsql-hackers

On 2011-03-05 21:11, Yeb Havinga wrote:
> Summary of preliminary testing:
> 1) it is confusing to show messages/ contents of stat_replication that
> hints at syncrep, when synchronous_replication is on.
s/on/off/

Also forgot to mention these tests are against the latest v21 syncrep patch.

Y


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-05 22:10:29
Message-ID: AANLkTimic_A7nMZ5mS2Au+UqBWWwp-8o+v8nezcW9-if@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 5, 2011 at 3:11 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
>
> Summary of preliminary testing:
> 1) it is confusing to show messages/ contents of stat_replication that hints
> at syncrep, when synchronous_replication is on.

[for the record, Yeb explains he means OFF, not on...]

the thing is that once you put a server name in
synchronous_standby_names, that standby is declared to be synchronous,
and because synchronous_replication can change at any time for any
given backend, we need to know the priority of any declared sync
standby. if you want pure async standbys, remove their names from
synchronous_standby_names
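
In postgresql.conf terms (a hypothetical fragment, not taken from the patch), that looks like:

```
# Candidate sync standbys, in priority order; any standby whose
# application_name is absent from this list is purely asynchronous.
synchronous_standby_names = 'standby1,standby2'

# May still be toggled per backend, hence the need to know priorities
# for every declared sync standby at all times.
synchronous_replication = on
```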

> 2) should guc settings for synchronous_replication be changed so it can only
> be set in the config file, and hence only change with reload_conf()?

no. the thing is that we can determine for which data we want to pay
the price of synchrony and for which we don't

> 3) speed is comparable to v17 :-)
>

yeah... it's a lot better than before, good work Simon :)

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 05:27:49
Message-ID: AANLkTimO4d7kNQJMv7-rBPAnnShG1gO=dMkSeyhKz2q-@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 6, 2011 at 2:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> Even though postmaster dies, the waiting backend keeps waiting until
>>> the timeout expires. Instead, the backends should periodically check
>>> whether postmaster is alive, and then they should exit immediately
>>> if it's not alive, as other processes do? If the timeout is
>>> disabled, such backends would get stuck infinitely.
>>
>> Will wake them every 60 seconds
>
> I don't really see why sync rep should be responsible for solving this
> problem, which is an issue in many other situations as well, only for
> itself. In fact I think I'd prefer that it didn't, and that we wait
> for a more general solution that will actually fix this problem for
> real.

I agree, if such a general solution will be committed together with sync rep.
Otherwise, because of sync rep, the backend can easily get stuck *infinitely*.
When the postmaster is not alive, all the existing walsenders exit immediately
and no new walsender can appear, so there is no way to release the
waiting backend. I think some solution is required for this issue, which is
likely to happen.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 07:51:05
Message-ID: AANLkTimCgX2gysx0TbXtf-d605zz1CFCOahS1MKcCet7@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 6, 2011 at 1:53 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote:
>> On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> > Yes, that can happen. As people will no doubt observe, this seems to be
>> > an argument for wait-forever. What we actually need is a wait that lasts
>> > longer than it takes for us to decide to failover, if the standby is
>> > actually up and this is some kind of split brain situation. That way the
>> > clients are still waiting when failover occurs. WAL is missing, but
>> > since we didn't acknowledge the client we are OK to treat that situation
>> > as if it were an abort.
>>
>> Oracle Data Guard in the maximum availability mode behaves that way?
>>
>> I'm sure that you are implementing something like the maximum availability
>> mode rather than the maximum protection one. So I'd like to know how
>> the data loss situation I described can be avoided in the maximum availability
>> mode.
>
> It can't. (Oracle or otherwise...)
>
> Once we begin waiting for sync rep, if the transaction or backend ends
> then other backends will be able to see the changed data. The only way
> to prevent that is to shutdown the database to ensure that no readers or
> writers have access to that.
>
> Oracle's protection mechanism is to shutdown the primary if there is no
> sync standby available. Maximum Protection. Any other mode must
> therefore be less than maximum protection, according to Oracle, and me.
> "Available" here means one that has not timed out, via parameter.
>
> Shutting down the main server is cool, as long as you failover to one of
> the standbys. If there aren't any standbys, or you don't have a
> mechanism for switching quickly, you have availability problems.
>
> What shutting down the server doesn't do is keep the data safe for
> transactions that were in their commit-wait phase when the disconnect
> occurs. That data exists, yet will not have been transferred to the
> standby.
>
> From now on, I also say we should wait forever. It is the safest mode and I
> want no argument about whether sync rep is safe or not. We can introduce
> a more relaxed mode later with high availability for the primary. That
> is possible and in some cases desirable.
>
> Now, when we lose last sync standby we have three choices:
>
>  1. reconnect the standby, or wait for a potential standby to catchup
>
>  2. immediate shutdown of master and failover to one of the standbys
>
>  3. reclassify an async standby as a sync standby
>
> More than likely we would attempt to do (1) for a while, then do (2)
>
> This means that when we startup the primary will freeze for a while
> until the sync standbys connect. Which is OK, since they try to
> reconnect every 5 seconds and we don't plan on shutting down the primary
> much anyway.
>
> I've removed the timeout parameter, plus if we begin waiting we wait
> until released, forever if that's how long it takes.
>
> The recommendation to use more than one standby remains.
>
> Fast shutdown will wake backends from their latch and there isn't any
> changed interrupt behaviour any more.
>
> synchronous_standby_names = '*' is no longer the default
>
> On a positive note this is one less parameter and will improve
> performance as well.
>
> All above changes made.
>
> Ready to commit, barring concrete objections to important behaviour.
>
> I will do one final check tomorrow evening then commit.

I agree with this change.

One comment: what about introducing a built-in function to wake up all the
waiting backends? When the replication connection is closed, if we STONITH
the standby, we can safely (with respect to logical, though not physical,
data loss) switch the primary to standalone mode. But there is no way to
wake up the waiting backends for now. Setting synchronous_replication to OFF
and reloading the configuration file doesn't affect the existing waiting
backends. The attached patch introduces the "pg_wakeup_all_waiters"
(better name?) function, which wakes up all the backends on the queue.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
wakeup_waiters_v1.patch text/x-diff 5.9 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 07:58:19
Message-ID: AANLkTimaPDuViRMxM0iptvCEizGv2-fgDLO-Sa28d+ar@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> One comment; what about introducing built-in function to wake up all the
> waiting backends? When replication connection is closed, if we STONITH
> the standby, we can safely (for not physical data loss but logical one)
> switch the primary to standalone mode. But there is no way to wake up
> the waiting backends for now. Setting synchronous_replication to OFF
> and reloading the configuration file doesn't affect the existing waiting
> backends. The attached patch introduces the "pg_wakeup_all_waiters"
> (better name?) function which wakes up all the backends on the queue.

If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 08:21:13
Message-ID: 1299399673.1696.30.camel@ebony
Lists: pgsql-hackers

On Sun, 2011-03-06 at 14:27 +0900, Fujii Masao wrote:
> On Sun, Mar 6, 2011 at 2:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >>> Even though postmaster dies, the waiting backend keeps waiting until
> >>> the timeout expires. Instead, the backends should periodically check
> >>> whether postmaster is alive, and then they should exit immediately
> >>> if it's not alive, as other processes do? If the timeout is
> >>> disabled, such backends would get stuck infinitely.
> >>
> >> Will wake them every 60 seconds
> >
> > I don't really see why sync rep should be responsible for solving this
> > problem, which is an issue in many other situations as well, only for
> > itself. In fact I think I'd prefer that it didn't, and that we wait
> > for a more general solution that will actually fix this problem for
> > real.
>
> I agree, if such a general solution will be committed together with sync rep.
> Otherwise, because of sync rep, the backend can easily get stuck *infinitely*.
> When the postmaster is not alive, all the existing walsenders exit immediately
> and no new walsender can appear, so there is no way to release the
> waiting backend. I think some solution is required for this issue, which is
> likely to happen.

Completely agree.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 08:26:18
Message-ID: 1299399978.1696.64.camel@ebony
Lists: pgsql-hackers

On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote:
> On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > One comment; what about introducing built-in function to wake up all the
> > waiting backends? When replication connection is closed, if we STONITH
> > the standby, we can safely (for not physical data loss but logical one)
> > switch the primary to standalone mode. But there is no way to wake up
> > the waiting backends for now. Setting synchronous_replication to OFF
> > and reloading the configuration file doesn't affect the existing waiting
> > backends. The attached patch introduces the "pg_wakeup_all_waiters"
> > (better name?) function which wakes up all the backends on the queue.
>
> If unfortunately all connection slots are used by backends waiting for
> replication, we cannot execute such a function. So it makes more sense
> to introduce something like "pg_ctl standalone" command?

Well, there is one way to end the wait: shutdown, or use
pg_terminate_backend().

If you simply end the wait you will get COMMIT messages.

What I would like to do is commit the "safe" patch now. We can then
discuss whether it is safe and desirable to relax some aspects of that
during beta.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 08:27:01
Message-ID: 1299400021.1696.72.camel@ebony
Lists: pgsql-hackers

On Sun, 2011-03-06 at 01:58 +0900, Fujii Masao wrote:
> On Sun, Mar 6, 2011 at 12:42 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > New comments;
>
> Another one;
>
> +	long	timeout = SyncRepGetWaitTimeout();
> <snip>
> +	else if (timeout > 0 &&
> +			 TimestampDifferenceExceeds(wait_start, now, timeout))
> +	{
>
> The third argument of TimestampDifferenceExceeds is expressed in
> milliseconds, but you are wrongly passing it microseconds.

Just for the record: that code section is now removed in v21

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 14:26:22
Message-ID: AANLkTinNZjF7KMWMP=mTViADhEjqzn1iHSRqNUJk1kEY@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 6, 2011 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote:
>> On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> > One comment; what about introducing built-in function to wake up all the
>> > waiting backends? When replication connection is closed, if we STONITH
>> > the standby, we can safely (for not physical data loss but logical one)
>> > switch the primary to standalone mode. But there is no way to wake up
>> > the waiting backends for now. Setting synchronous_replication to OFF
>> > and reloading the configuration file doesn't affect the existing waiting
>> > backends. The attached patch introduces the "pg_wakeup_all_waiters"
>> > (better name?) function which wakes up all the backends on the queue.
>>
>> If unfortunately all connection slots are used by backends waiting for
>> replication, we cannot execute such a function. So it makes more sense
>> to introduce something like "pg_ctl standalone" command?
>
> Well, there is one way to end the wait: shutdown, or use
> pg_terminate_backend().

Immediate shutdown can release the wait. But smart and fast shutdown
cannot. Also pg_terminate_backend() cannot. Since a backend is waiting
on the latch and InterruptHoldoffCount is not zero, only SetLatch() or
SIGQUIT can cause it to end.

> If you simply end the wait you will get COMMIT messages.
>
> What I would like to do is commit the "safe" patch now. We can then
> discuss whether it is safe and desirable to relax some aspects of that
> during beta.

OK if changing some aspects is acceptable during beta.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-06 14:44:58
Message-ID: AANLkTimdtaZqYZuuQKo_s2iZMFs04J-geB1aX+2rGBxJ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
> On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>
>> If unfortunately all connection slots are used by backends waiting for
>> replication, we cannot execute such a function. So it makes more sense
>> to introduce something like "pg_ctl standalone" command?
>
> If it is only for shutdown, maybe pg_ctl stop -m standalone?

It's not only for shutdown but also for running the primary in standalone
mode, so something like "pg_ctl standalone" is better.

For now I think that a pg_ctl command is better than a built-in function
because sometimes we might want to wake waiters up even during shutdown,
in order to cause shutdown to end. During shutdown, the server doesn't
accept any new connections (even from the standby). So, without something
like "pg_ctl standalone", there is no way to cause shutdown to end.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 16:14:33
Message-ID: AANLkTi=uoe1FmfU8PxqcOfvnCr8L2pGRskZktBrXxo8t@mail.gmail.com
Lists: pgsql-hackers

El 06/03/2011 03:26, "Simon Riggs" <simon(at)2ndquadrant(dot)com> escribió:
>
> On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote:

> > If unfortunately all connection slots are used by backends waiting for
> > replication, we cannot execute such a function. So it makes more sense
> > to introduce something like "pg_ctl standalone" command?
>
> Well, there is one way to end the wait: shutdown, or use
> pg_terminate_backend().
>

I disconnected all standbys so the master keeps waiting on commit. Then I
shut down the master with immediate mode and got a crash. I wasn't able to
trace that, though.

--
Jaime Casanova www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-06 19:54:28
Message-ID: 72387015-6E0E-4271-8BBE-8B22DA23E12D@gmail.com
Lists: pgsql-hackers

On Mar 6, 2011, at 9:44 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
>> On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>
>>> If unfortunately all connection slots are used by backends waiting for
>>> replication, we cannot execute such a function. So it makes more sense
>>> to introduce something like "pg_ctl standalone" command?
>>
>> If it is only for shutdown, maybe pg_ctl stop -m standalone?
>
> It's for not only shutdown but also running the primary in standalone mode.
> So something like "pg_ctl standalone" is better.
>
> For now I think that pg_ctl command is better than built-in function because
> sometimes we might want to wake waiters up even during shutdown in
> order to cause shutdown to end. During shutdown, the server doesn't
> accept any new connection (even from the standby). So, without something
> like "pg_ctl standalone", there is no way to cause shutdown to end.

This sounds like an awful hack to work around a bad design. Surely once shutdown reaches a point where new replication connections can no longer be accepted, any standbys hung on commit need to close the connection without responding to the COMMIT, per previous discussion. It's completely unreasonable for sync rep to break the shutdown sequence.

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-06 21:33:38
Message-ID: 1299447218.1696.5536.camel@ebony
Lists: pgsql-hackers

On Sun, 2011-03-06 at 16:51 +0900, Fujii Masao wrote:

> One comment; what about introducing built-in function to wake up all the
> waiting backends? When replication connection is closed, if we STONITH
> the standby, we can safely (for not physical data loss but logical one)
> switch the primary to standalone mode. But there is no way to wake up
> the waiting backends for now. Setting synchronous_replication to OFF
> and reloading the configuration file doesn't affect the existing waiting
> backends. The attached patch introduces the "pg_wakeup_all_waiters"
> (better name?) function which wakes up all the backends on the queue.

Will apply this as a separate commit.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-07 00:37:04
Message-ID: 1299458224.1696.6355.camel@ebony
Lists: pgsql-hackers

On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote:

> I also got my first > 1000 tps score

The committed version should be even faster. Would appreciate a retest.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-07 13:20:08
Message-ID: 4D74DB88.9040101@gmail.com
Lists: pgsql-hackers

On 2011-03-07 01:37, Simon Riggs wrote:
> On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote:
>
>> I also got my first > 1000 tps score
> The committed version should be even faster. Would appreciate a retest.
>
pgbench 5 minute test pgbench -c 10 -M prepared -T 300 test
dbsize was -s 50, 1Gbit Ethernet

1 async standby
tps = 2475.285931 (excluding connections establishing)

2 async standbys
tps = 2333.670561 (excluding connections establishing)

1 sync standby
tps = 1277.082753 (excluding connections establishing)

1 sync, 1 async standby
tps = 1273.317386 (excluding connections establishing)

Hard for me not to resort to superlatives right now! :-)

regards,
Yeb Havinga


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-07 13:33:38
Message-ID: 1299504818.1696.8783.camel@ebony
Lists: pgsql-hackers

On Mon, 2011-03-07 at 14:20 +0100, Yeb Havinga wrote:
> On 2011-03-07 01:37, Simon Riggs wrote:
> > On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote:
> >
> >> I also got a first first> 1000 tps score
> > The committed version should be even faster. Would appreciate a retest.
> >
> pgbench 5 minute test pgbench -c 10 -M prepared -T 300 test
> dbsize was -s 50, 1Gbit Ethernet
>
> 1 async standby
> tps = 2475.285931 (excluding connections establishing)
>
> 2 async standbys
> tps = 2333.670561 (excluding connections establishing)
>
> 1 sync standby
> tps = 1277.082753 (excluding connections establishing)
>
> 1 sync, 1 async standby
> tps = 1273.317386 (excluding connections establishing)
>
> Hard for me to not revert to superlatives right now! :-)

That looks like good news, thanks.

It shows that sync rep is "fairly fast", but it also shows clearly why
you'd want to mix sync and async replication within an application.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-08 12:05:08
Message-ID: AANLkTiko6-COABo+oVnRJ+t6Vh99FvYAM3Seu30=tnef@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 7, 2011 at 4:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mar 6, 2011, at 9:44 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
>>> On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>>
>>>> If unfortunately all connection slots are used by backends waiting for
>>>> replication, we cannot execute such a function. So it makes more sense
>>>> to introduce something like "pg_ctl standalone" command?
>>>
>>> If it is only for shutdown, maybe pg_ctl stop -m standalone?
>>
>> It's for not only shutdown but also running the primary in standalone mode.
>> So something like "pg_ctl standalone" is better.
>>
>> For now I think that pg_ctl command is better than built-in function because
>> sometimes we might want to wake waiters up even during shutdown in
>> order to cause shutdown to end. During shutdown, the server doesn't
>> accept any new connection (even from the standby). So, without something
>> like "pg_ctl standalone", there is no way to cause shutdown to end.
>
> This sounds like an awful hack to work around a bad design. Surely once shutdown reaches a point where new replication connections can no longer be accepted, any standbys hung on commit need to close the connection without responding to the COMMIT, per previous discussion.  It's completely unreasonable for sync rep to break the shutdown sequence.

Yeah, let's think about how shutdown should work. I'd like to propose the
following. Thoughts?

* Smart shutdown
Smart shutdown should wait for all the waiting backends to be acked, and
should not force them to exit. But this could cause shutdown to hang
indefinitely if no walsender is running at the time. To allow those
backends to be acked even in that situation, we need to change the
postmaster so that it accepts replication connections even during smart
shutdown (until we reach the PM_SHUTDOWN_2 state). The postmaster already
accepts superuser connections to cancel a backup during smart shutdown,
so I don't think accepting replication connections during smart shutdown
is especially ugly.

* Fast shutdown
I agree with you about fast shutdown. Fast shutdown should cause all
backends, including waiting ones, to exit immediately. In that case, a
backend that has not been acked must not report success, per the
definition of sync rep. So we need to change the backend so that, on
receiving SIGTERM, it removes itself from the waiting queue and exits
before returning success. This change makes waiting backends do the same
when pg_terminate_backend is called; but since they have not been acked
yet, it seems reasonable to prevent them from reporting the COMMIT as
successful.

Comments? I'll create the patch barring objection.
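The fast-shutdown rule above can be sketched as a toy model. This is
illustrative Python only; the class and method names (SyncRepQueue,
handle_sigterm, and so on) are invented for the sketch, and the real
change would live in the backend's C code:

```python
# Toy model of the proposed fast-shutdown behavior for sync-rep waiters.
# All names are invented for illustration; this is not backend code.

class SyncRepQueue:
    def __init__(self):
        self.waiters = []

    def wait_for_ack(self, backend):
        self.waiters.append(backend)

    def ack_up_to(self, lsn):
        # Standby acked everything up to 'lsn': release those waiters.
        acked = [b for b in self.waiters if b.commit_lsn <= lsn]
        self.waiters = [b for b in self.waiters if b.commit_lsn > lsn]
        for b in acked:
            b.finish(success=True)

class Backend:
    def __init__(self, queue, commit_lsn):
        self.queue = queue
        self.commit_lsn = commit_lsn
        self.result = None          # None = still waiting for an ack
        queue.wait_for_ack(self)

    def finish(self, success):
        self.result = "COMMIT" if success else "TERMINATED"

    def handle_sigterm(self):
        # Fast shutdown / pg_terminate_backend: leave the queue and exit
        # WITHOUT reporting commit success, since no ack was received.
        if self in self.queue.waiters:
            self.queue.waiters.remove(self)
            self.finish(success=False)

queue = SyncRepQueue()
b1 = Backend(queue, commit_lsn=100)
b2 = Backend(queue, commit_lsn=200)
queue.ack_up_to(150)    # standby acked LSN 150: b1 released, b2 still waits
b2.handle_sigterm()     # fast shutdown arrives before b2's ack
print(b1.result, b2.result)   # COMMIT TERMINATED
```

With this shape a fast shutdown can drain the queue without any backend
ever claiming a commit that was never acked.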

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-08 16:58:35
Message-ID: AANLkTimAobST6Jq_axq1jS+Qd--WZ0u8ABTJwL5rhE4D@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Yeah, let's think about how shutdown should work. I'd like to propose the
> following. Thought?
>
> * Smart shutdown
> Smart shutdown should wait for all the waiting backends to be acked, and
> should not cause them to forcibly exit. But this leads shutdown to get stuck
> infinitely if there is no walsender at that time. To enable them to be acked
> even in that situation, we need to change postmaster so that it accepts the
> replication connection even during smart shutdown (until we reach
> PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
> connection to cancel backup during smart shutdown. So I don't think that
> the idea to accept the replication connection during smart shutdown is so
> ugly.
>
> * Fast shutdown
> I agree with you about fast shutdown. Fast shutdown should cause all the
> backends including waiting ones to exit immediately. At that time, the
> non-acked backend should not return the success, according to the
> definition of sync rep. So we need to change a backend so that it gets rid
> of itself from the waiting queue and exits before returning the success,
> when it receives SIGTERM. This change leads the waiting backends to
> do the same even when pg_terminate_backend is called. But since
> they've not been acked yet, it seems to be reasonable to prevent them
> from returning the COMMIT.

The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way? I don't really like the idea of allowing
replication connections for longer, and the idea that we don't want to
keep waiting for a commit ACK once we're past the point where it's
possible for one to occur seems to apply generically to any shutdown
sequence.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-09 05:14:01
Message-ID: AANLkTikZ1GxpYxz=0__reVPOCozoNkTQ6Vu1qt9xn+9Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> The fast shutdown handling seems fine, but why not just handle smart
> shutdown the same way?

Currently, smart shutdown means: no new connections; wait until existing
ones close normally. For consistency, it should behave the same way for
sync rep.

+1 for Fujii's proposal

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-10 02:21:25
Message-ID: 201103100221.p2A2LPL08182@momjian.us
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
>
> > postgres=# SELECT application_name, state, sync_priority, sync_state
> > FROM pg_stat_replication;
> >  application_name |   state   | sync_priority | sync_state
> > ------------------+-----------+---------------+------------
> >  one              | STREAMING |             1 | POTENTIAL
> >  two              | STREAMING |             2 | SYNC
> > (2 rows)
>
> Bug! Thanks.

Is there a reason these statuses are all upper-case?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-10 08:03:51
Message-ID: 1299744231.1966.11865.camel@ebony
Lists: pgsql-hackers

On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
> >
> > > postgres=# SELECT application_name, state, sync_priority, sync_state
> > > FROM pg_stat_replication;
> > >  application_name |   state   | sync_priority | sync_state
> > > ------------------+-----------+---------------+------------
> > >  one              | STREAMING |             1 | POTENTIAL
> > >  two              | STREAMING |             2 | SYNC
> > > (2 rows)
> >
> > Bug! Thanks.
>
> Is there a reason these status are all upper-case?

NOT AS FAR AS I KNOW.

I'll add it to the list of changes for beta.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-10 15:33:04
Message-ID: AANLkTimCQRM38Bzvjzm9FVT=0DOXh7yy6TizAyRATPFY@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 9, 2011 at 9:21 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Simon Riggs wrote:
>> On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
>>
>> > postgres=# SELECT application_name, state, sync_priority, sync_state
>> > FROM pg_stat_replication;
>> >  application_name |   state   | sync_priority | sync_state
>> > ------------------+-----------+---------------+------------
>> >  one              | STREAMING |             1 | POTENTIAL
>> >  two              | STREAMING |             2 | SYNC
>> > (2 rows)
>>
>> Bug! Thanks.
>
> Is there a reason these status are all upper-case?

Not that I know of.

However, I think that some more fundamental rethinking of the "state"
mechanism may be in order. When Magnus first committed this, it would
say CATCHUP whenever you were behind (even if only momentarily) and
STREAMING if you were caught up. Simon then changed it so that it
says CATCHUP until you catch up the first time, and then STREAMING
afterward (even if you fall behind again). Neither behavior seems
completely adequate to me. I think we should have a way to know
whether we've ever been caught up, and if so when the most recent time
was. So you could then say things like "is the most recent time at
which the standby was caught up within the last 30 seconds?", which
would be a useful thing to monitor, and right now there's no way to do
it. There's also a BACKUP state, but I'm not sure it makes sense to
lump that in with the others. Some day it might be possible to stream
WAL and take a backup at the same time, over the same connection.
Maybe that should be a separate column or something.
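The "caught up recently?" check proposed above is easy to express once a
most-recent-caught-up timestamp is exposed. A minimal sketch, assuming a
hypothetical last_caught_up_ts value (no such field existed at the time):

```python
# Sketch of the monitoring predicate proposed above. last_caught_up_ts is
# hypothetical; no such field existed in pg_stat_replication at the time.

def recently_caught_up(now, last_caught_up_ts, threshold_s=30):
    # True if the standby has been fully caught up within threshold_s seconds.
    if last_caught_up_ts is None:          # never been caught up
        return False
    return (now - last_caught_up_ts) <= threshold_s

print(recently_caught_up(now=1000.0, last_caught_up_ts=985.0))  # True, 15 s ago
print(recently_caught_up(now=1000.0, last_caught_up_ts=950.0))  # False, 50 s ago
```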

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-10 19:42:10
Message-ID: m2r5aezrsd.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> was. So you could then say things like "is the most recent time at
> which the standby was caught up within the last 30 seconds?", which
> would be a useful thing to monitor, and right now there's no way to do

Well in my experience with replication, that's not what I want to
monitor. If the standby is synchronous, then it's not catching up, it's
streaming. If it were not, it would not be a synchronous standby.

When a standby is asynchronous then what I want to monitor is its lag.

So the CATCHUP state is useful for seeing that a synchronous standby
candidate cannot yet be a synchronous standby. When it has just lost its
synchronous status (and hopefully another standby is now the sync one),
it's simply asynchronous again, and I want to know its lag.

> it. There's also a BACKUP state, but I'm not sure it makes sense to
> lump that in with the others. Some day it might be possible to stream
> WAL and take a backup at the same time, over the same connection.
> Maybe that should be a separate column or something.

BACKUP is still meaningful if you stream WAL at the same time, because
you're certainly *not* applying it while doing the base backup, are you?
So you're not yet a standby; that's what BACKUP means.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-10 19:56:44
Message-ID: AANLkTimw-urqbaViOT-kK95_6O=vzQ8MvdYpSnnnyDAa@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 10, 2011 at 2:42 PM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> was.  So you could then say things like "is the most recent time at
>> which the standby was caught up within the last 30 seconds?", which
>> would be a useful thing to monitor, and right now there's no way to do
>
> Well in my experience with replication, that's not what I want to
> monitor.  If the standby is synchronous, then it's not catching up, it's
> streaming.  If it were not, it would not be a synchronous standby.
>
> When a standby is asynchronous then what I want to monitor is its lag.
>
> So the CATCHUP state is useful to see that a synchronous standby
> candidate can not yet be a synchronous standby.  When it just lost its
> synchronous status (and hopefully another standby is now the sync one),
> then it's just asynchronous and I want to know its lag.

Yeah, maybe. The trick is how to measure the lag. I proposed the
above scheme mostly as a way of giving the user some piece of
information that can be measured in seconds rather than WAL position,
but I'm open to better ideas. Monitoring is pretty hard to do at all
in 9.0; in 9.1, we'll be able to tell them how many *bytes* behind
they are, but there's no easy way to figure out what that means in
terms of wall-clock time, which I think would be useful.
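The bytes-behind figure comes from subtracting two WAL positions. A
minimal sketch of that arithmetic on the textual "hi/lo" form, assuming
the hi<<32 | lo interpretation that later became the pg_lsn type (the
9.1-era on-disk layout had quirks this glosses over):

```python
# Byte distance between two textual WAL locations ("hi/lo", both hex).
# Assumes the hi<<32 | lo interpretation; illustrative only. In later
# PostgreSQL releases you would just subtract two pg_lsn values in SQL.

def lsn_to_bytes(lsn: str) -> int:
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def byte_lag(master_lsn: str, standby_lsn: str) -> int:
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(standby_lsn)

print(byte_lag("1/2000000", "1/1000000"))  # 16777216, i.e. 16 MiB of WAL
```

In that era the textual positions came from functions such as
pg_current_xlog_location() on the master and
pg_last_xlog_replay_location() on the standby.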

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-10 20:29:10
Message-ID: m2bp1izpm1.fsf@2ndQuadrant.fr
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> they are, but there's no easy way to figure out what that means in
> terms of wall-clock time, which I think would be useful.

Jan Wieck had a detailed proposal to make that happen at the last developer
meeting, but it seems he ran out of time to implement it for 9.1. The
idea was basically to have a ticker in core, an SRF that would associate
txid_snapshot with wall clock time. Lots of good things would come from
that.

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php

Of course if you think that's important enough for you to implement it
between now and beta, that would be great :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-10 20:50:48
Message-ID: AANLkTikw4GQ6hGJqfrK91=vRHAppF653Ce=EvmxL+75S@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> they are, but there's no easy way to figure out what that means in
>> terms of wall-clock time, which I think would be useful.
>
> Jan Wieck had a detailed proposal to make that happen at the last developer
> meeting, but then ran out of time to implement it for 9.1 it seems.  The
> idea was basically to have a ticker in core, an SRF that would associate
> txid_snapshot with wall clock time.  Lots of good things would come from
> that.
>
>  http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
>
> Of course if you think that's important enough for you to implement it
> between now and beta, that would be great :)

I think that's actually something a little different, and more
complicated, but I do think it'd be useful. I was hoping there was a
simple way to get some kind of time-based information into
pg_stat_replication, but if there isn't, there isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-11 12:08:09
Message-ID: AANLkTikR052OqNYVTTUQuRMR_14575wJVzWCJiLB+NUY@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 11, 2011 at 5:50 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine
> <dimitri(at)2ndquadrant(dot)fr> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> they are, but there's no easy way to figure out what that means in
>>> terms of wall-clock time, which I think would be useful.
>>
>> Jan Wieck had a detailed proposal to make that happen at the last developer
>> meeting, but then ran out of time to implement it for 9.1 it seems.  The
>> idea was basically to have a ticker in core, an SRF that would associate
>> txid_snapshot with wall clock time.  Lots of good things would come from
>> that.
>>
>>  http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
>>
>> Of course if you think that's important enough for you to implement it
>> between now and beta, that would be great :)
>
> I think that's actually something a little different, and more
> complicated, but I do think it'd be useful.  I was hoping there was a
> simple way to get some kind of time-based information into
> pg_stat_replication, but if there isn't, there isn't.

How about sending the timestamp of the last applied transaction
(i.e., the return value of pg_last_xact_replay_timestamp) from the
standby to the master, and reporting it in pg_stat_replication?
Then you can see the lag by comparing it with current_timestamp.

But since the last replay timestamp doesn't advance when there is no
work on the master (while the current timestamp keeps advancing), the
calculated lag might be unexpectedly large. So, to calculate the exact
lag, I'm thinking that we should introduce a new function which returns
the timestamp of the last transaction written on the master.

Thoughts?
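Numerically, the overestimate and the proposed fix look like this (a
sketch; the timestamps are plain numbers in seconds and the function
names are invented):

```python
# Illustration of the two lag estimates discussed above. Timestamps are
# plain seconds; all names here are invented for the sketch.

def naive_lag(now, last_replay_ts):
    # current_timestamp minus last replay timestamp: overstates the lag
    # when the master is idle, because replay has nothing new to apply.
    return now - last_replay_ts

def exact_lag(last_write_ts, last_replay_ts):
    # Compare the master's last-written commit timestamp with the
    # standby's last-replayed one instead.
    return last_write_ts - last_replay_ts

# Master idle for 60 s: the last commit was written and replayed at t=100.
now, last_write_ts, last_replay_ts = 160.0, 100.0, 100.0
print(naive_lag(now, last_replay_ts))            # 60.0, misleadingly large
print(exact_lag(last_write_ts, last_replay_ts))  # 0.0, no real lag
```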

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-11 13:02:25
Message-ID: AANLkTi=9itH+WDUxB-RemrcW3L3QiMFkWkk3d8t_eR+a@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 11, 2011 at 7:08 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Mar 11, 2011 at 5:50 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine
>> <dimitri(at)2ndquadrant(dot)fr> wrote:
>>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>>> they are, but there's no easy way to figure out what that means in
>>>> terms of wall-clock time, which I think would be useful.
>>>
>>> Jan Wieck had a detailed proposal to make that happen at the last developer
>>> meeting, but then ran out of time to implement it for 9.1 it seems.  The
>>> idea was basically to have a ticker in core, an SRF that would associate
>>> txid_snapshot with wall clock time.  Lots of good things would come from
>>> that.
>>>
>>>  http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
>>>
>>> Of course if you think that's important enough for you to implement it
>>> between now and beta, that would be great :)
>>
>> I think that's actually something a little different, and more
>> complicated, but I do think it'd be useful.  I was hoping there was a
>> simple way to get some kind of time-based information into
>> pg_stat_replication, but if there isn't, there isn't.
>
> How about sending the timestamp of last applied transaction
> (i.e., this is the return value of pg_last_xact_replay_timestamp)
> from the standby to the master, and reporting it in
> pg_stat_replication? Then you can see the lag by comparing
> it with current_timestamp.
>
> But since the last replay timestamp doesn't advance (but
> current timestamp advances) if there is no work on the master,
> the calculated lag might be unexpectedly too large. So, to
> calculate the exact lag, I'm thinking that we should introduce
> new function which returns the timestamp of the last transaction
> written in the master.
>
> Thought?

Hmm... where would we get that value from? And what if no
transactions are running on the master?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-11 13:21:52
Message-ID: AANLkTi=WTHOvL2PtS=2U_eW4GxoM82VDOBWk-255fxQk@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 11, 2011 at 10:02 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> How about sending the timestamp of last applied transaction
>> (i.e., this is the return value of pg_last_xact_replay_timestamp)
>> from the standby to the master, and reporting it in
>> pg_stat_replication? Then you can see the lag by comparing
>> it with current_timestamp.
>>
>> But since the last replay timestamp doesn't advance (but
>> current timestamp advances) if there is no work on the master,
>> the calculated lag might be unexpectedly too large. So, to
>> calculate the exact lag, I'm thinking that we should introduce
>> new function which returns the timestamp of the last transaction
>> written in the master.
>>
>> Thought?
>
> Hmm... where would we get that value from?

xl_xact_commit->xact_time (which is set in RecordTransactionCommit)
and xl_xact_abort->xact_time (which is set in RecordTransactionAbort).

> And what if no
> transactions are running on the master?

In that case, the last write WAL timestamp would become equal to the
last replay WAL timestamp. So we can see that there is no lag.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-11 14:03:33
Message-ID: AANLkTim+5k4YYkhnOGDvuyAAo45mbPFVAzRyRWnBR6PJ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 11, 2011 at 8:21 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Mar 11, 2011 at 10:02 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> How about sending the timestamp of last applied transaction
>>> (i.e., this is the return value of pg_last_xact_replay_timestamp)
>>> from the standby to the master, and reporting it in
>>> pg_stat_replication? Then you can see the lag by comparing
>>> it with current_timestamp.
>>>
>>> But since the last replay timestamp doesn't advance (but
>>> current timestamp advances) if there is no work on the master,
>>> the calculated lag might be unexpectedly too large. So, to
>>> calculate the exact lag, I'm thinking that we should introduce
>>> new function which returns the timestamp of the last transaction
>>> written in the master.
>>>
>>> Thought?
>>
>> Hmm... where would we get that value from?
>
> xl_xact_commit->xact_time (which is set in RecordTransactionCommit)
> and xl_xact_abort->xact_time (which is set in RecordTransactionAbort).
>
>> And what if no
>> transactions are running on the master?
>
> In that case, the last write WAL timestamp would become equal to the
> last replay WAL timestamp. So we can see that there is no lag.

Oh, I see (I think). You're talking about write/replay lag, but I was
thinking of master/slave transmission lag.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Ross J(dot) Reedstrom" <reedstrm(at)rice(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-03-11 19:42:59
Message-ID: 20110311194258.GB12106@rice.edu
Lists: pgsql-hackers

On Fri, Mar 11, 2011 at 09:03:33AM -0500, Robert Haas wrote:
> On Fri, Mar 11, 2011 at 8:21 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >
> > In that case, the last write WAL timestamp would become equal to the
> > last replay WAL timestamp. So we can see that there is no lag.
>
> Oh, I see (I think). You're talking about write/replay lag, but I was
> thinking of master/slave transmission lag.
>

Both are useful numbers to know: the first tells you how "stale"
queries on a Hot Standby will be; the second tells you the maximum
data loss in a "meteor hits the master" scenario where that slave is
promoted, if I understand all the interactions correctly.

Ross
--
Ross Reedstrom, Ph.D. reedstrm(at)rice(dot)edu
Systems Engineer & Admin, Research Scientist phone: 713-348-6166
Connexions http://cnx.org fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-19 02:25:10
Message-ID: AANLkTimyxZJxvjS3d6T9yRj=u3k9RFO+Hzrvj207yorS@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> * Smart shutdown
> Smart shutdown should wait for all the waiting backends to be acked, and
> should not cause them to forcibly exit. But this leads shutdown to get stuck
> infinitely if there is no walsender at that time. To enable them to be acked
> even in that situation, we need to change postmaster so that it accepts the
> replication connection even during smart shutdown (until we reach
> PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
> connection to cancel backup during smart shutdown. So I don't think that
> the idea to accept the replication connection during smart shutdown is so
> ugly.
>
> * Fast shutdown
> I agree with you about fast shutdown. Fast shutdown should cause all the
> backends including waiting ones to exit immediately. At that time, the
> non-acked backend should not return the success, according to the
> definition of sync rep. So we need to change a backend so that it gets rid
> of itself from the waiting queue and exits before returning the success,
> when it receives SIGTERM. This change leads the waiting backends to
> do the same even when pg_terminate_backend is called. But since
> they've not been acked yet, it seems to be reasonable to prevent them
> from returning the COMMIT.
>
> Comments? I'll create the patch barring objection.

The fast smart shutdown part of this problem has been addressed. The
smart shutdown case still needs work, and I think the consensus was
that your proposal above was the best way to go with it.

Do you still want to work up a patch for this? If so, I can review.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-19 02:28:33
Message-ID: AANLkTik9AAR+T5Y_Qeaop=c5XC4jmTTNm_AQnbEOpCWi@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> * Smart shutdown
>> Smart shutdown should wait for all the waiting backends to be acked, and
>> should not cause them to forcibly exit. But this leads shutdown to get stuck
>> infinitely if there is no walsender at that time. To enable them to be acked
>> even in that situation, we need to change postmaster so that it accepts the
>> replication connection even during smart shutdown (until we reach
>> PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
>> connection to cancel backup during smart shutdown. So I don't think that
>> the idea to accept the replication connection during smart shutdown is so
>> ugly.
>>
>> * Fast shutdown
>> I agree with you about fast shutdown. Fast shutdown should cause all the
>> backends including waiting ones to exit immediately. At that time, the
>> non-acked backend should not return the success, according to the
>> definition of sync rep. So we need to change a backend so that it gets rid
>> of itself from the waiting queue and exits before returning the success,
>> when it receives SIGTERM. This change leads the waiting backends to
>> do the same even when pg_terminate_backend is called. But since
>> they've not been acked yet, it seems to be reasonable to prevent them
>> from returning the COMMIT.
>>
>> Comments? I'll create the patch barring objection.
>
> The fast smart shutdown part of this problem has been addressed.  The

Ugh. I mean "the fast shutdown", of course, not "the fast smart
shutdown". Anyway, point is:

fast shutdown now OK
smart shutdown still not OK
do you want to write a patch?

:-)

> smart shutdown case still needs work, and I think the consensus was
> that your proposal above was the best way to go with it.
>
> Do you still want to work up a patch for this?  If so, I can review.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-23 08:53:14
Message-ID: AANLkTi=4nF9w_UnDerqN66HCThK+av4we=sy7yP13--K@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 19, 2011 at 11:28 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> * Smart shutdown
>>> Smart shutdown should wait for all the waiting backends to be acked, and
>>> should not force them to exit. But this can leave shutdown stuck
>>> indefinitely if no walsender is running at the time. To allow those
>>> backends to be acked even in that situation, we need to change the
>>> postmaster so that it accepts replication connections even during smart
>>> shutdown (until we reach the PM_SHUTDOWN_2 state). The postmaster already
>>> accepts superuser connections to cancel a backup during smart shutdown,
>>> so I don't think the idea of accepting replication connections during
>>> smart shutdown is so ugly.
>>>
>>> * Fast shutdown
>>> I agree with you about fast shutdown. Fast shutdown should cause all the
>>> backends, including waiting ones, to exit immediately. At that point, a
>>> backend that has not been acked must not report success, per the
>>> definition of sync rep. So we need to change the backend so that, on
>>> receiving SIGTERM, it removes itself from the wait queue and exits
>>> without reporting success. This change makes waiting backends do the
>>> same when pg_terminate_backend is called, but since they have not been
>>> acked yet, it seems reasonable to prevent them from reporting the COMMIT
>>> as successful.
>>>
>>> Comments? I'll create the patch barring objection.
>>
>> The fast smart shutdown part of this problem has been addressed.  The
>
> Ugh.  I mean "the fast shutdown", of course, not "the fast smart
> shutdown".  Anyway, point is:
>
> fast shutdown now OK
> smart shutdown still not OK
> do you want to write a patch?
>
> :-)
>
>> smart shutdown case still needs work, and I think the consensus was
>> that your proposal above was the best way to go with it.
>>
>> Do you still want to work up a patch for this?  If so, I can review.

Sure. Will do.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-24 11:17:30
Message-ID: AANLkTino3S6AGN1oisWozuUT_CRvONodbL9GsWpr7dYB@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> Do you still want to work up a patch for this?  If so, I can review.
>
> Sure. Will do.

The attached patch allows standby servers to connect during smart shutdown
in order to wake up backends waiting for sync rep.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
allow_standby_to_connect_during_smart_shutdown_v1.patch application/octet-stream 10.4 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-24 11:34:40
Message-ID: AANLkTimmfafic5Qj=BuAz9wEGbZ8tpY-j0MPf=mfEpbH@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> Do you still want to work up a patch for this?  If so, I can review.
>>
>> Sure. Will do.
>
> The attached patch allows standby servers to connect during smart shutdown
> in order to wake up backends waiting for sync rep.

I think that is possibly OK, but the big problem is that we lack a
clear set of comments explaining how the states are supposed to
interact, which would allow changes like this to be validated.

That situation isn't down to you, but I think we need it made clear so
this change is more obviously correct than it is now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-24 11:53:42
Message-ID: AANLkTinwzV6UCiWbZzWx7vvx0aEATAEDTAxbtcEKR76p@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 8:34 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>>> Do you still want to work up a patch for this?  If so, I can review.
>>>
>>> Sure. Will do.
>>
>> The attached patch allows standby servers to connect during smart shutdown
>> in order to wake up backends waiting for sync rep.
>
> I think that is possibly OK, but the big problem is that we lack a
> clear set of comments explaining how the states are supposed to
> interact, which would allow changes like this to be validated.
>
> That situation isn't down to you, but I think we need it made clear so
> this change is more obviously correct than it is now.

src/backend/replication/README needs to be updated to make that clear?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Yeb Havinga <yebhavinga(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Sync Rep v19
Date: 2011-03-28 19:03:45
Message-ID: AANLkTinXZoVH_vMtow6A+XQNi7MGMs=_BT96ZBcqq8pt@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 11:53 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Thu, Mar 24, 2011 at 8:34 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>>>> Do you still want to work up a patch for this?  If so, I can review.
>>>>
>>>> Sure. Will do.
>>>
>>> The attached patch allows standby servers to connect during smart shutdown
>>> in order to wake up backends waiting for sync rep.
>>
>> I think that is possibly OK, but the big problem is that we lack a
>> clear set of comments explaining how the states are supposed to
>> interact, which would allow changes like this to be validated.
>>
>> That situation isn't down to you, but I think we need it made clear so
>> this change is more obviously correct than it is now.
>
> src/backend/replication/README needs to be updated to make that clear?

Not sure where we'd put them.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-04-26 15:40:25
Message-ID: 201104261540.p3QFePO16549@momjian.us
Lists: pgsql-hackers

Simon Riggs wrote:
> On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote:
> > Simon Riggs wrote:
> > > On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
> > >
> > > > postgres=# SELECT application_name, state, sync_priority, sync_state
> > > > FROM pg_stat_replication;
> > > >  application_name |   state   | sync_priority | sync_state
> > > > ------------------+-----------+---------------+------------
> > > >  one              | streaming |             1 | potential
> > > >  two              | streaming |             2 | sync
> > > > (2 rows)
> > >
> > > Bug! Thanks.
> >
> > Is there a reason these statuses are all upper-case?
>
> NOT AS FAR AS I KNOW.
>
> I'll add it to the list of changes for beta.

The attached patch lowercases the labels displayed in the view above.
(The example above was originally all upper-case.)

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Attachment Content-Type Size
/rtmp/label text/x-diff 1.6 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep v19
Date: 2011-04-30 02:29:01
Message-ID: 201104300229.p3U2T1Y08653@momjian.us
Lists: pgsql-hackers

Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote:
> > > Simon Riggs wrote:
> > > > On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
> > > >
> > > > > postgres=# SELECT application_name, state, sync_priority, sync_state
> > > > > FROM pg_stat_replication;
> > > > >  application_name |   state   | sync_priority | sync_state
> > > > > ------------------+-----------+---------------+------------
> > > > >  one              | streaming |             1 | potential
> > > > >  two              | streaming |             2 | sync
> > > > > (2 rows)
> > > >
> > > > Bug! Thanks.
> > >
> > > Is there a reason these statuses are all upper-case?
> >
> > NOT AS FAR AS I KNOW.
> >
> > I'll add it to the list of changes for beta.
>
> The attached patch lowercases the labels displayed in the view above.
> (The example above was originally all upper-case.)

Applied.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +