Quick Links

Re: Keepalive for max_standby_delay

Lists:	pgsql-hackers

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Keepalive for max_standby_delay
Date:	2010-05-15 13:17:10
Message-ID:	1273929430.308.8960.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Patch adds a keepalive message to ensure max_standby_delay is useful.

No WAL format changes, no libpq changes. Just an additional message type
for the streaming replication protocol, sent once per main loop in
WALsender. Plus docs.

Comments?

--
Simon Riggs www.2ndQuadrant.com

Attachment	Content-Type	Size
hs_keepalive.patch	text/x-patch	5.1 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 15:45:52
Message-ID:	25768.1273938352@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Patch adds a keepalive message to ensure max_standby_delay is useful.

The proposed placement of the keepalive-send is about the worst it could
possibly be. It needs to be done right before pq_flush to ensure
minimum transfer delay. Otherwise any attempt to measure clock skew
using the timestamp will be seriously off. Where you've placed it
guarantees maximum delay not minimum.

I'm also extremely dubious that it's a good idea to set
recoveryLastXTime from this. Using both that and the timestamps from
the wal log flies in the face of everything I remember about control
theory. We should be doing only one or only the other, I think.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 16:23:58
Message-ID:	1273940638.308.9011.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > Patch adds a keepalive message to ensure max_standby_delay is useful.
>
> The proposed placement of the keepalive-send is about the worst it could
> possibly be. It needs to be done right before pq_flush to ensure
> minimum transfer delay. Otherwise any attempt to measure clock skew
> using the timestamp will be seriously off. Where you've placed it
> guarantees maximum delay not minimum.

I don't understand. WalSndKeepAlive() contains a pq_flush() immediately
after the timestamp is set. I did that way for exactly the same reasons
you've said.

Do you mean you only want to see one pq_flush()? I used two so that the
actual data is never delayed by a keepalive. WAL Sender was going to
sleep anyway, so shouldn't be a problem.

> I'm also extremely dubious that it's a good idea to set
> recoveryLastXTime from this. Using both that and the timestamps from
> the wal log flies in the face of everything I remember about control
> theory. We should be doing only one or only the other, I think.

I can change it so that the recoveryLastXTime will not be updated if we
are using the value from the keepalives. So we have one, or the other.
Remember that replication can switch backwards and forwards between
modes, so it seems sensible to have a common timing value whichever mode
we're in.

--
Simon Riggs www.2ndQuadrant.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 16:30:38
Message-ID:	4BEECC2E.5080707@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
>> I'm also extremely dubious that it's a good idea to set
>> recoveryLastXTime from this. Using both that and the timestamps from
>> the wal log flies in the face of everything I remember about control
>> theory. We should be doing only one or only the other, I think.
>
> I can change it so that the recoveryLastXTime will not be updated if we
> are using the value from the keepalives. So we have one, or the other.
> Remember that replication can switch backwards and forwards between
> modes, so it seems sensible to have a common timing value whichever mode
> we're in.

That means that recoveryLastXTime can jump forwards and backwards.
Doesn't feel right to me either. If you want to expose the
keepalive-time to queries, it should be a separate field, something like
lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
read it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 16:49:07
Message-ID:	1273942147.308.9014.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sat, 2010-05-15 at 11:45 -0400, Tom Lane wrote:
> >> I'm also extremely dubious that it's a good idea to set
> >> recoveryLastXTime from this. Using both that and the timestamps from
> >> the wal log flies in the face of everything I remember about control
> >> theory. We should be doing only one or only the other, I think.
> >
> > I can change it so that the recoveryLastXTime will not be updated if we
> > are using the value from the keepalives. So we have one, or the other.
> > Remember that replication can switch backwards and forwards between
> > modes, so it seems sensible to have a common timing value whichever mode
> > we're in.
>
> That means that recoveryLastXTime can jump forwards and backwards.

That behaviour would be bad, so why not just prevent that from
happening?

> Doesn't feel right to me either. If you want to expose the
> keepalive-time to queries, it should be a separate field, something like
> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
> read it.

That wouldn't be good because then you couldn't easily monitor the
delay? You'd have to run two different functions depending on the state
of replication (for which we would need yet another function). Users
would just wrap that back up into a single function.

--
Simon Riggs www.2ndQuadrant.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 17:05:41
Message-ID:	4BEED465.5050200@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
>> Doesn't feel right to me either. If you want to expose the
>> keepalive-time to queries, it should be a separate field, something like
>> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
>> read it.
>
> That wouldn't be good because then you couldn't easily monitor the
> delay? You'd have to run two different functions depending on the state
> of replication (for which we would need yet another function). Users
> would just wrap that back up into a single function.

What exactly is the user trying to monitor? If it's "how far behind is
the standby", the difference between pg_current_xlog_insert_location()
in the master and pg_last_xlog_replay_location() in the standby seems
more robust and well-defined to me. It's a measure of XLOG location (ie.
bytes) instead of time, but time is a complicated concept.

Also note that as the patch stands, if you receive a keep-alive from the
master at point X, it doesn't mean that the standby is fully up-to-date.
It's possible that the walsender just finished sending a huge batch of
accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent.
During that time, a lot more WAL has accumulated, yet walsender sends a
keep-alive with the current timestamp.

In general, the purpose of a keep-alive is to keep the connection alive,
but you're trying to accomplish something else too, and I don't fully
understand what it is.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 17:24:30
Message-ID:	1273944270.308.9032.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2010-05-15 at 20:05 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sat, 2010-05-15 at 19:30 +0300, Heikki Linnakangas wrote:
> >> Doesn't feel right to me either. If you want to expose the
> >> keepalive-time to queries, it should be a separate field, something like
> >> lastMasterKeepaliveTime and a pg_last_master_keepalive() function to
> >> read it.
> >
> > That wouldn't be good because then you couldn't easily monitor the
> > delay? You'd have to run two different functions depending on the state
> > of replication (for which we would need yet another function). Users
> > would just wrap that back up into a single function.
>
> What exactly is the user trying to monitor? If it's "how far behind is
> the standby", the difference between pg_current_xlog_insert_location()
> in the master and pg_last_xlog_replay_location() in the standby seems
> more robust and well-defined to me. It's a measure of XLOG location (ie.
> bytes) instead of time, but time is a complicated concept.

Maybe, but its meaningful to users and that is the point.

> Also note that as the patch stands, if you receive a keep-alive from the
> master at point X, it doesn't mean that the standby is fully up-to-date.
> It's possible that the walsender just finished sending a huge batch of
> accumulated WAL, say 1 GB, and it took 1 hour for the batch to be sent.
> During that time, a lot more WAL has accumulated, yet walsender sends a
> keep-alive with the current timestamp.

Not at all. The timestamp for the keepalive is calculated after the
pq_flush for the main WAL data. So it takes 10 years to send the next
blob of WAL data the timestamp will be current.

However, a point you made in an earlier thread is still true here. It
sounds like it would be best if startup process didn't re-access shared
memory for the next location until it had fully replayed all the WAL up
to the point it last accessed. That way the changing value of the shared
timestamp would have no effect on the calculated value at any point in
time. I will recode using that concept.

> In general, the purpose of a keep-alive is to keep the connection alive,
> but you're trying to accomplish something else too, and I don't fully
> understand what it is.

That surprises me. If nothing else, its in the title of the thread,
though since you personally added this to the Hot Standby todo more than
6 months ago I'd hope you of all people would understand the purpose.

--
Simon Riggs www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 18:50:55
Message-ID:	1273949455.308.9080.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:

> I will recode using that concept.

There's some behaviours that aren't helpful here:

Startup gets new pointer when it runs out of data to replay. That might
or might not include an updated keepalive timestamp, since there's no
exact relationship between chunks sent and chunks received. Startup
might ask for a new chunk when half a chunk has been received, or when
multiple chunks have been received.

WALSender doesn't chunk up what it sends, though libpq does at a lower
level. So there's no way to make a keepalive happen every X bytes
without doing this from within libpq.

WALSender sleeps even when it might have more WAL to send, it doesn't
check it just unconditionally sleeps. At least WALReceiver loops until
it has no more to receive. I just can't imagine why that's useful
behaviour.

All in all, I think we should be calling this "burst replication" not
streaming replication. That does cause an issue in trying to monitor
what's going on cos there's so much jitter.

--
Simon Riggs www.2ndQuadrant.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 20:08:20
Message-ID:	4BEEFF34.1020106@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> WALSender sleeps even when it might have more WAL to send, it doesn't
> check it just unconditionally sleeps. At least WALReceiver loops until
> it has no more to receive. I just can't imagine why that's useful
> behaviour.

Good catch. That should be fixed.

I also note that walsender doesn't respond to signals, while it's
sending a large batch. That's analogous to the issue that was addressed
recently in the archiver process.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-15 21:05:49
Message-ID:	4BEF0CAD.8050006@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> WALSender sleeps even when it might have more WAL to send, it doesn't
>> check it just unconditionally sleeps. At least WALReceiver loops until
>> it has no more to receive. I just can't imagine why that's useful
>> behaviour.
>
> Good catch. That should be fixed.
>
> I also note that walsender doesn't respond to signals, while it's
> sending a large batch. That's analogous to the issue that was addressed
> recently in the archiver process.

Attached patch rearranges the walsender loops slightly to fix the above.
XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
2) in one round and returns to the main loop after that even if there's
unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
That way the main loop gets to respond to signals quickly, and we also
get to update the shared memory status and PS display more often when
there's a lot of catching up to do.

Comments, have I screwed up anything?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
dont-sleep-if-wal-to-send.patch	text/x-diff	9.1 KB

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-16 15:53:32
Message-ID:	1274025212.308.10829.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2010-05-16 at 00:05 +0300, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > Simon Riggs wrote:
> >> WALSender sleeps even when it might have more WAL to send, it doesn't
> >> check it just unconditionally sleeps. At least WALReceiver loops until
> >> it has no more to receive. I just can't imagine why that's useful
> >> behaviour.
> >
> > Good catch. That should be fixed.
> >
> > I also note that walsender doesn't respond to signals, while it's
> > sending a large batch. That's analogous to the issue that was addressed
> > recently in the archiver process.
>
> Attached patch rearranges the walsender loops slightly to fix the above.
> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
> 2) in one round and returns to the main loop after that even if there's
> unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
> That way the main loop gets to respond to signals quickly, and we also
> get to update the shared memory status and PS display more often when
> there's a lot of catching up to do.
>
> Comments

8MB at a time still seems like a large batch to me.

libpq is going to send it in smaller chunks anyway, so I don't see the
importance of trying to keep the batch too large. It just introduces
delay into the sending process. We should be sending chunks that matches
libpq better.

--
Simon Riggs www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-16 16:11:35
Message-ID:	1274026295.308.10866.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote:
> On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:
>
> > I will recode using that concept.

> Startup gets new pointer when it runs out of data to replay. That might
> or might not include an updated keepalive timestamp, since there's no
> exact relationship between chunks sent and chunks received. Startup
> might ask for a new chunk when half a chunk has been received, or when
> multiple chunks have been received.

New version, with some other cleanup of wait processing.

New logic is that when Startup asks for next applychunk of WAL it saves
the lastChunkTimestamp. That is then the base time used by
WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
Since multiple receivechunks can arrive from primary before Startup asks
for next applychunk we use the oldest receivechunk timestamp, not the
latest. Doing it this way means the lastChunkTimestamp doesn't change
when new keepalives arrive, so we have a stable and reasonably accurate
recordSendTimestamp for each WAL record.

The keepalive is sent as the first part of a new message, if any. So
partial chunks of data always have an accurate timestamp, even if that
is slightly older as a result. Doesn't make much difference except with
very large chunks.

I think that addresses the points raised on this thread and others.

--
Simon Riggs www.2ndQuadrant.com

Attachment	Content-Type	Size
hs_keepalive.v2.patch	text/x-patch	14.1 KB

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-17 02:18:37
Message-ID:	AANLkTimzlXAYByCk7PHAEzBEMVkTFb2MRfbrCV2O2fwr@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, May 16, 2010 at 6:05 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> WALSender sleeps even when it might have more WAL to send, it doesn't
>>> check it just unconditionally sleeps. At least WALReceiver loops until
>>> it has no more to receive. I just can't imagine why that's useful
>>> behaviour.
>>
>> Good catch. That should be fixed.
>>
>> I also note that walsender doesn't respond to signals, while it's
>> sending a large batch. That's analogous to the issue that was addressed
>> recently in the archiver process.
>
> Attached patch rearranges the walsender loops slightly to fix the above.
> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
> 2) in one round and returns to the main loop after that even if there's
> unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
> That way the main loop gets to respond to signals quickly, and we also
> get to update the shared memory status and PS display more often when
> there's a lot of catching up to do.
>
> Comments

The main loop needs to respond to also the 'X' message from the standby
quickly?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-17 02:51:13
Message-ID:	AANLkTikf2lK65VtvqQGhqsmXhBwVYSU7R_5c8G2fQpb_@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, May 17, 2010 at 1:11 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, 2010-05-15 at 19:50 +0100, Simon Riggs wrote:
>> On Sat, 2010-05-15 at 18:24 +0100, Simon Riggs wrote:
>>
>> > I will recode using that concept.
>
>> Startup gets new pointer when it runs out of data to replay. That might
>> or might not include an updated keepalive timestamp, since there's no
>> exact relationship between chunks sent and chunks received. Startup
>> might ask for a new chunk when half a chunk has been received, or when
>> multiple chunks have been received.
>
> New version, with some other cleanup of wait processing.
>
> New logic is that when Startup asks for next applychunk of WAL it saves
> the lastChunkTimestamp. That is then the base time used by
> WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
> Since multiple receivechunks can arrive from primary before Startup asks
> for next applychunk we use the oldest receivechunk timestamp, not the
> latest. Doing it this way means the lastChunkTimestamp doesn't change
> when new keepalives arrive, so we have a stable and reasonably accurate
> recordSendTimestamp for each WAL record.
>
> The keepalive is sent as the first part of a new message, if any. So
> partial chunks of data always have an accurate timestamp, even if that
> is slightly older as a result. Doesn't make much difference except with
> very large chunks.
>
> I think that addresses the points raised on this thread and others.

Is it OK that this keepalive message cannot be used by HS in file-based
log-shipping? Even in SR, the startup process cannot use the keepalive
until walreceiver has been started up.

WalSndKeepAlive() always calls initStringInfo(), which seems to cause
memory-leak.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-17 07:07:04
Message-ID:	1274080024.28911.76.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2010-05-17 at 11:51 +0900, Fujii Masao wrote:

> Is it OK that this keepalive message cannot be used by HS in file-based
> log-shipping? Even in SR, the startup process cannot use the keepalive
> until walreceiver has been started up.

Yes, I see those things.

We already have archive_timeout to handle the keepalive case in
file-based.

When starting up the delay is high anyway, so doesn't really matter
about accuracy - though we do use latestXLogTime in that cases.

> WalSndKeepAlive() always calls initStringInfo(), which seems to cause
> memory-leak.

Thanks.

--
Simon Riggs www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-17 08:40:50
Message-ID:	1274085650.28911.201.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
> >
> > Attached patch rearranges the walsender loops slightly to fix the above.
> > XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
> > 2) in one round and returns to the main loop after that even if there's
> > unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
> > That way the main loop gets to respond to signals quickly, and we also
> > get to update the shared memory status and PS display more often when
> > there's a lot of catching up to do.
> >
> > Comments
>
> 8MB at a time still seems like a large batch to me.
>
> libpq is going to send it in smaller chunks anyway, so I don't see the
> importance of trying to keep the batch too large. It just introduces
> delay into the sending process. We should be sending chunks that matches
> libpq better.

More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE
because it will send partial pages.

Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently
stands.

--
Simon Riggs www.2ndQuadrant.com

From:	Jim Nasby <decibel(at)decibel(dot)org>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-17 16:36:07
Message-ID:	27D3E834-8064-4ECD-94A5-9C57C42D9F43@decibel.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
> What exactly is the user trying to monitor? If it's "how far behind is
> the standby", the difference between pg_current_xlog_insert_location()
> in the master and pg_last_xlog_replay_location() in the standby seems
> more robust and well-defined to me. It's a measure of XLOG location (ie.
> bytes) instead of time, but time is a complicated concept.

I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-18 21:06:26
Message-ID:	4BF30152.8040508@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 17/05/10 04:40, Simon Riggs wrote:
> On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
>>>
>>> Attached patch rearranges the walsender loops slightly to fix the above.
>>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
>>> 2) in one round and returns to the main loop after that even if there's
>>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
>>> That way the main loop gets to respond to signals quickly, and we also
>>> get to update the shared memory status and PS display more often when
>>> there's a lot of catching up to do.
>>>
>>> Comments
>>
>> 8MB at a time still seems like a large batch to me.
>>
>> libpq is going to send it in smaller chunks anyway, so I don't see the
>> importance of trying to keep the batch too large. It just introduces
>> delay into the sending process. We should be sending chunks that matches
>> libpq better.
>
> More to the point the logic will fail if XLOG_BLCKSZ> PQ_BUFFER_SIZE
> because it will send partial pages.

I don't see a failure. We rely on not splitting WAL records across
messages, but we're talking about libpq-level CopyData messages, not TCP
messages.

> Having MAX_SEND_SIZE> PQ_BUFFER_SIZE is pointless, as libpq currently
> stands.

Well, it does affect the size of the read() in walsender, and I'm sure
there's some overhead in setting the ps display and the other small
stuff we do once per message. But you're probably right that we could
easily make MAX_SEND_SIZE much smaller with no noticeable affect on
performance, while making walsender more responsive to signals. I'll
decrease it to, say, 512 kB.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Jim Nasby <decibel(at)decibel(dot)org>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-18 21:08:13
Message-ID:	4BF301BD.6060102@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 17/05/10 12:36, Jim Nasby wrote:
> On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
>> What exactly is the user trying to monitor? If it's "how far behind is
>> the standby", the difference between pg_current_xlog_insert_location()
>> in the master and pg_last_xlog_replay_location() in the standby seems
>> more robust and well-defined to me. It's a measure of XLOG location (ie.
>> bytes) instead of time, but time is a complicated concept.
>
> I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so.

No doubt about that, the problem is that it's hard to provide a reliable
time-based indication.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-18 21:17:46
Message-ID:	1274217466.28911.3901.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote:
> On 17/05/10 04:40, Simon Riggs wrote:
> > On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
> >>>
> >>> Attached patch rearranges the walsender loops slightly to fix the above.
> >>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
> >>> 2) in one round and returns to the main loop after that even if there's
> >>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
> >>> That way the main loop gets to respond to signals quickly, and we also
> >>> get to update the shared memory status and PS display more often when
> >>> there's a lot of catching up to do.
> >>>
> >>> Comments
> >>
> >> 8MB at a time still seems like a large batch to me.
> >>
> >> libpq is going to send it in smaller chunks anyway, so I don't see the
> >> importance of trying to keep the batch too large. It just introduces
> >> delay into the sending process. We should be sending chunks that matches
> >> libpq better.
> >
> > More to the point the logic will fail if XLOG_BLCKSZ> PQ_BUFFER_SIZE
> > because it will send partial pages.
>
> I don't see a failure. We rely on not splitting WAL records across
> messages, but we're talking about libpq-level CopyData messages, not TCP
> messages.

> > Having MAX_SEND_SIZE> PQ_BUFFER_SIZE is pointless, as libpq currently
> > stands.
>
> Well, it does affect the size of the read() in walsender, and I'm sure
> there's some overhead in setting the ps display and the other small
> stuff we do once per message. But you're probably right that we could
> easily make MAX_SEND_SIZE much smaller with no noticeable affect on
> performance, while making walsender more responsive to signals. I'll
> decrease it to, say, 512 kB.

I'm pretty certain we don't need to set the ps display once per message.
ps doesn't need an update 5 times per second on average.

There's no reason that the buffer size we use for XLogRead() should be
the same as the send buffer, if you're worried about that. My point is
that pq_putmessage contains internal flushes so at the libpq level you
gain nothing by big batches. The read() will be buffered anyway with
readahead so not sure what the issue is. We'll have to do this for sync
rep anyway, so what's the big deal? Just do it now, once. Do we really
want 9.1 code to differ here?

--
Simon Riggs www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Jim Nasby <decibel(at)decibel(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-18 21:21:45
Message-ID:	1274217705.28911.3922.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2010-05-18 at 17:08 -0400, Heikki Linnakangas wrote:
> On 17/05/10 12:36, Jim Nasby wrote:
> > On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
> >> What exactly is the user trying to monitor? If it's "how far behind is
> >> the standby", the difference between pg_current_xlog_insert_location()
> >> in the master and pg_last_xlog_replay_location() in the standby seems
> >> more robust and well-defined to me. It's a measure of XLOG location (ie.
> >> bytes) instead of time, but time is a complicated concept.
> >
> > I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so.
>
> No doubt about that, the problem is that it's hard to provide a reliable
> time-based indication.

I think I have one now.

--
Simon Riggs www.2ndQuadrant.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-18 21:25:46
Message-ID:	4BF305DA.4040002@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 18/05/10 17:17, Simon Riggs wrote:
> There's no reason that the buffer size we use for XLogRead() should be
> the same as the send buffer, if you're worried about that. My point is
> that pq_putmessage contains internal flushes so at the libpq level you
> gain nothing by big batches. The read() will be buffered anyway with
> readahead so not sure what the issue is. We'll have to do this for sync
> rep anyway, so what's the big deal? Just do it now, once. Do we really
> want 9.1 code to differ here?

Do what? What exactly is it that you want instead, then?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-18 21:37:39
Message-ID:	1274218659.28911.4016.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:
> On 18/05/10 17:17, Simon Riggs wrote:
> > There's no reason that the buffer size we use for XLogRead() should be
> > the same as the send buffer, if you're worried about that. My point is
> > that pq_putmessage contains internal flushes so at the libpq level you
> > gain nothing by big batches. The read() will be buffered anyway with
> > readahead so not sure what the issue is. We'll have to do this for sync
> > rep anyway, so what's the big deal? Just do it now, once. Do we really
> > want 9.1 code to differ here?
>
> Do what? What exactly is it that you want instead, then?

Read and write smaller messages, so the latency is minimised. Libpq will
send in 8192 byte packets, so writing anything larger gains nothing when
the WAL data is also chunked at exactly the same size.

--
Simon Riggs www.2ndQuadrant.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-26 22:34:59
Message-ID:	4BFDA213.70807@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 19/05/10 00:37, Simon Riggs wrote:
> On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:
>> On 18/05/10 17:17, Simon Riggs wrote:
>>> There's no reason that the buffer size we use for XLogRead() should be
>>> the same as the send buffer, if you're worried about that. My point is
>>> that pq_putmessage contains internal flushes so at the libpq level you
>>> gain nothing by big batches. The read() will be buffered anyway with
>>> readahead so not sure what the issue is. We'll have to do this for sync
>>> rep anyway, so what's the big deal? Just do it now, once. Do we really
>>> want 9.1 code to differ here?
>>
>> Do what? What exactly is it that you want instead, then?
>
> Read and write smaller messages, so the latency is minimised. Libpq will
> send in 8192 byte packets, so writing anything larger gains nothing when
> the WAL data is also chunked at exactly the same size.

Committed with chunk size of 128 kB. I hope that's a reasonable
compromise, in the absence of any performance test data either way.

I'm weary of setting it as low as 8k, as there is some per-message
overhead. Some of that could be avoided by rearranging the loops so that
the ps display is not updated at every message as you suggested, but I
don't feel doing any extra rearrangements at this point. It would not be
hard, but it also certainly wouldn't make the code simpler.

I believe in practice 128kB is just as good as 8k from the
responsiveness point of view. If a standby is not responding, walsender
will be stuck trying to send no matter what the block size is. If it
responding, no matter how slowly, 128kB should get transferred pretty
quickly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-26 22:44:24
Message-ID:	1274913864.6203.3688.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-05-27 at 01:34 +0300, Heikki Linnakangas wrote:

> Committed with chunk size of 128 kB.

Thanks. I'm sure that's fine.

--
Simon Riggs www.2ndQuadrant.com

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-26 22:45:09
Message-ID:	4BFDA475.9000404@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> Committed with chunk size of 128 kB. I hope that's a reasonable
> compromise, in the absence of any performance test data either way.

So where are we with max_standby_delay? Status check?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-26 22:47:58
Message-ID:	1274914078.6203.3692.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, 2010-05-16 at 17:11 +0100, Simon Riggs wrote:

> New version, with some other cleanup of wait processing.
>
> New logic is that when Startup asks for next applychunk of WAL it saves
> the lastChunkTimestamp. That is then the base time used by
> WaitExceedsMaxStandbyDelay(), except when latestXLogTime is later.
> Since multiple receivechunks can arrive from primary before Startup asks
> for next applychunk we use the oldest receivechunk timestamp, not the
> latest. Doing it this way means the lastChunkTimestamp doesn't change
> when new keepalives arrive, so we have a stable and reasonably accurate
> recordSendTimestamp for each WAL record.
>
> The keepalive is sent as the first part of a new message, if any. So
> partial chunks of data always have an accurate timestamp, even if that
> is slightly older as a result. Doesn't make much difference except with
> very large chunks.
>
> I think that addresses the points raised on this thread and others.

Was about to post v3 after your last commit, but just found a minor bug
in my v2->v3 changes, even though they were fairly light. Too tired to
fix now. The general thinking underlying this patch is still the same
though and is worth discussing over next few days.

--
Simon Riggs www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-26 22:52:04
Message-ID:	1274914324.6203.3696.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-05-26 at 15:45 -0700, Josh Berkus wrote:
> > Committed with chunk size of 128 kB. I hope that's a reasonable
> > compromise, in the absence of any performance test data either way.
>
> So where are we with max_standby_delay? Status check?

Just this second posted about that, as it turns out.

I have a v3 *almost* ready of the keepalive patch. It still makes sense
to me after a few days reflection, so is worth discussion and review. In
or out, I want this settled within a week. Definitely need some R&R
here.

--
Simon Riggs www.2ndQuadrant.com

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-26 23:22:40
Message-ID:	4BFDAD40.4040107@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> Just this second posted about that, as it turns out.
>
> I have a v3 *almost* ready of the keepalive patch. It still makes sense
> to me after a few days reflection, so is worth discussion and review. In
> or out, I want this settled within a week. Definitely need some R&R
> here.

Does the keepalive fix all the issues with max_standby_delay? Tom?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-27 17:26:26
Message-ID:	1274981186.4405.100.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote:
> > Just this second posted about that, as it turns out.
> >
> > I have a v3 *almost* ready of the keepalive patch. It still makes sense
> > to me after a few days reflection, so is worth discussion and review. In
> > or out, I want this settled within a week. Definitely need some R&R
> > here.
>
> Does the keepalive fix all the issues with max_standby_delay? Tom?

OK, here's v4.

Summary

* WALSender adds a timestamp onto the header of every WAL chunk sent.

* Each WAL record now has a conceptual "send timestamp" that remains
constant while that record is replayed. This is used as the basis from
which max_standby_delay is calculated when required during replay.

* Send timestamp is calculated as the later of the timestamp of chunk in
which WAL record was sent and the latest XLog time.

* WALSender sends an empty message as a keepalive when nothing else to
send. (No longer a special message type for the keepalive).

I think its close, but if there's a gaping hole here somewhere then I'll
punt for this release.

--
Simon Riggs www.2ndQuadrant.com

Attachment	Content-Type	Size
hs_keepalive.v4.patch	text/x-patch	17.0 KB

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-05-31 18:40:02
Message-ID:	201005311840.o4VIe2r28048@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Uh, we have three days before we package 9.0beta2. It would be good if
we could decide on the max_standby_delay issue soon.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote:
> > > Just this second posted about that, as it turns out.
> > >
> > > I have a v3 *almost* ready of the keepalive patch. It still makes sense
> > > to me after a few days reflection, so is worth discussion and review. In
> > > or out, I want this settled within a week. Definitely need some R&R
> > > here.
> >
> > Does the keepalive fix all the issues with max_standby_delay? Tom?
>
> OK, here's v4.
>
> Summary
>
> * WALSender adds a timestamp onto the header of every WAL chunk sent.
>
> * Each WAL record now has a conceptual "send timestamp" that remains
> constant while that record is replayed. This is used as the basis from
> which max_standby_delay is calculated when required during replay.
>
> * Send timestamp is calculated as the later of the timestamp of chunk in
> which WAL record was sent and the latest XLog time.
>
> * WALSender sends an empty message as a keepalive when nothing else to
> send. (No longer a special message type for the keepalive).
>
> I think its close, but if there's a gaping hole here somewhere then I'll
> punt for this release.
>
> --
> Simon Riggs www.2ndQuadrant.com

[ Attachment, skipping... ]

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-01 10:36:59
Message-ID:	4C04E2CB.90304@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 27/05/10 20:26, Simon Riggs wrote:
> On Wed, 2010-05-26 at 16:22 -0700, Josh Berkus wrote:
>>> Just this second posted about that, as it turns out.
>>>
>>> I have a v3 *almost* ready of the keepalive patch. It still makes sense
>>> to me after a few days reflection, so is worth discussion and review. In
>>> or out, I want this settled within a week. Definitely need some R&R
>>> here.
>>
>> Does the keepalive fix all the issues with max_standby_delay? Tom?
>
> OK, here's v4.
>
> Summary
>
> * WALSender adds a timestamp onto the header of every WAL chunk sent.
>
> * Each WAL record now has a conceptual "send timestamp" that remains
> constant while that record is replayed. This is used as the basis from
> which max_standby_delay is calculated when required during replay.
>
> * Send timestamp is calculated as the later of the timestamp of chunk in
> which WAL record was sent and the latest XLog time.
>
> * WALSender sends an empty message as a keepalive when nothing else to
> send. (No longer a special message type for the keepalive).
>
> I think its close, but if there's a gaping hole here somewhere then I'll
> punt for this release.

This certainly alleviates some of the problems. You still need to ensure
that master and standby have synchronized clocks, and you still get zero
grace time after a long period of inactivity when not using streaming
replication, however.

Sending a keep-alive message every 100ms seems overly aggressive to me.

If we really want to try to salvage max_standby_delay with a meaning
similar to what it has now, I think we should go with the idea some
people bashed around earlier and define the grace period as the
difference between a WAL record becoming available to the standby for
replay, and between replaying it. An approximation of that is to do
"lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait
for new WAL to arrive, whether that's via streaming replication or by a
success return code from restore_command, and compare the difference of
that with current timestamp in WaitExceedsMaxStandbyDelay().

That's very simple, doesn't require synchronized clocks, and works the
same with file- and stream-based setups.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-01 11:04:48
Message-ID:	1275390288.21465.279.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Thanks for the review.

On Tue, 2010-06-01 at 13:36 +0300, Heikki Linnakangas wrote:

> If we really want to try to salvage max_standby_delay with a meaning
> similar to what it has now, I think we should go with the idea some
> people bashed around earlier and define the grace period as the
> difference between a WAL record becoming available to the standby for
> replay, and between replaying it. An approximation of that is to do
> "lastIdle = gettimeofday()" in XLogPageRead() whenever it needs to wait
> for new WAL to arrive, whether that's via streaming replication or by a
> success return code from restore_command, and compare the difference of
> that with current timestamp in WaitExceedsMaxStandbyDelay().

That wouldn't cope with a continuous stream of records arriving, unless
you also include the second half of the patch.

> That's very simple, doesn't require synchronized clocks, and works the
> same with file- and stream-based setups.

Nor does it provide a mechanism for monitoring of SR. standby_delay is
explicitly defined in terms of the gap between two servers, so is a
useful real world concept. apply_delay is somewhat less interesting.

I'm sure most people would rather have monitoring and therefore the
requirement for synchronised-ish clocks, than no monitoring. If you
think no monitoring is OK, I don't, but there are other ways, so its not
a point to fight about.

> This certainly alleviates some of the problems. You still need to ensure
> that master and standby have synchronized clocks, and you still get zero
> grace time after a long period of inactivity when not using streaming
> replication, however.

Second issue can be added once we approve the rest of this if you like.

> Sending a keep-alive message every 100ms seems overly aggressive to me.

It's sent every wal_sender_delay. Why is that a negative?

--
Simon Riggs www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 16:41:53
Message-ID:	1275496913.21465.2704.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote:

> Uh, we have three days before we package 9.0beta2. It would be good if
> we could decide on the max_standby_delay issue soon.

I've heard something from Heikki, not from anyone else. Those comments
amount to "lets replace max_standby_delay with max_apply_delay".

Got a beta idea?

--
Simon Riggs www.2ndQuadrant.com

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 16:59:05
Message-ID:	4C0647890200002500031CDA@gw.wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> On Mon, 2010-05-31 at 14:40 -0400, Bruce Momjian wrote:
>
>> Uh, we have three days before we package 9.0beta2. It would be
>> good if we could decide on the max_standby_delay issue soon.
>
> I've heard something from Heikki, not from anyone else. Those
> comments amount to "lets replace max_standby_delay with
> max_apply_delay".
>
> Got a beta idea?

Given the incessant ticking of the clock, I have a hard time
believing we have any real options besides max_standby_delay or a
boolean which corresponds to the -1 and 0 settings of
max_standby_delay. I think it's pretty clear that there's a use
case for the positive values, although there are bound to be some
who try it and are surprised by behavior at transition from idle to
active. The whole debate seems to boil down to how important a
middle ground is versus how damaging the surprise factor is. (I
don't really buy the argument that we won't be able to remove it
later if we replace it with something better.)

I know there were initially some technical problems, too; have
those been resolved?

-Kevin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 17:14:33
Message-ID:	8362.1275498873@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> OK, here's v4.

I've been trying to stay out of this thread, but with beta2 approaching
and no resolution in sight, I'm afraid I have to get involved.

This patch seems to me to be going in fundamentally the wrong direction.
It's adding complexity and overhead (far more than is needed), and it's
failing utterly to resolve the objections that I raised to start with.

In particular, Simon seems to be basically refusing to do anything about
the complaint that the code fails unless master and standby clocks are
in close sync. I do not believe that this is acceptable, and since he
won't fix it, I guess I'll have to.

The other basic problem that I had with the current code, which this
does nothing about, is that it believes that timestamps pulled from WAL
archive should be treated on the same footing as timestamps of "live"
actions. That might work in certain limited scenarios but it's going to
be a disaster in general.

I believe that the motivation for treating archived timestamps as live
is, essentially, to force rapid catchup if a slave falls behind so far
that it's reading from archive instead of SR. There are certainly
use-cases where that's appropriate (though, again, not all); but even
when you do want it it's a pretty inflexible implementation. For
realistic values of max_standby_delay the behavior is going to pretty
much be "instant kill whenever we're reading from archive".

I have backed off my original suggestion that we should reduce
max_standby_delay to a boolean: that was based on the idea that delays
would only occur in response to DDL on the master, but since vacuum and
btree page splits can also trigger delays, it seems clear that a "kill
queries immediately" policy isn't really very useful in practice. So we
need to make max_standby_delay work rather than just get rid of it.

What I think might be a realistic compromise is this:

1. Separate max_standby_delay into two GUCs, say "max_streaming_delay"
and "max_archive_delay".

2. When applying WAL that came across SR, use max_streaming_delay and
let the time measurement be current time minus time of receipt of the
current WAL send chunk.

3. When applying WAL that came from archive, use max_archive_delay and
let the time measurement be current time minus time of acquisition of
the current WAL segment from the archive.

The current code's behavior in the latter case could effectively be
modeled by setting max_archive_delay to zero, but that isn't the only
plausible setting. More likely DBAs would set max_archive_delay to
something smaller than max_streaming_delay, but still positive so as to
not kill conflicting queries instantly.

An important property of this design is that all relevant timestamps
are taken on the slave, so clock skew isn't an issue.

I'm still inclined to apply the part of Simon's patch that adds a
transmit timestamp to each SR send chunk. That would actually be
completely unused by the slave given my proposal above, but I think that
it is an important step to take to future-proof the SR protocol against
possible changes in the slave-side timing logic. I don't however see
the value of transmitting "keepalive" records when we'd otherwise not
have anything to send. The value of adding timestamps to the SR
protocol is really to let the slave determine the current amount of
clock skew between it and the master; which is a number that should not
change so rapidly that it has to be updated every 100ms even in an idle
system.

Comments?

regards, tom lane

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 17:36:52
Message-ID:	20100602173652.GL21875@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> An important property of this design is that all relevant timestamps
> are taken on the slave, so clock skew isn't an issue.

I agree that this is important, and I do run NTP on all my servers and
even monitor it using Nagios.

It's still not a cure-all for time skew issues.

> Comments?

I'm not really a huge fan of adding another GUC, to be honest. I'm more
inclined to say we treat 'max_archive_delay' as '0', and turn
max_streaming_delay into what you've described. If we fall back so far
that we have to go back to reading WALs, then we need to hurry up and
catch-up and damn the torpedos. I'd also prefer that we only wait the
delay time once until we're fully caught up again (and have gotten
back around to waiting for new data).

Thanks,

Stephen

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 17:44:26
Message-ID:	4C06987A.5040306@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> I'm still inclined to apply the part of Simon's patch that adds a
> transmit timestamp to each SR send chunk. That would actually be
> completely unused by the slave given my proposal above, but I think that
> it is an important step to take to future-proof the SR protocol against
> possible changes in the slave-side timing logic.
>

+1.

From a radically different perspective, I had to do something similar
in the buildfarm years ago to protect us from machines reporting with
grossly inaccurate timestamps. This was part of the solution. The client
adds its current timestamp setting just before transmitting the data to
the server.

cheers

andrew

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stephen Frost <sfrost(at)snowman(dot)net>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 17:45:56
Message-ID:	8897.1275500756@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> Comments?

> I'm not really a huge fan of adding another GUC, to be honest. I'm more
> inclined to say we treat 'max_archive_delay' as '0', and turn
> max_streaming_delay into what you've described. If we fall back so far
> that we have to go back to reading WALs, then we need to hurry up and
> catch-up and damn the torpedos.

If I thought that 0 were a generally acceptable value, I'd still be
pushing the "simplify it to a boolean" agenda ;-). The problem is that
that will sometimes kill standby queries even when they are quite short
and doing nothing objectionable.

> I'd also prefer that we only wait the
> delay time once until we're fully caught up again (and have gotten
> back around to waiting for new data).

The delays will be measured from a receipt instant to current time,
which means that the longer it takes to apply a WAL segment or WAL
send chunk, the less grace period there will be. (Which is the
same as what CVS HEAD does --- I'm just arguing about where we get
the start time from.) I believe this does what you suggest and more.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 18:03:50
Message-ID:	1275501830.21465.2774.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> >> Comments?
>
> > I'm not really a huge fan of adding another GUC, to be honest. I'm more
> > inclined to say we treat 'max_archive_delay' as '0', and turn
> > max_streaming_delay into what you've described. If we fall back so far
> > that we have to go back to reading WALs, then we need to hurry up and
> > catch-up and damn the torpedos.
>
> If I thought that 0 were a generally acceptable value, I'd still be
> pushing the "simplify it to a boolean" agenda ;-). The problem is that
> that will sometimes kill standby queries even when they are quite short
> and doing nothing objectionable.

OK, now I understand. I was just thinking the same as Stephen, but now I
agree we need a second parameter.

--
Simon Riggs www.2ndQuadrant.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 18:13:19
Message-ID:	AANLkTilZJVLqHJr6Ic8x74vQmgnMJIfiZ1E7yAUaLtU1@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jun 2, 2010 at 2:03 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, 2010-06-02 at 13:45 -0400, Tom Lane wrote:
>> Stephen Frost <sfrost(at)snowman(dot)net> writes:
>> > * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> >> Comments?
>>
>> > I'm not really a huge fan of adding another GUC, to be honest. I'm more
>> > inclined to say we treat 'max_archive_delay' as '0', and turn
>> > max_streaming_delay into what you've described. If we fall back so far
>> > that we have to go back to reading WALs, then we need to hurry up and
>> > catch-up and damn the torpedos.
>>
>> If I thought that 0 were a generally acceptable value, I'd still be
>> pushing the "simplify it to a boolean" agenda ;-). The problem is that
>> that will sometimes kill standby queries even when they are quite short
>> and doing nothing objectionable.
>
> OK, now I understand. I was just thinking the same as Stephen, but now I
> agree we need a second parameter.

I too was thinking the same as Stephen, but now I also agree we need a
second parameter. I think the whole design Tom proposed seems good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 18:23:44
Message-ID:	AANLkTilSRAyjORRQOxLFFShXuE3UZqZ8QPQfeeGKtybX@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I believe that the motivation for treating archived timestamps as live
> is, essentially, to force rapid catchup if a slave falls behind so far
> that it's reading from archive instead of SR.

Huh, I think this is the first mention of this that I've seen. I
always assumed the motivation was just that you wanted to control how
much data loss could occur on failover and how long recovery would
take. I think separating the two delays is an interesting idea but I
don't see how it counts as a simplification.

This also still allows a slave to become arbitrarily far behind the
master. The master might take a long time to fill up a log file, and
if the slave is blocked for a few minutes, then requests the next bit
of log file and blocks for a few more minutes, then requests the next
log file and blocks for a few more minutes, etc. As long as the slave
is reading more slowly than the master is filling those segments it
will fall further and further behind but never be more than a few
minutes behind receiving the logs.

I propose an alternate way out of the problem of syncing two clocks.
Instead of comparing timestamps compare time intervals. So as it reads
xlog records it only ever compares the master timestamps with previous
master timestamps to determine how much time has elapsed on the
master. It compares that time elapsed with the time elapsed on the
slave to determine if it's falling behind. I'm assuming it
periodically catches up and can throw out its accumulated time elapsed
and start fresh. If not then the accumulated error might be a problem,
but hopefully one we can find a solution to.

--
greg

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 18:27:37
Message-ID:	1275503257.21465.2805.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:

> This patch seems to me to be going in fundamentally the wrong direction.
> It's adding complexity and overhead (far more than is needed), and it's
> failing utterly to resolve the objections that I raised to start with.

Having read your proposal, it seems changing from time-on-sender to
time-on-receiver is a one line change to the patch.

What else are you thinking of removing, if anything?

Adding an extra parameter adds more obviously and is something I now
agree with.

> In particular, Simon seems to be basically refusing to do anything about
> the complaint that the code fails unless master and standby clocks are
> in close sync. I do not believe that this is acceptable, and since he
> won't fix it, I guess I'll have to.

Syncing two servers in replication is common practice, as has been
explained here; I'm still surprised people think otherwise. Measuring
the time between two servers is the very purpose of the patch, so the
synchronisation is not a design flaw, it is its raison d'etre. There's
been a few spleens emptied on that topic, not all of them mine, and
certainly no consensus on that. So I'm not refusing to do anything
that's been agreed...

--
Simon Riggs www.2ndQuadrant.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 18:35:10
Message-ID:	AANLkTimn-Q3QFejMk7px6x3PPvec2EowDNA6l-Nka9J6@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jun 2, 2010 at 2:27 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Syncing two servers in replication is common practice, as has been
> explained here; I'm still surprised people think otherwise. Measuring
> the time between two servers is the very purpose of the patch, so the
> synchronisation is not a design flaw, it is its raison d'etre.

I think the purpose of the patch should not be to measure the time
difference between servers, but rather the replication delay. While I
don't necessarily agree with Tom's statement that this is must-fix, I
do agree that it would be nicer if we could avoid depending on time
sync. Yeah, I keep my servers time synced, too. But, shit happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	"caught_up" status in walsender
Date:	2010-06-02 18:44:43
Message-ID:	9893.1275504283@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> I'm still inclined to apply the part of Simon's patch that adds a
> transmit timestamp to each SR send chunk. That would actually be
> completely unused by the slave given my proposal above, but I think that
> it is an important step to take to future-proof the SR protocol against
> possible changes in the slave-side timing logic.

On further contemplation, it seems like the protocol needs another field
besides that: each record should also carry a boolean indicating whether
walsender.c thinks it is currently "caught up", ie the record carries
all WAL data up to the current end of WAL. If the sender is not caught
up, then the receiver should apply max_archive_delay not
max_streaming_delay. In this way, WAL chunks that are a little bit
behind current time will be treated the same way whether they come
across the SR link or via the archive.

In conjunction with that, I think there's a logic bug in XLogSend;
it ought to be changed like so:

/* if we went beyond SendRqstPtr, back off */
if (XLByteLT(SendRqstPtr, endptr))
+ {
endptr = SendRqstPtr;
+ *caughtup = true;
+ }

In the current coding, the effect of not setting *caughtup here is just
that we uselessly call XLogSend an extra time for each transmission
(because the main loop won't ever delay immediately after a
transmission). But without this, we'd never send caughtup = true
to the slave.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 19:14:24
Message-ID:	10389.1275506064@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Wed, Jun 2, 2010 at 6:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I believe that the motivation for treating archived timestamps as live
>> is, essentially, to force rapid catchup if a slave falls behind so far
>> that it's reading from archive instead of SR.

> Huh, I think this is the first mention of this that I've seen. I
> always assumed the motivation was just that you wanted to control how
> much data loss could occur on failover and how long recovery would
> take. I think separating the two delays is an interesting idea but I
> don't see how it counts as a simplification.

Well, it isn't a simplification: it's bringing it up to the minimum
complication level where it'll actually work sanely. The current
implementation doesn't work sanely because it confuses stale timestamps
read from WAL with real live time.

> This also still allows a slave to become arbitrarily far behind the
> master.

Indeed, but nothing we do can prevent that, if the slave is just plain
slower than the master. You have to assume that the slave is capable of
keeping up in the absence of query-caused delays, or you're hosed.

The real reason this is at issue is the fact that the max_standby_delay
kill mechanism applies to certain buffer-level locking operations.
On the master we just wait, and it's not a problem, because in practice
the conflicting queries almost always release these locks pretty quick.
On the slave, though, instant kill as a result of a buffer-level lock
conflict would result in a very serious degradation in standby query
reliability (while also doing practically nothing for the speed of WAL
application, most of the time). This morning on the phone Bruce and I
were seriously discussing the idea of ripping the max_standby_delay
mechanism out of the buffer-level locking paths, and just let them work
like they do on the master, ie, wait forever. If we did that then
simplifying max_standby_delay to a boolean would be reasonable again
(because it really would only come into play for DDL on the master).
The sticky point is that once in a blue moon you do have a conflicting
query sitting on a buffer lock for a long time, or even more likely a
series of queries keeping the WAL replay process from obtaining buffer
cleanup lock.

So it seems that we have to have max_standby_delay-like logic for those
locks, and also that zero grace period before kill isn't a very practical
setting. However, there isn't a lot of point in obsessing over exactly
how long the grace period ought to be, as long as it's more than a few
milliseconds. It *isn't* going to have any real effect on whether the
slave can stay caught up. You could make a fairly decent case for just
measuring the grace period from when the replay process starts to wait,
as I think I proposed awhile back. The value of measuring delay from a
receipt time is that if you do happen to have a bunch of delays within
a short interval you'll get more willing to kill queries --- but I
really believe that that is a corner case and will have nothing to do
with ordinary performance.

> I propose an alternate way out of the problem of syncing two clocks.
> Instead of comparing timestamps compare time intervals. So as it reads
> xlog records it only ever compares the master timestamps with previous
> master timestamps to determine how much time has elapsed on the
> master. It compares that time elapsed with the time elapsed on the
> slave to determine if it's falling behind.

I think this would just add complexity and uncertainty, to deal with
something that won't be much of a problem in practice.

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-02 19:14:45
Message-ID:	4C06ADA5.1040409@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/06/10 21:44, Tom Lane wrote:
> In conjunction with that, I think there's a logic bug in XLogSend;
> it ought to be changed like so:
>
> /* if we went beyond SendRqstPtr, back off */
> if (XLByteLT(SendRqstPtr, endptr))
> + {
> endptr = SendRqstPtr;
> + *caughtup = true;
> + }
>
> In the current coding, the effect of not setting *caughtup here is just
> that we uselessly call XLogSend an extra time for each transmission
> (because the main loop won't ever delay immediately after a
> transmission). But without this, we'd never send caughtup = true
> to the slave.

That's intentional. It could take some time for the WAL to be sent, if
the network is busy, so by the time XLogSend returns you might well not
be caught up anymore.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-02 19:21:36
Message-ID:	10535.1275506496@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 02/06/10 21:44, Tom Lane wrote:
>> In the current coding, the effect of not setting *caughtup here is just
>> that we uselessly call XLogSend an extra time for each transmission
>> (because the main loop won't ever delay immediately after a
>> transmission). But without this, we'd never send caughtup = true
>> to the slave.

> That's intentional. It could take some time for the WAL to be sent, if
> the network is busy, so by the time XLogSend returns you might well not
> be caught up anymore.

It may have been intentional, but it's still wrong. If you were able to
pull all of WAL into the record-to-be-sent, you should sleep afterwards,
not send an extra record containing a few more bytes.

regards, tom lane

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-02 19:40:21
Message-ID:	m2ocft5il6.fsf@hi-media.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I wrote:
>> I'm still inclined to apply the part of Simon's patch that adds a
>> transmit timestamp to each SR send chunk. That would actually be
>> completely unused by the slave given my proposal above, but I think that
>> it is an important step to take to future-proof the SR protocol against
>> possible changes in the slave-side timing logic.
>
> On further contemplation, it seems like the protocol needs another field
> besides that: each record should also carry a boolean indicating whether
> walsender.c thinks it is currently "caught up", ie the record carries
> all WAL data up to the current end of WAL. If the sender is not caught
> up, then the receiver should apply max_archive_delay not
> max_streaming_delay. In this way, WAL chunks that are a little bit
> behind current time will be treated the same way whether they come
> across the SR link or via the archive.

Then we'd have max_catchup_delay and max_streaming_delay, wouldn't we?

I'm still trying to understand the implications of the proposal, which
sounds good but… given just enough load on the slave, wouldn't it be
playing catchup all the time? Wouldn't enough load mean one read query
per write commit on the master?

Regards,
--
dim

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 19:41:49
Message-ID:	4C06B3FD.4090709@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/06/10 20:14, Tom Lane wrote:
> For realistic values of max_standby_delay ...

Hang on right there. What do you consider a realistic value for
max_standby_delay? Because I'm not sure I have a grip on that myself. 5
seconds? 5 minutes? 5 hours? I can see use cases for all of those...

> What I think might be a realistic compromise is this:
>
> 1. Separate max_standby_delay into two GUCs, say "max_streaming_delay"
> and "max_archive_delay".
>
> 2. When applying WAL that came across SR, use max_streaming_delay and
> let the time measurement be current time minus time of receipt of the
> current WAL send chunk.
>
> 3. When applying WAL that came from archive, use max_archive_delay and
> let the time measurement be current time minus time of acquisition of
> the current WAL segment from the archive.
>
> The current code's behavior in the latter case could effectively be
> modeled by setting max_archive_delay to zero, but that isn't the only
> plausible setting. More likely DBAs would set max_archive_delay to
> something smaller than max_streaming_delay, but still positive so as to
> not kill conflicting queries instantly.

The problem with defining max_archive_delay that way is again that you
can fall behind indefinitely. If you set it to 5 minutes, it means that
you'll wait a maximum of 5 minutes *per WAL segment*, even if WAL is
being generated faster.

I don't understand why you want to use a different delay when you're
restoring from archive vs. when you're streaming (what about existing
WAL files found in pg_xlog, BTW?). The source of WAL shouldn't make a
difference. If it's because you assume that restoring from archive is a
sign that you've fallen behind a lot, surely you've exceeded
max_standby_delay then and I still don't see a need for a separate GUC.

I stand by my suggestion from yesterday: Let's define max_standby_delay
as the difference between a piece of WAL becoming available in the
standby, and applying it. To approximate "piece of WAL becoming
available" for SR, we can use the mechanism with send/applyChunks from
Simon's latest patch, or go with the simpler scheme of just resetting a
"last caughtup timestamp" to current time whenever we have to wait for
new WAL to arrive. When restoring from archive, likewise reset "last
caughtup timestamp" whenever restore_command returns non-0, i.e we have
to wait for the next WAL file to arrive.

That works the same for both SR and file-based log shipping, there's
only one knob to set, is simple to implement and doesn't require
synchronized clocks.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-02 19:50:17
Message-ID:	11008.1275508217@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Dimitri Fontaine <dfontaine(at)hi-media(dot)com> writes:
> I'm still trying to understand the implications of the proposal, which
> sounds good but given just enough load on the slave, wouldn't it be
> playing catchup all the time?

Uh, if the slave is overloaded, *any* implementation will be playing
catchup all the time. Not sure about your point here.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 20:00:47
Message-ID:	11199.1275508847@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> The problem with defining max_archive_delay that way is again that you
> can fall behind indefinitely.

See my response to Greg Stark --- I don't think this is really an
issue. It's certainly far less of an issue than the current situation
that query grace periods go to zero under not-at-all-extreme conditions.

> I don't understand why you want to use a different delay when you're
> restoring from archive vs. when you're streaming (what about existing
> WAL files found in pg_xlog, BTW?).

You're missing the point. I want the DBA to be able to control what
happens in those two cases. In the current implementation he has no
control over what happens while restoring from archive: it's going to
effectively act like max_archive_delay = 0 all the time. If you're of
the opinion that's good, you can set the parameter that way and be
happy. I'm of the opinion you'll soon find out it isn't so good,
because it'll kill standby queries too easily.

> I stand by my suggestion from yesterday: Let's define max_standby_delay
> as the difference between a piece of WAL becoming available in the
> standby, and applying it.

My proposal is essentially the same as yours plus allowing the DBA to
choose different max delays for the caught-up and not-caught-up cases.
Maybe everybody will end up setting the two delays the same, but I think
we don't have enough experience to decide that for them now.

regards, tom lane

From:	Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-02 20:20:08
Message-ID:	m2bpbt426f.fsf@hi-media.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Uh, if the slave is overloaded, *any* implementation will be playing
> catchup all the time. Not sure about your point here.

Well, sorry, I missed the part where the catchup is measured on
walsender side of things. Now that means that a non interrupted flow of
queries quicker than the wal sender delay (100ms, right?) on the slave
would certainly leave enough room for it to stay current.

Well that depends also on the time it takes to replay the wals.

I'm trying to decide if exposing this 100ms magic setup (or something
else) would allow for a lot more control with respect to what means
being overloaded.

Say you set the 100ms delay to any value that you know (from testing and
log_min_duration_statement, say) just a little higher than the slowest
query you typically run on the slave. If that reduces the chances to
ever playing cathing-up (as soon as there's no unexpected slow query
there), that would be great.

If you can manage to make sense of this, I'm interested into hearing how
far it is from reality.

Regards,
--
dim

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-02 20:53:36
Message-ID:	AANLkTiknR4P8P7TVf7cNJu4eRBwNhr-qliKPhK8BQedz@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jun 2, 2010 at 2:44 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> I'm still inclined to apply the part of Simon's patch that adds a
>> transmit timestamp to each SR send chunk. That would actually be
>> completely unused by the slave given my proposal above, but I think that
>> it is an important step to take to future-proof the SR protocol against
>> possible changes in the slave-side timing logic.
>
> On further contemplation, it seems like the protocol needs another field
> besides that: each record should also carry a boolean indicating whether
> walsender.c thinks it is currently "caught up", ie the record carries
> all WAL data up to the current end of WAL. If the sender is not caught
> up, then the receiver should apply max_archive_delay not
> max_streaming_delay. In this way, WAL chunks that are a little bit
> behind current time will be treated the same way whether they come
> across the SR link or via the archive.

I'm not sure that makes sense. I thought the point of separating the
archive and streaming cases was that the same timeout wouldn't
necessarily be correct for a 16MB WAL file as it would for a 16-byte
WAL record you've just received. IOW, you might want
max_archive_delay > max_streaming_delay. The original proposal also
seems somewhat easier to understand, to me at least.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 22:59:23
Message-ID:	AANLkTiljJpCHPLhJqcAJvYevskZ5RBeqt-JqBAchX2AL@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jun 2, 2010 at 8:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Indeed, but nothing we do can prevent that, if the slave is just plain
> slower than the master. You have to assume that the slave is capable of
> keeping up in the absence of query-caused delays, or you're hosed.

I was assuming the walreceiver only requests more wal in relatively
small chunks and only when replay has caught up and needs more data. I
haven't actually read this code so if that assumption is wrong then
I'm off-base. But if my assumption is right then it's not merely the
master running faster than the slave that can cause you to fall
arbitrarily far behind. The "receive time" is delayed by how long the
slave waited on a previous lock, so if we wait for a few minutes, then
proceed and find we need more wal data we'll read data from the master
that it could have generated those few minutes ago.

> The sticky point is that once in a blue moon you do have a conflicting
> query sitting on a buffer lock for a long time, or even more likely a
> series of queries keeping the WAL replay process from obtaining buffer
> cleanup lock.

So I think this isn't necessarily such a blue moon event. As I
understand it, all it would take is a single long-running report and a
vacuum or HOT cleanup occurring on the master. If I want to set
max_standby_delay to 60min to allow reports to run for up to an hour
then any random HOT cleanup on the master will propagate to the slave
and cause a WAL stall until the transactions which have that xid in
their snapshot end.

Even the buffer locks are could potentially be blocked for a long time
if you happen to run the right kind of query (say a nested loop with
the buffer in question on the outside and a large table on the
inside). That's a rarer event though; is that what you were thinking
of?

--
greg

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-02 23:11:08
Message-ID:	13557.1275520268@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> I was assuming the walreceiver only requests more wal in relatively
> small chunks and only when replay has caught up and needs more data. I
> haven't actually read this code so if that assumption is wrong then
> I'm off-base.

It is off-base. The receiver does not "request" data, the sender is
what determines how much WAL is sent when.

> So I think this isn't necessarily such a blue moon event. As I
> understand it, all it would take is a single long-running report and a
> vacuum or HOT cleanup occurring on the master.

I think this is mostly FUD too. How often do you see vacuum blocked for
an hour now? It probably can happen, which is why we need to be able to
kick queries off the locks with max_standby_delay, but it's far from
common. What we're concerned about here is how long a buffer lock on a
single page is held, not how long heavyweight locks are held. The
normal hold times are measured in microseconds.

regards, tom lane

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 00:47:10
Message-ID:	20100603004710.GO21875@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > So I think this isn't necessarily such a blue moon event. As I
> > understand it, all it would take is a single long-running report and a
> > vacuum or HOT cleanup occurring on the master.
>
> I think this is mostly FUD too. How often do you see vacuum blocked for
> an hour now? It probably can happen, which is why we need to be able to
> kick queries off the locks with max_standby_delay, but it's far from
> common. What we're concerned about here is how long a buffer lock on a
> single page is held, not how long heavyweight locks are held. The
> normal hold times are measured in microseconds.

Maybe I'm missing something, but I think Greg's point was that if you
have a long-running query running against the standby/slave/whatever,
which is holding locks on various relations to implement that report,
and then a vacuum or HOT update happens on the master, the long-running
report query will get killed off unless you bump max_streaming_delay up
pretty high (eg: 60 mins).

That being said, I'm not sure that there's really another solution.
Yes, in this case, the slave can end up being an hour behind, but that's
the trade-off you have to make if you want to run an hour-long query on
the slave. The other answer is to make the master not update those
tuples, etc, which might be possible by starting a transaction on the
master which grabs things enough to prevent the vacuum/hot/etc update
from happening. That may be possible manually, but it's not fun and it
certainly isn't something we'll have built-in support for in 9.0.

Thanks,

Stephen

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 07:56:54
Message-ID:	1275551814.21465.2841.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote:
> the current situation that query grace periods go to zero

Possibly a better way to handle this concern is to make the second
parameter: min_standby_grace_period - the minimum time a query will be
given in which to execute, even if max_standby_delay has been reached or
exceeded.

Would that more directly address you concerns?

min_standby_grace_period (ms) SIGHUP

--
Simon Riggs www.2ndQuadrant.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 08:56:49
Message-ID:	AANLkTinAKcoyDYip7zL8Xv9STuTh5YdywSAAvQndp4iW@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I don't understand why you want to use a different delay when you're
> restoring from archive vs. when you're streaming (what about existing WAL
> files found in pg_xlog, BTW?). The source of WAL shouldn't make a
> difference.

Yes. The pace of a recovery has nothing to do with that of log shipping.
So to hurry up a recovery when restoring from archive seems to be useless.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 09:07:39
Message-ID:	1275556059.21465.2860.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote:
> On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> > I don't understand why you want to use a different delay when you're
> > restoring from archive vs. when you're streaming (what about existing WAL
> > files found in pg_xlog, BTW?). The source of WAL shouldn't make a
> > difference.
>
> Yes. The pace of a recovery has nothing to do with that of log shipping.
> So to hurry up a recovery when restoring from archive seems to be useless.

When streaming drops for some reason we revert to scanning the archive
for files. There is clearly two modes of operation. So it makes sense
that you might want to set different times for the parameter in each
case.

We might think of those modes as "connected" and "degraded" modes rather
than streaming and file shipping.

--
Simon Riggs www.2ndQuadrant.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-03 09:18:50
Message-ID:	AANLkTikBs95qnsl00fsfrEurGXIuY9dvYJEA-ik4GShQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 3, 2010 at 4:21 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 02/06/10 21:44, Tom Lane wrote:
>>> In the current coding, the effect of not setting *caughtup here is just
>>> that we uselessly call XLogSend an extra time for each transmission
>>> (because the main loop won't ever delay immediately after a
>>> transmission). But without this, we'd never send caughtup = true
>>> to the slave.
>
>> That's intentional. It could take some time for the WAL to be sent, if
>> the network is busy, so by the time XLogSend returns you might well not
>> be caught up anymore.
>
> It may have been intentional, but it's still wrong. If you were able to
> pull all of WAL into the record-to-be-sent, you should sleep afterwards,
> not send an extra record containing a few more bytes.

For reducing the workload of walsender?

This seems OK in 9.0 since only asynchronous replication is supported.
But when we'll implement synchronous replication in the future, we
might have to revert that change. Since a transaction commit might wait
for such an extra record to be replicated, walsender should aggressively
send all sendable WAL.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 09:39:48
Message-ID:	AANLkTinUrN_XO5dRqIaIKfEaMZPGd9IrwYwXz3-qVdvm@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 3, 2010 at 6:07 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, 2010-06-03 at 17:56 +0900, Fujii Masao wrote:
>> On Thu, Jun 3, 2010 at 4:41 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> > I don't understand why you want to use a different delay when you're
>> > restoring from archive vs. when you're streaming (what about existing WAL
>> > files found in pg_xlog, BTW?). The source of WAL shouldn't make a
>> > difference.
>>
>> Yes. The pace of a recovery has nothing to do with that of log shipping.
>> So to hurry up a recovery when restoring from archive seems to be useless.
>
> When streaming drops for some reason we revert to scanning the archive
> for files. There is clearly two modes of operation.

Yes.

> So it makes sense
> that you might want to set different times for the parameter in each
> case.

What purpose would that serve?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 10:57:33
Message-ID:	1275562653.21465.2868.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-06-03 at 18:39 +0900, Fujii Masao wrote:

> What purpose would that serve?

Tom has already explained this and it makes sense for me.

--
Simon Riggs www.2ndQuadrant.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 12:14:53
Message-ID:	AANLkTilx2PP8QHxSu1CCsaZsj9BcOnT9Nq4TMwL-7u35@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 3, 2010 at 5:00 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I stand by my suggestion from yesterday: Let's define max_standby_delay
>> as the difference between a piece of WAL becoming available in the
>> standby, and applying it.
>
> My proposal is essentially the same as yours plus allowing the DBA to
> choose different max delays for the caught-up and not-caught-up cases.
> Maybe everybody will end up setting the two delays the same, but I think
> we don't have enough experience to decide that for them now.

Applying WAL that arrives via SR is not always the sign of the caught-up
or not-caught-up. OTOH, applying WAL restored from archive is not always
the sign of either of them. So isn't it nonsense to separate the delay in
order to control the behavior of a recovery for those cases?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-03 13:47:19
Message-ID:	23830.1275572839@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Thu, Jun 3, 2010 at 4:21 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> That's intentional. It could take some time for the WAL to be sent, if
>>> the network is busy, so by the time XLogSend returns you might well not
>>> be caught up anymore.
>>
>> It may have been intentional, but it's still wrong. If you were able to
>> pull all of WAL into the record-to-be-sent, you should sleep afterwards,
>> not send an extra record containing a few more bytes.

> For reducing the workload of walsender?

> This seems OK in 9.0 since only asynchronous replication is supported.
> But when we'll implement synchronous replication in the future, we
> might have to revert that change. Since a transaction commit might wait
> for such an extra record to be replicated, walsender should aggressively
> send all sendable WAL.

It *is* aggressively sending all sendable WAL. The ideal steady state
behavior of this loop ought to be that once per sleep interval, we send
out one record containing all new WAL since the last time. We do not
want it sending 10000 bytes, then another record with 100 bytes, then
another record with 10 bytes, etc etc. That's inefficient and
ultimately pointless. You'll always be behind again a millisecond
later.

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 15:07:10
Message-ID:	AANLkTikmXYcu6brR3aqwGxcW6j5I9QhMPo7L7Bkw0kP7@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> I was assuming the walreceiver only requests more wal in relatively
>> small chunks and only when replay has caught up and needs more data. I
>> haven't actually read this code so if that assumption is wrong then
>> I'm off-base.
>
> It is off-base. The receiver does not "request" data, the sender is
> what determines how much WAL is sent when.

Hm, so what happens if the slave blocks, doesn't the sender block when
the kernel buffers fill up?

>> So I think this isn't necessarily such a blue moon event. As I
>> understand it, all it would take is a single long-running report and a
>> vacuum or HOT cleanup occurring on the master.
>
> I think this is mostly FUD too. How often do you see vacuum blocked for
> an hour now?

No, that's not comparable. On the master vacuum will just ignore
tuples that are still visible to live transactions. On the slave it
doesn't have a choice, it sees the cleanup record and must pause
recovery until those transactions finish.

--
greg

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 15:18:06
Message-ID:	26671.1275578286@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> It is off-base. The receiver does not "request" data, the sender is
>> what determines how much WAL is sent when.

> Hm, so what happens if the slave blocks, doesn't the sender block when
> the kernel buffers fill up?

Well, if the slave can't keep up, that's a separate problem. It will
not fail to keep up as a result of the transmission mechanism.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: "caught_up" status in walsender
Date:	2010-06-03 15:26:07
Message-ID:	26798.1275578767@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> On further contemplation, it seems like the protocol needs another field
> besides that: each record should also carry a boolean indicating whether
> walsender.c thinks it is currently "caught up", ie the record carries
> all WAL data up to the current end of WAL.

Actually, there's a better way to do that: let's have the record carry
not just a boolean but the actual current end-of-WAL LSN. The receiver
could then not just determine "am I behind" but find out *how far*
behind it is, and thereby perhaps adjust its behavior in more subtle
ways than just a binary on/off fashion.

(Actually doing anything like that is material for future work, of
course, but I think we should try to get the SR protocol right now.)

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 15:27:51
Message-ID:	AANLkTil0gaKqJjqpoyBlSD3dZnUETQM35updVcfmMQk_@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 3, 2010 at 4:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> On Thu, Jun 3, 2010 at 12:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> It is off-base. The receiver does not "request" data, the sender is
>>> what determines how much WAL is sent when.
>
>> Hm, so what happens if the slave blocks, doesn't the sender block when
>> the kernel buffers fill up?
>
> Well, if the slave can't keep up, that's a separate problem. It will
> not fail to keep up as a result of the transmission mechanism.

No, I mean if the slave is paused due to a conflict. Does it stop
reading data from the master or does it buffer it up on disk? If it
stops reading it from the master then the effect is the same as if the
slave stopped "requesting" data even if there's no actual request.

--
greg

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 15:34:43
Message-ID:	26977.1275579283@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
>> Well, if the slave can't keep up, that's a separate problem. It will
>> not fail to keep up as a result of the transmission mechanism.

> No, I mean if the slave is paused due to a conflict. Does it stop
> reading data from the master or does it buffer it up on disk? If it
> stops reading it from the master then the effect is the same as if the
> slave stopped "requesting" data even if there's no actual request.

The data keeps coming in and getting dumped into the slave's pg_xlog.
walsender/walreceiver are not at all tied to the slave's application
of WAL. In principle we could have the code around max_standby_delay
notice just how far behind it's gotten and adjust the delay tolerance
based on that; but I think designing a feedback loop for that is 9.1
material.

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 15:43:54
Message-ID:	AANLkTinziZwRWdTdAnc7XiaOr0DYuY56GX47shmST5s_@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Jun 3, 2010 at 4:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The data keeps coming in and getting dumped into the slave's pg_xlog.
> walsender/walreceiver are not at all tied to the slave's application
> of WAL. In principle we could have the code around max_standby_delay
> notice just how far behind it's gotten and adjust the delay tolerance
> based on that; but I think designing a feedback loop for that is 9.1
> material.

Er, no. In that case my first concern was misguided. I'm happy there's
no feedback loop -- my fear was that there was and it would mean the
"time received" could be decoupled from the time the wal was
generated. But as you describe it then the time received might be
slightly delayed from the time the wal was generated but to some
constant degree -- not in a way that will be influenced by the log
application being blocked on the slave.

--
greg

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 16:47:47
Message-ID:	6429.1275583667@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:
>> This patch seems to me to be going in fundamentally the wrong direction.
>> It's adding complexity and overhead (far more than is needed), and it's
>> failing utterly to resolve the objections that I raised to start with.

> Having read your proposal, it seems changing from time-on-sender to
> time-on-receiver is a one line change to the patch.

> What else are you thinking of removing, if anything?

Basically, we need to get rid of everything that feeds timestamps from
the WAL content into the kill-delay logic.

>> In particular, Simon seems to be basically refusing to do anything about
>> the complaint that the code fails unless master and standby clocks are
>> in close sync. I do not believe that this is acceptable, and since he
>> won't fix it, I guess I'll have to.

> Syncing two servers in replication is common practice, as has been
> explained here; I'm still surprised people think otherwise.

Doesn't affect the complaint in the least: I do not find it acceptable
to have that be *mandated* in order for our code to work sensibly.
I would be OK with having something approaching what you want as a
non-default optional behavior (with a clearly-documented dependency
on having synchronized clocks). But in any case the current behavior is
still quite broken as regards reading stale timestamps from WAL.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 16:58:45
Message-ID:	6558.1275584325@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Wed, 2010-06-02 at 16:00 -0400, Tom Lane wrote:
>> the current situation that query grace periods go to zero

> Possibly a better way to handle this concern is to make the second
> parameter: min_standby_grace_period - the minimum time a query will be
> given in which to execute, even if max_standby_delay has been reached or
> exceeded.

> Would that more directly address you concerns?
> min_standby_grace_period (ms) SIGHUP

A minimum grace period seems like a good idea to me, but I think it's
somewhat orthogonal to the core problem here. I think we all
intuitively feel that there should be a way to dial back the grace
period when a slave is "far behind" on applying WAL. The problem is
first how to figure out what "far behind" means, and second how to
adjust the grace period in a way that doesn't have surprising
misbehaviors. A minimum grace period would prevent some of the very
worst misbehaviors but it's not really addressing the core problem.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 17:18:24
Message-ID:	1275585504.21465.3023.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Wed, 2010-06-02 at 13:14 -0400, Tom Lane wrote:
> >> This patch seems to me to be going in fundamentally the wrong direction.
> >> It's adding complexity and overhead (far more than is needed), and it's
> >> failing utterly to resolve the objections that I raised to start with.
>
> > Having read your proposal, it seems changing from time-on-sender to
> > time-on-receiver is a one line change to the patch.
>
> > What else are you thinking of removing, if anything?
>
> Basically, we need to get rid of everything that feeds timestamps from
> the WAL content into the kill-delay logic.

I understand your wish to do this, though it isn't always accurate to do
that in the case where there is a backlog.

The patch already does that *mostly* for the streaming case. The only
time it does use the WAL content timestamp is when the WAL content
timestamp is later than the oldest receive time, so in that case it is
used as a correction. (see code comments and comments below also)

> >> In particular, Simon seems to be basically refusing to do anything about
> >> the complaint that the code fails unless master and standby clocks are
> >> in close sync. I do not believe that this is acceptable, and since he
> >> won't fix it, I guess I'll have to.
>
> > Syncing two servers in replication is common practice, as has been
> > explained here; I'm still surprised people think otherwise.
>
> Doesn't affect the complaint in the least: I do not find it acceptable
> to have that be *mandated* in order for our code to work sensibly.

OK, accepted.

> I would be OK with having something approaching what you want as a
> non-default optional behavior (with a clearly-documented dependency
> on having synchronized clocks).

Yes, that's what I'd been thinking also. So that gives us a way
forwards.

standby_delay_base = apply (default) | send

determines whether the standby_delay is calculated solely with reference
to the standby server (apply) or whether times from the sending server
are used (send). Use of send implies that the clocks on primary and
standby are synchronised to within a useful accuracy, in which case it
is usual to enforce that with time synchronisation software such as ntp.

> But in any case the current behavior is
> still quite broken as regards reading stale timestamps from WAL.

Agreed. That wasn't the objective of this patch or a priority.

If you're reading from an archive, you need to set max_standby_delay
appropriately, current recommendation is -1.

We can address that concern once the main streaming case is covered, or
we can add that now.

Are you planning to work on these things now as you said? How about I
apply my patch now, then you do another set of changes afterwards to add
the other items you mentioned, since that is now 2 additional parameters
and related code?

--
Simon Riggs www.2ndQuadrant.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 17:32:45
Message-ID:	7229.1275586365@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote:
>> But in any case the current behavior is
>> still quite broken as regards reading stale timestamps from WAL.

> Agreed. That wasn't the objective of this patch or a priority.

> If you're reading from an archive, you need to set max_standby_delay
> appropriately, current recommendation is -1.

That's not a fix; the problem is that there *is no* appropriate setting
for max_standby_delay when you're reading from archive. It is unlikely
that the values of the timestamps you're reading should be considered to
have any bearing on what grace period to allow; but nonetheless it's
desirable to be able to have a non-infinite grace time.

Basically what I think is that we want what you called "apply" semantics
always for reading from archive (and I still think the DBA should be
able to set that grace period separately from the one that applies in
SR operation). Paying attention to the master's timestamps is only
reasonable in the SR context.

And yes, I want the dependency on WAL timestamps to be gone completely,
not just mostly. I think that driving the delay logic off of SR receipt
time (and/or the timestamp we agreed to add to the SR protocol) is the
appropriate and sufficient way to deal with the problem. Looking at the
embedded timestamps just confuses the feedback loop behavior.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 17:46:31
Message-ID:	1275587191.21465.3092.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-06-03 at 13:32 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Thu, 2010-06-03 at 12:47 -0400, Tom Lane wrote:
> >> But in any case the current behavior is
> >> still quite broken as regards reading stale timestamps from WAL.
>
> > Agreed. That wasn't the objective of this patch or a priority.
>
> > If you're reading from an archive, you need to set max_standby_delay
> > appropriately, current recommendation is -1.
>
> That's not a fix; the problem is that there *is no* appropriate setting
> for max_standby_delay when you're reading from archive. It is unlikely
> that the values of the timestamps you're reading should be considered to
> have any bearing on what grace period to allow; but nonetheless it's
> desirable to be able to have a non-infinite grace time.

I'm not saying that's a fix; I agree we should change that also.

> Basically what I think is that we want what you called "apply" semantics
> always for reading from archive (and I still think the DBA should be
> able to set that grace period separately from the one that applies in
> SR operation). Paying attention to the master's timestamps is only
> reasonable in the SR context.

Agreed.

> And yes, I want the dependency on WAL timestamps to be gone completely,
> not just mostly. I think that driving the delay logic off of SR receipt
> time (and/or the timestamp we agreed to add to the SR protocol) is the
> appropriate and sufficient way to deal with the problem. Looking at the
> embedded timestamps just confuses the feedback loop behavior.

I disagree with "sufficient", with good reason. Please look at the code
comments and see what will happen if we don't make the correction
suggested. If after that you still disagree, then do it your way and
we'll wait for comments in the beta; I think there will be some, but I
am tired and prone to agreement to allow this to go through.

--
Simon Riggs www.2ndQuadrant.com

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 22:48:48
Message-ID:	1275605328.1738.1.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:

> Are you planning to work on these things now as you said?

Are you? Or do you want me to?

--
Simon Riggs www.2ndQuadrant.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-03 23:02:00
Message-ID:	21276.1275606120@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:
>> Are you planning to work on these things now as you said?

> Are you? Or do you want me to?

I decided there wasn't time to get anything useful done on it before the
beta2 deadline (which is, more or less, right now). I will take another
look over the next few days.

regards, tom lane

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-09 22:52:31
Message-ID:	1276123951.23257.1114.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Thu, 2010-06-03 at 18:18 +0100, Simon Riggs wrote:
> >> Are you planning to work on these things now as you said?
>
> > Are you? Or do you want me to?
>
> I decided there wasn't time to get anything useful done on it before the
> beta2 deadline (which is, more or less, right now). I will take another
> look over the next few days.

Esteemed colleague Tom,

We all really need you to fix up max_standby_delay, or, let me do it.

It appears we both agree that more testing would be beneficial. The only
way we are going to get that is by finishing this off and declaring a
new HS-beta.

Please let me know how you wish to proceed. (No answer means I do it.)

--
Simon Riggs www.2ndQuadrant.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-10 00:01:58
Message-ID:	3591.1276128118@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
>> I decided there wasn't time to get anything useful done on it before the
>> beta2 deadline (which is, more or less, right now). I will take another
>> look over the next few days.

> We all really need you to fix up max_standby_delay, or, let me do it.

Yes, I'll get with it ...

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-17 01:49:56
Message-ID:	AANLkTiktFWiB3WlFL1Ef6mBEqvmgsrtETeDcxOlZENpg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Thu, 2010-06-03 at 19:02 -0400, Tom Lane wrote:
>>> I decided there wasn't time to get anything useful done on it before the
>>> beta2 deadline (which is, more or less, right now). I will take another
>>> look over the next few days.
>
>> We all really need you to fix up max_standby_delay, or, let me do it.
>
> Yes, I'll get with it ...

Tom,

Any update on this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-17 01:56:00
Message-ID:	19863.1276739760@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yes, I'll get with it ...

> Any update on this?

Sorry, I've been a bit distracted by other responsibilities (libtiff
security issues for Red Hat, if you must know). I'll get on it shortly.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-17 02:03:36
Message-ID:	AANLkTin_EsLSPIAfAkrkbDFGqEaUsbPWzb2KzqUiiCfl@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Yes, I'll get with it ...
>
>> Any update on this?
>
> Sorry, I've been a bit distracted by other responsibilities (libtiff
> security issues for Red Hat, if you must know). I'll get on it shortly.

What? You have other things to do besides hack on PostgreSQL? Shocking! :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-21 04:20:33
Message-ID:	4C1EE891.1000600@cheapcomplexdevices.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas wrote:
> On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.
>
> What? You have other things to do besides hack on PostgreSQL? Shocking! :-)

I suspect you're kidding, but in case some on the list didn't realize,
Tom's probably as famous (if not moreso) in the image compression
community as he is in the database community:

http://www.jpeg.org/jpeg/index.html
"Probably the largest and most important contribution however was the work
of the Independent JPEG Group (IJG), and Tom Lane in particular."

http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
"PNG (Portable Network Graphics) Specification
Version 1.0
...
Contributing Editor
Tom Lane, tgl(at)sss(dot)pgh(dot)pa(dot)us"

http://www.fileformat.info/format/tiff/egff.htm
"... by Dr. Tom Lane of the Independent JPEG Group, a member of the
TIFF Advisory Committee"

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-21 12:53:06
Message-ID:	AANLkTimnT0D9iL1wMpltcN-bJH0fTnlPDwvNJ7D1lqw7@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 21, 2010 at 12:20 AM, Ron Mayer
<rm_pg(at)cheapcomplexdevices(dot)com> wrote:
> Robert Haas wrote:
>> On Wed, Jun 16, 2010 at 9:56 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>>> security issues for Red Hat, if you must know). I'll get on it shortly.
>>
>> What? You have other things to do besides hack on PostgreSQL? Shocking! :-)
>
> I suspect you're kidding, but in case some on the list didn't realize,
> Tom's probably as famous (if not moreso) in the image compression
> community as he is in the database community:
>
> http://www.jpeg.org/jpeg/index.html
> "Probably the largest and most important contribution however was the work
> of the Independent JPEG Group (IJG), and Tom Lane in particular."
>
> http://www.w3.org/TR/PNG-Credits.html , http://www.w3.org/TR/PNG/
> "PNG (Portable Network Graphics) Specification
> Version 1.0
> ...
> Contributing Editor
> Tom Lane, tgl(at)sss(dot)pgh(dot)pa(dot)us"
>
> http://www.fileformat.info/format/tiff/egff.htm
> "... by Dr. Tom Lane of the Independent JPEG Group, a member of the
> TIFF Advisory Committee"

Yes, I was joking, hence the smiley. I did know he was involved in
the above, although I confess I didn't know to what degree... or that
he had a doctorate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-28 07:17:48
Message-ID:	1277709468.25074.63629.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Yes, I'll get with it ...
>
> > Any update on this?
>
> Sorry, I've been a bit distracted by other responsibilities (libtiff
> security issues for Red Hat, if you must know). I'll get on it shortly.

I don't think the PostgreSQL project should wait any longer on this. If
it does we risk loss of quality in final release, assuming no slippage.

>From here, I will rework my patch of 31 May to
* use arrival time on standby as base for max_standby_delay
* make delay apply to both streaming and file cases
* min_standby_grace_period - min grace on every query, default 0

Decision time, so thoughts please?

--
Simon Riggs www.2ndQuadrant.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-28 14:03:02
Message-ID:	AANLkTikg1F9EkmutcBZmcwH_TPntcugvvuG9VqfzGdve@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 28, 2010 at 3:17 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> > On Wed, Jun 9, 2010 at 8:01 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >> Yes, I'll get with it ...
>>
>> > Any update on this?
>>
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.
>
> I don't think the PostgreSQL project should wait any longer on this. If
> it does we risk loss of quality in final release, assuming no slippage.

I agree, and had actually been intending to post a similar message in
the next day or two. We've been waiting for this for nearly a month.

> >From here, I will rework my patch of 31 May to
> * use arrival time on standby as base for max_standby_delay

Assuming that by this you mean, in the case of SR, the time of receipt
of the current WAL chunk, and in the case of the archive, the time of
its acquisition from the archive, +1.

> * make delay apply to both streaming and file cases

+1.

> * min_standby_grace_period - min grace on every query, default 0

I could go either way on this. +0.5, I guess.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-28 14:19:09
Message-ID:	26936.1277734749@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>> security issues for Red Hat, if you must know). I'll get on it shortly.

> I don't think the PostgreSQL project should wait any longer on this. If
> it does we risk loss of quality in final release, assuming no slippage.

It will get done. It is not the very first thing on my to-do list.

regards, tom lane

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-28 17:09:40
Message-ID:	4C28D754.8010005@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> It will get done. It is not the very first thing on my to-do list.

??? What is then?

If it's not the first thing on your priority list, with 9.0 getting
later by the day, maybe we should leave it to Robert and Simon, who *do*
seem to have it first on *their* list?

I swear, when Simon was keeping his branch to himself in August everyone
was on his case. It sure seems like Tom is doing exactly the same thing.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-28 17:24:32
Message-ID:	AANLkTikSHAf-J2IAY3bpcXqS0hw2mAARyvQrIcihIX87@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 28, 2010 at 10:19 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Wed, 2010-06-16 at 21:56 -0400, Tom Lane wrote:
>>> Sorry, I've been a bit distracted by other responsibilities (libtiff
>>> security issues for Red Hat, if you must know). I'll get on it shortly.
>
>> I don't think the PostgreSQL project should wait any longer on this. If
>> it does we risk loss of quality in final release, assuming no slippage.
>
> It will get done. It is not the very first thing on my to-do list.

That's apparent. On June 9th, you wrote "Yes, I'll get with it ...";
on June 16th, you wrote "I'll get on it shortly." Two weeks later
you're backing off from "shortly" to "eventually". It is unfair to
the community to assert a vigorous opinion of how something should be
handled in the code and then refuse to commit to a time frame for
providing a patch. It is even more unreasonable to commit to
providing a timely patch (twice) and then fail to do so. We are
trying to finalize a release here, and you've made it clear you think
this code needs revision before then. I respect your opinion, but not
your right to make the project release timetable dependent on your own
schedule, and not your right to shut other people out of working on
the issues you've raised.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-28 18:26:19
Message-ID:	1347.1277749579@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> ... It is even more unreasonable to commit to
> providing a timely patch (twice) and then fail to do so. We are
> trying to finalize a release here, and you've made it clear you think
> this code needs revision before then. I respect your opinion, but not
> your right to make the project release timetable dependent on your own
> schedule, and not your right to shut other people out of working on
> the issues you've raised.

Since nobody has put forward a proposed beta3 release date, I don't feel
that I'm holding anything up. In the meantime, I have many
responsibilities and am facing Red Hat internal deadlines.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-28 18:55:59
Message-ID:	AANLkTimLmGUgwGDIx3jmnaQXc9FJSa8iNn90JiczgZbx@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jun 28, 2010 at 2:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> ... It is even more unreasonable to commit to
>> providing a timely patch (twice) and then fail to do so. We are
>> trying to finalize a release here, and you've made it clear you think
>> this code needs revision before then. I respect your opinion, but not
>> your right to make the project release timetable dependent on your own
>> schedule, and not your right to shut other people out of working on
>> the issues you've raised.
>
> Since nobody has put forward a proposed beta3 release date, I don't feel
> that I'm holding anything up. In the meantime, I have many
> responsibilities and am facing Red Hat internal deadlines.

See here, last paragraph:

http://archives.postgresql.org/pgsql-hackers/2010-06/msg01093.php

On a related note, this list is getting pretty darn short:

http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

...partly because, in my desire to get another beta out, I have been
devoting a lot of time to clearing it out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-30 19:08:16
Message-ID:	1277924896.4151.274.camel@ebony
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
> > It will get done. It is not the very first thing on my to-do list.
>
> ??? What is then?
>
> If it's not the first thing on your priority list, with 9.0 getting
> later by the day, maybe we should leave it to Robert and Simon, who *do*
> seem to have it first on *their* list?
>
> I swear, when Simon was keeping his branch to himself in August everyone
> was on his case. It sure seems like Tom is doing exactly the same thing.

Hmmm, yes, looks that way. At that time I was actively working on the
code, not just locking it to prevent other activity.

The only urgency on my part here was to fulfil my responsibility to the
project.

--
Simon Riggs www.2ndQuadrant.com

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-06-30 21:57:04
Message-ID:	201006302157.o5ULv5h11744@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Mon, 2010-06-28 at 10:09 -0700, Josh Berkus wrote:
> > > It will get done. It is not the very first thing on my to-do list.
> >
> > ??? What is then?
> >
> > If it's not the first thing on your priority list, with 9.0 getting
> > later by the day, maybe we should leave it to Robert and Simon, who *do*
> > seem to have it first on *their* list?
> >
> > I swear, when Simon was keeping his branch to himself in August everyone
> > was on his case. It sure seems like Tom is doing exactly the same thing.
>
> Hmmm, yes, looks that way. At that time I was actively working on the
> code, not just locking it to prevent other activity.
>
> The only urgency on my part here was to fulfil my responsibility to the
> project.

Simon, you have a very legitimate concern. I phoned Tom and he is
planning to start working on the max_standby_delay tomorrow. I am
unclear how it is different from your version, but I hope once Tom is
done we can review his work and decide how to proceed. The fact that we
allowed Tom this huge amount of time to submit an alternative patch is
unusual and hopefully rare.

FYI, Tom and I are hoping to work through all the outstanding issues
before we package up 9.0 beta3 on Thursday, July 8.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ None of us is going to be here forever. +

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-02 20:11:35
Message-ID:	26561.1278101495@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

[ Apologies for the very slow turnaround on this --- I got hit with
another batch of non-postgres security issues this week. ]

Attached is a draft patch for revising the max_standby_delay behavior into
something I think is a bit saner. There is some unfinished business:

* I haven't touched the documentation yet

* The code in xlog.c needs to be reverted to its former behavior so that
recoveryLastXTime means what it's supposed to mean, ie just the last
commit/abort timestamp.

However neither of these affects the testability of the patch.

Basically the way that it works is that the standby delay is computed with
reference to XLogReceiptTime rather than any timestamp obtained from WAL.
XLogReceiptTime is set to current time whenever we obtain a WAL segment
from the archives or when we begin processing a fresh batch of WAL from
walreceiver. There's a subtlety in the streaming case: we don't advance
XLogReceiptTime if we are not caught up, that is if the startup process
is more than one flush cycle behind walreceiver. In the normal case
we'll advance XLogReceiptTime on each flush cycle, but once we start
falling behind, it doesn't move so the grace time alloted to conflicting
queries begins to decrease.

I split max_standby_delay into two GUC variables, as previously
threatened: max_standby_archive_delay and max_standby_streaming_delay.
The former applies when processing data read from archive, the latter
when processing data received from walreceiver. I think this is really
quite important given the new behavior, because max_standby_archive_delay
ought to be set with reference to the expected time to process one WAL
segment, whereas max_standby_streaming_delay doesn't depend on that value
at all. I'm not sure what good defaults are for these values, so I left
them at 30 seconds for the moment. I'm inclined to think
max_standby_archive_delay ought to be quite a bit less though.

It might be worth adding a minimum-grace-time limit as we previously
discussed, but this patch doesn't attempt to do that.

Comments?

regards, tom lane

Attachment	Content-Type	Size
maxstandbydelay-1.patch	text/x-patch	22.5 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-02 20:20:46
Message-ID:	AANLkTinlTL_dQogSzNWaf72-ovl3d8PeWE8ZRwrErxTw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jul 2, 2010 at 4:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> [ Apologies for the very slow turnaround on this --- I got hit with
> another batch of non-postgres security issues this week. ]
>
> Attached is a draft patch for revising the max_standby_delay behavior into
> something I think is a bit saner. There is some unfinished business:
>
> * I haven't touched the documentation yet
>
> * The code in xlog.c needs to be reverted to its former behavior so that
> recoveryLastXTime means what it's supposed to mean, ie just the last
> commit/abort timestamp.
>
> However neither of these affects the testability of the patch.
>
> Basically the way that it works is that the standby delay is computed with
> reference to XLogReceiptTime rather than any timestamp obtained from WAL.
> XLogReceiptTime is set to current time whenever we obtain a WAL segment
> from the archives or when we begin processing a fresh batch of WAL from
> walreceiver. There's a subtlety in the streaming case: we don't advance
> XLogReceiptTime if we are not caught up, that is if the startup process
> is more than one flush cycle behind walreceiver. In the normal case
> we'll advance XLogReceiptTime on each flush cycle, but once we start
> falling behind, it doesn't move so the grace time alloted to conflicting
> queries begins to decrease.
>
> I split max_standby_delay into two GUC variables, as previously
> threatened: max_standby_archive_delay and max_standby_streaming_delay.
> The former applies when processing data read from archive, the latter
> when processing data received from walreceiver. I think this is really
> quite important given the new behavior, because max_standby_archive_delay
> ought to be set with reference to the expected time to process one WAL
> segment, whereas max_standby_streaming_delay doesn't depend on that value
> at all. I'm not sure what good defaults are for these values, so I left
> them at 30 seconds for the moment. I'm inclined to think
> max_standby_archive_delay ought to be quite a bit less though.
>
> It might be worth adding a minimum-grace-time limit as we previously
> discussed, but this patch doesn't attempt to do that.
>
> Comments?

I haven't been able to wrap my head around why the delay should be
LESS in the archive case than in the streaming case. Can you attempt
to hit me with the clue-by-four?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-02 20:36:22
Message-ID:	26941.1278102982@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I haven't been able to wrap my head around why the delay should be
> LESS in the archive case than in the streaming case. Can you attempt
> to hit me with the clue-by-four?

In the archive case, you're presumably trying to catch up, and so it
makes sense to kill queries faster so you can catch up. The existing
code essentially forces instant kill when reading from archive, for any
reasonable value of max_standby_delay (because the archived timestamps
will probably be older than that). That's overenthusiastic in my view,
but you can get that behavior if you want it with this patch by setting
max_standby_archive_delay to zero. If you don't want it, you can use
something larger. If you don't think that max_standby_archive_delay
should be less than max_standby_streaming_delay, you can set them the
same, too.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-02 20:40:58
Message-ID:	AANLkTin3_hE8TVO71GzIVBVISaelJNdiS6EJmHaTzlKS@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I haven't been able to wrap my head around why the delay should be
>> LESS in the archive case than in the streaming case. Can you attempt
>> to hit me with the clue-by-four?
>
> In the archive case, you're presumably trying to catch up, and so it
> makes sense to kill queries faster so you can catch up.

On the flip side, the timeout for the WAL segment is for 16MB of WAL,
whereas the timeout for SR is normally going to be for a much smaller
chunk (right?). So even with the same value for both, it seems like
queries will be killed more aggressively during archive recovery.

Even so, it seems useful to have both.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-02 20:52:08
Message-ID:	27228.1278103928@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> In the archive case, you're presumably trying to catch up, and so it
>> makes sense to kill queries faster so you can catch up.

> On the flip side, the timeout for the WAL segment is for 16MB of WAL,
> whereas the timeout for SR is normally going to be for a much smaller
> chunk (right?). So even with the same value for both, it seems like
> queries will be killed more aggressively during archive recovery.

True, but I suspect useful settings will be significantly larger than
those times anyway, so it kind of comes out in the wash. For
walreceiver the expected time to process each new chunk is less than
wal_sender_delay (if it's not, you better get a faster slave machine).
For the archive case, it probably takes more than 200ms to process a
16MB segment, but still only a few seconds. So if you have both the
max-delay parameters set to 30 seconds, the minimum "normal" grace
periods are 29.8 sec vs maybe 25 sec, not all that different.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-02 21:01:15
Message-ID:	AANLkTilNSI5N-3OE2y7IoNL_1QdoAI0Ebix0VM-dUfuM@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jul 2, 2010 at 4:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Fri, Jul 2, 2010 at 4:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> In the archive case, you're presumably trying to catch up, and so it
>>> makes sense to kill queries faster so you can catch up.
>
>> On the flip side, the timeout for the WAL segment is for 16MB of WAL,
>> whereas the timeout for SR is normally going to be for a much smaller
>> chunk (right?). So even with the same value for both, it seems like
>> queries will be killed more aggressively during archive recovery.
>
> True, but I suspect useful settings will be significantly larger than
> those times anyway, so it kind of comes out in the wash. For
> walreceiver the expected time to process each new chunk is less than
> wal_sender_delay (if it's not, you better get a faster slave machine).
> For the archive case, it probably takes more than 200ms to process a
> 16MB segment, but still only a few seconds. So if you have both the
> max-delay parameters set to 30 seconds, the minimum "normal" grace
> periods are 29.8 sec vs maybe 25 sec, not all that different.

I guess I was thinking more in terms of conflicts-per-WAL-record than
actual replay time. If we assume that ever 100th WAL record produces
a conflict, SR will pause every once in a while, and then catch up
(either because it canceled some queries or because they finished
within the allowable grace period), whereas archive recovery will be
patient for a bit and then start nuking 'em from orbit. Maybe my
mental model is all wrong though.

Anyway, I don't object to having two separate settings, and maybe
we'll know more about how to tune them after this gets out in the
field.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-03 11:59:10
Message-ID:	4C2F260E.10100@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 02/07/10 23:36, Tom Lane wrote:
> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>> I haven't been able to wrap my head around why the delay should be
>> LESS in the archive case than in the streaming case. Can you attempt
>> to hit me with the clue-by-four?
>
> In the archive case, you're presumably trying to catch up, and so it
> makes sense to kill queries faster so you can catch up. The existing
> code essentially forces instant kill when reading from archive, for any
> reasonable value of max_standby_delay (because the archived timestamps
> will probably be older than that). That's overenthusiastic in my view,
> but you can get that behavior if you want it with this patch by setting
> max_standby_archive_delay to zero. If you don't want it, you can use
> something larger. If you don't think that max_standby_archive_delay
> should be less than max_standby_streaming_delay, you can set them the
> same, too.

It would seem logical to use the same logic for archive recovery as we
do for streaming replication, and only set XLogReceiptTime when you have
to wait for a WAL segment to arrive into the archive, ie. when
restore_command fails.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-03 15:32:09
Message-ID:	11126.1278171129@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> It would seem logical to use the same logic for archive recovery as we
> do for streaming replication, and only set XLogReceiptTime when you have
> to wait for a WAL segment to arrive into the archive, ie. when
> restore_command fails.

That would not do what you want at all in the case where you're
recovering from archive --- XLogReceiptTime would never advance
at all for the duration of the recovery.

It might be useful if you knew that it was a standby-with-log-shipping
situation, but we have no way to tell the difference.

The restore_command might know the difference, though. Is it
worth modifying its API somehow so that the command could tell us
whether to advance XLogReceiptTime? How would we do that?

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-03 19:17:48
Message-ID:	4C2F8CDC.5040702@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/07/10 18:32, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> It would seem logical to use the same logic for archive recovery as we
>> do for streaming replication, and only set XLogReceiptTime when you have
>> to wait for a WAL segment to arrive into the archive, ie. when
>> restore_command fails.
>
> That would not do what you want at all in the case where you're
> recovering from archive --- XLogReceiptTime would never advance
> at all for the duration of the recovery.

Do you mean when using something like pg_standby, which does the waiting
itself?

> It might be useful if you knew that it was a standby-with-log-shipping
> situation, but we have no way to tell the difference.

With pg_standby etc. you use standby_mode=off. Same with traditional
archive recovery. In standby mode, it's on.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Keepalive for max_standby_delay
Date:	2010-07-03 20:18:11
Message-ID:	14694.1278188291@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 03/07/10 18:32, Tom Lane wrote:
>> That would not do what you want at all in the case where you're
>> recovering from archive --- XLogReceiptTime would never advance
>> at all for the duration of the recovery.

> Do you mean when using something like pg_standby, which does the waiting
> itself?

No, I'm thinking about recovery starting from an old base backup, or any
situation where you're trying to process a significant number of
archived WAL segments as quickly as possible.

>> It might be useful if you knew that it was a standby-with-log-shipping
>> situation, but we have no way to tell the difference.

> With pg_standby etc. you use standby_mode=off. Same with traditional
> archive recovery. In standby mode, it's on.

Uh, no, that parameter is not what I'm talking about. What I tried to
say is that if you're using log shipping for replication instead of
walsender/walreceiver, you might want to treat data as live even though
the database thinks it is being pulled "from archive". But we'd need
a way for the database to tell the cases apart ... right now it cannot.

regards, tom lane