Re: Hot standby, recovery infra

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot standby, recovery infra
Date: 2009-01-28 10:04:57
Message-ID: 49802DC9.6000406@enterprisedb.com

I've been reviewing and massaging the so-called recovery infra patch.

To recap, the goal is to:
- start background writer during (archive) recovery
- skip the shutdown checkpoint at the end of recovery. Instead, the
database is brought up immediately, and the bgwriter performs a normal
online checkpoint, while we're already accepting connections.
- keep track of when we reach a consistent point in the recovery, where
we could let read-only backends in. This is obviously required for hot
standby.

The 1st and 2nd points provide some useful functionality, even without
the rest of the hot standby patch.

I've refactored the patch quite heavily, making it a lot simpler, and
over 1/3 smaller than before:

The signaling between the bgwriter and startup process during recovery
was quite complicated. The startup process periodically sent checkpoint
records to the bgwriter, so that bgwriter could perform restart points.
I've replaced that by storing the last seen checkpoint in a shared
memory in xlog.c. CreateRestartPoint() picks it up from there. This
means that bgwriter can decide autonomously when to perform a restart
point, it no longer needs to be told to do so by the startup process.
Which is nice in a standby. What could happen before is that the standby
processes a checkpoint record, and decides not to make it a restartpoint
because not enough time has passed since the last one. If we then get a long
idle period after that, we'd really want to make the previous checkpoint
record a restart point after all, after some time has passed. That is
what will happen now, which is a usability enhancement, although the
real motivation for this refactoring was to make the code simpler.

The bgwriter is now always responsible for all checkpoints and
restartpoints. (well, except for a stand-alone backend). Which makes it
easier to understand what's going on, IMHO.

There was one pretty fundamental bug in the minsafestartpoint handling:
it was always set when a WAL file was opened for reading. Which means it
was also moved backwards when recovery began by reading the WAL
segment containing the last restartpoint/checkpoint, rendering it useless
for the purpose it was designed for. Fortunately that was easy to fix.
Another tiny
bug was that log_restartpoints was not respected, because it was stored
in a variable in startup process' memory, and wasn't seen by bgwriter.

One aspect that troubles me a bit is the changes in XLogFlush. I guess
we no longer have the problem that you can't start up the database if
we've read in a corrupted page from disk, because we now start up before
checkpointing. However, it does mean that if a corrupt page is read into
shared buffers, we'll never be able to checkpoint. But then again, I
guess that's already true without this patch.

I feel quite good about this patch now. Given the amount of code churn,
it requires testing, and I'll read it through one more time after
sleeping over it. Simon, do you see anything wrong with this?

(this patch is also in my git repository at git.postgresql.org, branch
recoveryinfra.)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment: recovery-infra-dee8f65be.patch (text/x-diff, 44.0 KB)

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-28 12:42:34
Message-ID: 1233146554.2327.2343.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Wed, 2009-01-28 at 12:04 +0200, Heikki Linnakangas wrote:
> I've been reviewing and massaging the so called recovery infra patch.

Thanks.

> I feel quite good about this patch now. Given the amount of code
> churn, it requires testing, and I'll read it through one more time
> after sleeping over it.

There's nothing major I feel we should discuss.

The way restartpoints happen is a useful improvement, thanks.

> Simon, do you see anything wrong with this?

Few minor points

* I think we are now renaming the recovery.conf file too early. The
comment says "We have already restored all the WAL segments we need from
the archive, and we trust that they are not going to go away even if we
crash." We have, but the files overwrite each other as they arrive, so
if the last restartpoint is not in the last restored WAL file then it
will only exist in the archive. The recovery.conf is the only place
where we store the information on where the archive is and how to access
it, so by renaming it out of the way we will be unable to crash recover
until the first checkpoint is complete. So the way this was in the
original patch is the correct way to go, AFAICS.

* My original intention was to deprecate log_restartpoints, and I would
still like to do so. log_checkpoints does just as well for that. Even
less code than before...

* comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version
because it matches the IS_SHUTDOWN naming.

* In CreateCheckpoint() the if test on TruncateSubtrans() has been
removed, but the comment has not been changed (to explain why).

* PG_CONTROL_VERSION bump should be just one increment, to 844. I
deliberately had it higher to help spot mismatches earlier, and to avoid
needless patch conflicts.

So it looks pretty much ready for commit very soon.

We should continue to measure performance of recovery in the light of
these changes. I still feel that fsyncing the control file on each
XLogFileRead() will give a noticeable performance penalty, mostly
because we know doing exactly the same thing in normal running caused a
performance penalty. But that is easily changed and cannot be done with
any certainty without wider feedback, so no reason to delay code commit.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-28 14:19:50
Message-ID: 3f0b79eb0901280619y6ced2f48s7da3ab8969adaedb@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I've been reviewing and massaging the so called recovery infra patch.

Great!

> I feel quite good about this patch now. Given the amount of code churn, it
> requires testing, and I'll read it through one more time after sleeping over
> it. Simon, do you see anything wrong with this?

I also read this patch and found something odd. I apologize if I misread it..

> @@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
> pg_time_t now;
> pg_time_t last_time;
>
> - if (XLogArchiveTimeout <= 0)
> + if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())

The above change destroys archive_timeout because checking the timeout
is always skipped after recovery is done.

> @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
> */
> PG_SETMASK(&UnBlockSig);
>
> + BgWriterRecoveryMode = IsRecoveryProcessingMode();
> +
> + if (BgWriterRecoveryMode)
> + elog(DEBUG1, "bgwriter starting during recovery");
> + else
> + InitXLOGAccess();

Why is InitXLOGAccess() called also here when bgwriter is started after
recovery? That is already called by AuxiliaryProcessMain().

> @@ -1302,7 +1314,7 @@ ServerLoop(void)
> * state that prevents it, start one. It doesn't matter if this
> * fails, we'll just try again later.
> */
> - if (BgWriterPID == 0 && pmState == PM_RUN)
> + if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
> BgWriterPID = StartBackgroundWriter();

Likewise, should we also try to start the stats collector during recovery?

> @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
> unlink(tmppath);
> }
>
> - elog(DEBUG2, "done creating and filling new WAL file");
> + XLogFileName(tmppath, ThisTimeLineID, log, seg);
> + elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

This debug message is somewhat confusing, because the WAL file
represented as "tmppath" might have been already created by
previous XLogFileInit() via InstallXLogFileSegment().

regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-28 14:47:48
Message-ID: 1233154068.2327.2385.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:

> > @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
> > */
> > PG_SETMASK(&UnBlockSig);
> >
> > + BgWriterRecoveryMode = IsRecoveryProcessingMode();
> > +
> > + if (BgWriterRecoveryMode)
> > + elog(DEBUG1, "bgwriter starting during recovery");
> > + else
> > + InitXLOGAccess();
>
> Why is InitXLOGAccess() called also here when bgwriter is started after
> recovery? That is already called by AuxiliaryProcessMain().

InitXLOGAccess() sets the timeline and also gets the latest record
pointer. If the bgwriter is started in recovery these values need to be
reset later. It's easier to call it twice.

> > @@ -1302,7 +1314,7 @@ ServerLoop(void)
> > * state that prevents it, start one. It doesn't matter if this
> > * fails, we'll just try again later.
> > */
> > - if (BgWriterPID == 0 && pmState == PM_RUN)
> > + if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
> > BgWriterPID = StartBackgroundWriter();
>
> Likewise, we should try to start also the stats collector during recovery?

We did in the previous patch...

> > @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
> > unlink(tmppath);
> > }
> >
> > - elog(DEBUG2, "done creating and filling new WAL file");
> > + XLogFileName(tmppath, ThisTimeLineID, log, seg);
> > + elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
>
> This debug message is somewhat confusing, because the WAL file
> represented as "tmppath" might have been already created by
> previous XLogFileInit() via InstallXLogFileSegment().

I think those are just for debugging and can be removed.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-28 14:54:53
Message-ID: 3f0b79eb0901280654y572398b3s293c2883bf908cb@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, Jan 28, 2009 at 11:47 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> On Wed, 2009-01-28 at 23:19 +0900, Fujii Masao wrote:
>
>> > @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
>> > */
>> > PG_SETMASK(&UnBlockSig);
>> >
>> > + BgWriterRecoveryMode = IsRecoveryProcessingMode();
>> > +
>> > + if (BgWriterRecoveryMode)
>> > + elog(DEBUG1, "bgwriter starting during recovery");
>> > + else
>> > + InitXLOGAccess();
>>
>> Why is InitXLOGAccess() called also here when bgwriter is started after
>> recovery? That is already called by AuxiliaryProcessMain().
>
> InitXLOGAccess() sets the timeline and also gets the latest record
> pointer. If the bgwriter is started in recovery these values need to be
> reset later. It's easier to call it twice.

Right. But the InitXLOGAccess() call in the main loop is enough for
that purpose.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-28 15:25:49
Message-ID: 1233156349.2327.2418.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Wed, 2009-01-28 at 23:54 +0900, Fujii Masao wrote:
> >> Why is InitXLOGAccess() called also here when bgwriter is started after
> >> recovery? That is already called by AuxiliaryProcessMain().
> >
> > InitXLOGAccess() sets the timeline and also gets the latest record
> > pointer. If the bgwriter is started in recovery these values need to be
> > reset later. It's easier to call it twice.
>
> Right. But, InitXLOGAccess() called during main loop is enough for
> that purpose.

I think the code is clearer the way it is. Otherwise you'd read
AuxiliaryProcessMain() and think the bgwriter didn't need xlog access.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-28 15:38:40
Message-ID: 49807C00.5020308@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Wed, Jan 28, 2009 at 7:04 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I feel quite good about this patch now. Given the amount of code churn, it
>> requires testing, and I'll read it through one more time after sleeping over
>> it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd.

Many thanks for looking into this!

>> @@ -507,7 +550,7 @@ CheckArchiveTimeout(void)
>> pg_time_t now;
>> pg_time_t last_time;
>>
>> - if (XLogArchiveTimeout <= 0)
>> + if (XLogArchiveTimeout <= 0 || !IsRecoveryProcessingMode())
>
> The above change destroys archive_timeout because checking the timeout
> is always skipped after recovery is done.

Yep, good catch. That obviously needs to be
"IsRecoveryProcessingMode()", without the exclamation mark.

>> @@ -355,6 +359,27 @@ BackgroundWriterMain(void)
>> */
>> PG_SETMASK(&UnBlockSig);
>>
>> + BgWriterRecoveryMode = IsRecoveryProcessingMode();
>> +
>> + if (BgWriterRecoveryMode)
>> + elog(DEBUG1, "bgwriter starting during recovery");
>> + else
>> + InitXLOGAccess();
>
> Why is InitXLOGAccess() called also here when bgwriter is started after
> recovery? That is already called by AuxiliaryProcessMain().

Oh, I didn't realize that. Agreed, it's a bit disconcerting that
InitXLOGAccess() is called twice (there was a 2nd call within the loop
in the original patch as well). Looking at InitXLOGAccess, it's harmless
to call it multiple times, but it seems better to remove the
InitXLOGAccess call from AuxiliaryProcessMain().

InitXLOGAccess() needs to be called after seeing that
IsRecoveryProcessingMode() == false, because it won't get the right
timeline id before that.

>> @@ -1302,7 +1314,7 @@ ServerLoop(void)
>> * state that prevents it, start one. It doesn't matter if this
>> * fails, we'll just try again later.
>> */
>> - if (BgWriterPID == 0 && pmState == PM_RUN)
>> + if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
>> BgWriterPID = StartBackgroundWriter();
>
> Likewise, we should try to start also the stats collector during recovery?

Hmm, there's not much point as this patch stands, but I guess we should
in the hot standby patch, where we let backends in.

>> @@ -2103,7 +2148,8 @@ XLogFileInit(uint32 log, uint32 seg,
>> unlink(tmppath);
>> }
>>
>> - elog(DEBUG2, "done creating and filling new WAL file");
>> + XLogFileName(tmppath, ThisTimeLineID, log, seg);
>> + elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
>
> This debug message is somewhat confusing, because the WAL file
> represented as "tmppath" might have been already created by
> previous XLogFileInit() via InstallXLogFileSegment().

I don't quite understand what you're saying, but I think I'll just
revert that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 01:36:14
Message-ID: 3f0b79eb0901281736v16dc15c9q37403808fa6cf67e@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> I feel quite good about this patch now. Given the amount of code churn, it
>> requires testing, and I'll read it through one more time after sleeping over
>> it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd. I apologize if I misread it..

If archive recovery fails after it reaches the last valid record
in the last unfilled WAL segment, subsequent recovery might cause
the following fatal error. This is because minSafeStartPoint indicates
the end of the last unfilled WAL segment, which subsequent recovery
cannot reach. Is this a bug? (I'm not sure how to fix this problem
because I don't understand yet why minSafeStartPoint is required.)

> FATAL: WAL ends before end time of backup dump

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 03:18:12
Message-ID: 3f0b79eb0901281918s299d3d1cwa335f847bb99bed5@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> I feel quite good about this patch now. Given the amount of code churn, it
>> requires testing, and I'll read it through one more time after sleeping over
>> it. Simon, do you see anything wrong with this?
>
> I also read this patch and found something odd. I apologize if I misread it..

Sorry for my random reply.

Though this is a matter of taste, I think that it's weird that bgwriter
runs with ThisTimeLineID = 0 during recovery. This is because
XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
be a cause of bugs in the near future, though this is harmless currently.

> + /*
> + * XXX: Should we try to perform restartpoints if we're not in consistent
> + * recovery? The bgwriter isn't doing it for us in that case.
> + */

I think yes. This is helpful for systems where it takes a long time to
get a base backup, i.e. where it also takes a long time to reach a
consistent recovery point.

> +CreateRestartPoint(int flags)
<snip>
> + * We rely on this lock to ensure that the startup process doesn't exit
> + * Recovery while we are half way through a restartpoint. XXX ?
> */
> + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

Is this comment correct? CheckpointLock cannot prevent the startup process
from exiting recovery because the startup process doesn't acquire that lock.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 06:59:28
Message-ID: 1233212368.2327.2580.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:
> Hi,
>
> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >> I feel quite good about this patch now. Given the amount of code churn, it
> >> requires testing, and I'll read it through one more time after sleeping over
> >> it. Simon, do you see anything wrong with this?
> >
> > I also read this patch and found something odd. I apologize if I misread it..
>
> Sorry for my random reply.
>
> Though this is a matter of taste, I think that it's weird that bgwriter
> runs with ThisTimeLineID = 0 during recovery. This is because
> XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
> be a cause of bug in the near future, though this is harmless currently.

It doesn't. That's exactly what InitXLOGAccess() was for.

> > + /*
> > + * XXX: Should we try to perform restartpoints if we're not in consistent
> > + * recovery? The bgwriter isn't doing it for us in that case.
> > + */
>
> I think yes. This is helpful for the system which it takes a long time to get
> a base backup, ie. it also takes a long time to reach a consistent recovery
> point.

The original patch did this.

> > +CreateRestartPoint(int flags)
> <snip>
> > + * We rely on this lock to ensure that the startup process doesn't exit
> > + * Recovery while we are half way through a restartpoint. XXX ?
> > */
> > + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
>
> Is this comment correct? CheckpointLock cannot prevent the startup process
> from exiting recovery because the startup process doesn't acquire that lock.

The original patch acquired CheckpointLock during exitRecovery to prove
that a restartpoint was not in progress. It no longer does this, so not
sure if Heikki has found another way and the comment is wrong, or that
removing the lock request is a bug.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 07:20:58
Message-ID: 1233213658.2327.2598.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
> Hi,
>
> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >> I feel quite good about this patch now. Given the amount of code churn, it
> >> requires testing, and I'll read it through one more time after sleeping over
> >> it. Simon, do you see anything wrong with this?
> >
> > I also read this patch and found something odd. I apologize if I misread it..
>
> If archive recovery fails after it reaches the last valid record
> in the last unfilled WAL segment, subsequent recovery might cause
> the following fatal error. This is because minSafeStartPoint indicates
> the end of the last unfilled WAL segment which subsequent recovery
> cannot reach. Is this bug? (I'm not sure how to fix this problem
> because I don't understand yet why minSafeStartPoint is required.)
>
> > FATAL: WAL ends before end time of backup dump

I think you're right. We need a couple of changes to avoid confusing
messages.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 07:34:48
Message-ID: 49815C18.5040609@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-01-29 at 12:18 +0900, Fujii Masao wrote:
>> Though this is a matter of taste, I think that it's weird that bgwriter
>> runs with ThisTimeLineID = 0 during recovery. This is because
>> XLogCtl->ThisTimeLineID is set at the end of recovery. ISTM this will
>> be a cause of bug in the near future, though this is harmless currently.
>
> It doesn't. That's exactly what InitXLOGAccess() was for.

It does *during recovery*, before InitXLOGAccess is called. Yeah, it's
harmless currently. It would be pretty hard to keep it up-to-date in
bgwriter and other processes. I think it's better to keep it at 0, which
is clearly an invalid value, than try to keep it up-to-date and risk
using an old value. TimeLineID is not used in a lot of places,
currently. It might be a good idea to add some "Assert(TimeLineID != 0)"
to places where it is used.

>>> + /*
>>> + * XXX: Should we try to perform restartpoints if we're not in consistent
>>> + * recovery? The bgwriter isn't doing it for us in that case.
>>> + */
>> I think yes. This is helpful for the system which it takes a long time to get
>> a base backup, ie. it also takes a long time to reach a consistent recovery
>> point.
>
> The original patch did this.

Yeah, I took it out. ISTM if you restore from a base backup, you'd want
to run it until consistent recovery anyway. We can put it back in if
people feel it's helpful. There's two ways to do it: we can fire up the
bgwriter before reaching consistent recovery point, or we can do the
restartpoints in startup process before that point. I'm inclined to fire
up the bgwriter earlier. That way, bgwriter remains responsible for all
checkpoints and restartpoints, which seems clearer. I can't see any
particular reason why bgwriter startup and reaching the consistent
recovery point need to be linked together.

>>> +CreateRestartPoint(int flags)
>> <snip>
>>> + * We rely on this lock to ensure that the startup process doesn't exit
>>> + * Recovery while we are half way through a restartpoint. XXX ?
>>> */
>>> + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
>> Is this comment correct? CheckpointLock cannot prevent the startup process
>> from exiting recovery because the startup process doesn't acquire that lock.
>
> The original patch acquired CheckpointLock during exitRecovery to prove
> that a restartpoint was not in progress. It no longer does this, so not
> sure if Heikki has found another way and the comment is wrong, or that
> removing the lock request is a bug.

The comment is obsolete. There's no reason for startup process to wait
for a restartpoint to finish. Looking back at the archives and the
history of that, I got the impression that in a very early version of
the patch, startup process still created a shutdown checkpoint after
recovery. To do that, it had to interrupt an ongoing restartpoint. That
was deemed too dangerous/weird, so it was changed to wait for it to
finish instead. But since the startup process no longer creates a
shutdown checkpoint, the waiting became obsolete, right?

If there's a restartpoint in progress when we reach the end of recovery,
startup process will RequestCheckpoint as usual. That will cause
bgwriter to hurry the on-going restartpoint, skipping the usual delays,
and start the checkpoint as soon as the restartpoint is finished.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 08:08:47
Message-ID: 1233216527.2327.2619.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 09:34 +0200, Heikki Linnakangas wrote:

> It does *during recovery*, before InitXLOGAccess is called. Yeah, it's
> harmless currently. It would be pretty hard to keep it up-to-date in
> bgwriter and other processes. I think it's better to keep it at 0,
> which is clearly an invalid value, than try to keep it up-to-date and
> risk using an old value. TimeLineID is not used in a lot of places,
> currently. It might be a good idea to add some "Assert(TimeLineID !=
> 0)" to places where it is used.

Agreed. TimeLineID is a normal-running concept used for writing WAL.
Perhaps we should even solidify the meaning of TimeLineID == 0 as
"cannot write WAL".

I see a problem there for any process that exists both before and after
recovery ends, which includes bgwriter. In that case we must not flush
WAL before recovery ends, yet afterwards we *must* flush WAL. To do that
we *must* have a valid TimeLineID set.

I would suggest we put InitXLOGAccess into IsRecoveryProcessingMode(),
so if the mode changes we immediately set everything we need to allow
XLogFlush() calls to work correctly.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 09:20:06
Message-ID: 498174C6.5000407@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
>> Hi,
>>
>> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> I feel quite good about this patch now. Given the amount of code churn, it
>>>> requires testing, and I'll read it through one more time after sleeping over
>>>> it. Simon, do you see anything wrong with this?
>>> I also read this patch and found something odd. I apologize if I misread it..
>> If archive recovery fails after it reaches the last valid record
>> in the last unfilled WAL segment, subsequent recovery might cause
>> the following fatal error. This is because minSafeStartPoint indicates
>> the end of the last unfilled WAL segment which subsequent recovery
>> cannot reach. Is this bug? (I'm not sure how to fix this problem
>> because I don't understand yet why minSafeStartPoint is required.)
>>
>>> FATAL: WAL ends before end time of backup dump
>
> I think you're right. We need a couple of changes to avoid confusing
> messages.

Hmm, we could update minSafeStartPoint in XLogFlush instead. That was
suggested when the idea of minSafeStartPoint was first thought of.
Updating minSafeStartPoint is analogous to flushing WAL:
minSafeStartPoint must be advanced to X before we can flush a data page
with LSN X. To avoid excessive controlfile updates, whenever we update
minSafeStartPoint, we can update it to the latest WAL record we've read.

Or we could simply ignore that error if we've reached minSafeStartPoint
- 1 segment, assuming that even though minSafeStartPoint is higher, we
can't have gone past the end of valid WAL records in the last segment in
previous recovery either. But that feels more fragile.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 10:11:56
Message-ID: 1233223916.4703.15.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 11:20 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-01-29 at 10:36 +0900, Fujii Masao wrote:
> >> Hi,
> >>
> >> On Wed, Jan 28, 2009 at 11:19 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >>>> I feel quite good about this patch now. Given the amount of code churn, it
> >>>> requires testing, and I'll read it through one more time after sleeping over
> >>>> it. Simon, do you see anything wrong with this?
> >>> I also read this patch and found something odd. I apologize if I misread it..
> >> If archive recovery fails after it reaches the last valid record
> >> in the last unfilled WAL segment, subsequent recovery might cause
> >> the following fatal error. This is because minSafeStartPoint indicates
> >> the end of the last unfilled WAL segment which subsequent recovery
> >> cannot reach. Is this bug? (I'm not sure how to fix this problem
> >> because I don't understand yet why minSafeStartPoint is required.)
> >>
> >>> FATAL: WAL ends before end time of backup dump
> >
> > I think you're right. We need a couple of changes to avoid confusing
> > messages.
>
> Hmm, we could update minSafeStartPoint in XLogFlush instead. That was
> suggested when the idea of minSafeStartPoint was first thought of.
> Updating minSafeStartPoint is analogous to flushing WAL:
> minSafeStartPoint must be advanced to X before we can flush a data page
> with LSN X. To avoid excessive controlfile updates, whenever we update
> minSafeStartPoint, we can update it to the latest WAL record we've read.
>
> Or we could simply ignore that error if we've reached minSafeStartPoint
> - 1 segment, assuming that even though minSafeStartPoint is higher, we
> can't have gone past the end of valid WAL records in the last segment in
> previous recovery either. But that feels more fragile.

My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
that we are still recovering up to the point of the end of the base
backup. Once we reach minSafeStartPoint we then switch state to
DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
enables writing new minSafeStartPoints when we open new WAL files in the
future.

We then have messages only when in DB_IN_ARCHIVE_RECOVERY_BASE state

if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint) &&
    ControlFile->state == DB_IN_ARCHIVE_RECOVERY_BASE)
{
    if (reachedStopPoint)   /* stopped because of stop request */
        ereport(FATAL,
                (errmsg("requested recovery stop point is before end time of backup dump")));
    else                    /* ran off end of WAL */
        ereport(FATAL,
                (errmsg("WAL ends before end time of backup dump")));
}

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 10:22:19
Message-ID: 4981835B.2080002@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
> another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
> that we are still recovering up to the point of the end of the base
> backup. Once we reach minSafeStartPoint we then switch state to
> DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
> enables writing new minSafeStartPoints when we open new WAL files in the
> future.

I don't see how that helps, the bug has nothing to do with base backups. It
comes from the fact that we set minSafeStartPoint beyond the actual end
of WAL, if the last WAL segment is only partially filled (= fails CRC
check at some point). If we crash after setting minSafeStartPoint like
that, and then restart recovery, we'll get the error.

The last WAL segment could be partially filled for example because the
DBA has manually copied the last unarchived WAL segments to pg_xlog, as
we recommend in the manual.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 12:21:11
Message-ID: 49819F37.8020306@enterprisedb.com
Lists: pgsql-hackers

It looks like if you issue a fast shutdown during recovery, postmaster
doesn't kill bgwriter.

...
LOG: restored log file "000000010000000000000028" from archive
LOG: restored log file "000000010000000000000029" from archive
LOG: consistent recovery state reached at 0/2900005C
...
LOG: restored log file "00000001000000000000002F" from archive
LOG: restored log file "000000010000000000000030" from archive
LOG: restored log file "000000010000000000000031" from archive
LOG: restored log file "000000010000000000000032" from archive
LOG: restored log file "000000010000000000000033" from archive
LOG: restartpoint starting: time
LOG: restored log file "000000010000000000000034" from archive
LOG: received fast shutdown request
LOG: restored log file "000000010000000000000035" from archive
FATAL: terminating connection due to administrator command
LOG: startup process (PID 14137) exited with exit code 1
LOG: aborting startup due to startup process failure
hlinnaka(at)heikkilaptop:~/pgsql$
hlinnaka(at)heikkilaptop:~/pgsql$ LOG: restartpoint complete: wrote 3324
buffers (5.1%); write=13.996 s, sync=2.016 s, total=16.960 s
LOG: recovery restart point at 0/3000FA14

Seems that reaper() needs to be taught that bgwriter can be active
during consistent recovery. I'll take a look at how to do that.

BTW, the message "terminating connection ..." is a bit misleading. It's
referring to the startup process, which is hardly a connection. We have
that in CVS HEAD too, so it's not something introduced by the patch, but
seems worth changing in HS, since we then let real connections in while
startup process is still running.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 12:54:42
Message-ID: 1233233682.3983.5.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > My proposed fix for Fujii-san's minSafeStartPoint bug is to introduce
> > another control file state DB_IN_ARCHIVE_RECOVERY_BASE. This would show
> > that we are still recovering up to the point of the end of the base
> > backup. Once we reach minSafeStartPoint we then switch state to
> > DB_IN_ARCHIVE_RECOVERY, and set baseBackupReached boolean, which then
> > enables writing new minSafeStartPoints when we open new WAL files in the
> > future.
>
> I don't see how that helps, the bug has nothing to do with base backups.

Sorry, disagree.

> It
> comes from the fact that we set minSafeStartPoint beyond the actual end
> of WAL, if the last WAL segment is only partially filled (= fails CRC
> check at some point). If we crash after setting minSafeStartPoint like
> that, and then restart recovery, we'll get the error.

Look again please. My proposal would avoid the error when it is not
relevant, yet keep it when it is (while recovering base backups).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 13:31:49
Message-ID: 4981AFC5.9080809@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-01-29 at 12:22 +0200, Heikki Linnakangas wrote:
>> It
>> comes from the fact that we set minSafeStartPoint beyond the actual end
>> of WAL, if the last WAL segment is only partially filled (= fails CRC
>> check at some point). If we crash after setting minSafeStartPoint like
>> that, and then restart recovery, we'll get the error.
>
> Look again please. My proposal would avoid the error when it is not
> relevant, yet keep it when it is (while recovering base backups).

I fail to see what base backups have to do with this. The problem arises
in this scenario:

0. A base backup is unzipped. recovery.conf is copied in place, and the
remaining unarchived WAL segments are copied from the primary server to
pg_xlog. The last WAL segment is only partially filled. Let's say that
the redo point is in WAL segment 1. The last, partial, WAL segment is 3, and
WAL ends at 0/3500000.
1. postmaster is started, recovery starts.
2. WAL segment 1 is restored from archive.
3. We reach consistent recovery point
4. We restore WAL segment 2 from archive. minSafeStartPoint is advanced
to 0/3000000
5. WAL segment 2 is completely replayed, we move on to WAL segment 3. It
is not in archive, but it's found in pg_xlog. minSafeStartPoint is
advanced to 0/4000000. Note that that's beyond end of WAL.
6. At replay of WAL record 0/3200000, the recovery is interrupted. For
example, by a fast shutdown request, or crash.

Now when we restart the recovery, we will never reach minSafeStartPoint,
which is now 0/4000000, and we'll fail with the error that Fujii-san
pointed out. We're already way past the min recovery point of base
backup by then.
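The six steps above can be replayed as a toy simulation with plain 64-bit LSNs; the variable names and segment boundaries here are this sketch's own, not the real xlog.c code.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if the restarted recovery would hit the spurious
 * FATAL described above. */
static int
restart_hits_spurious_fatal(void)
{
    uint64_t min_safe_start_point = 0;
    const uint64_t end_of_valid_wal = UINT64_C(0x3500000); /* partial seg 3 */

    /* Steps 4-5: opening each segment advances the pointer to the end
     * of that segment, so it overshoots the actual end of WAL. */
    min_safe_start_point = UINT64_C(0x3000000);  /* segment 2 restored */
    min_safe_start_point = UINT64_C(0x4000000);  /* partial segment 3 opened */

    /* Step 6 + restart: replay can reach at most end_of_valid_wal, so
     * the XLByteLT(EndOfLog, minSafeStartPoint) check fires even though
     * recovery is long past the end of the base backup. */
    return end_of_valid_wal < min_safe_start_point;
}
```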

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 14:10:51
Message-ID: 1233238251.3983.19.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:

> Now when we restart the recovery, we will never reach
> minSafeStartPoint, which is now 0/4000000, and we'll fail with the
> error that Fujii-san pointed out. We're already way past the min
> recovery point of base backup by then.

The problem was that we reported this error

FATAL: WAL ends before end time of backup dump

and this is inappropriate because, as you say, we are way past the min
recovery point of base backup.

If you look again at my proposal you will see that the proposal avoids
the above error by keeping track of whether we are past the point of
base backup or not. If we are still in base backup we get the error and
if we are past it we do not.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 14:33:55
Message-ID: 4981BE53.2060308@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:
>
>> Now when we restart the recovery, we will never reach
>> minSafeStartPoint, which is now 0/4000000, and we'll fail with the
>> error that Fujii-san pointed out. We're already way past the min
>> recovery point of base backup by then.
>
> The problem was that we reported this error
>
> FATAL: WAL ends before end time of backup dump
>
> and this is inappropriate because, as you say, we are way past the min
> recovery point of base backup.
>
> If you look again at my proposal you will see that the proposal avoids
> the above error by keeping track of whether we are past the point of
> base backup or not. If we are still in base backup we get the error and
> if we are past it we do not.

Oh, we would simply ignore the fact that we haven't reached the
minSafeStartPoint at the end of recovery, and start up anyway. Ok, that
would avoid the problem Fujii-san described. It's like my suggestion of
ignoring the message if we're at minSafeStartPoint - 1 segment, just
more lenient. I don't understand why you'd need a new control file
state, though.

You'd lose the extra protection minSafeStartPoint gives, though. For
example, if you interrupt recovery and move recovery point backwards, we
could refuse to start up when it's not safe to do so. It's currently a
"don't do that!" case, but we could protect against that with
minSafeStartPoint.
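A minimal sketch of that safeguard, with illustrative names rather than the actual xlog.c identifiers:

```c
#include <assert.h>
#include <stdint.h>

/* Refuse a recovery target earlier than what a previous recovery run
 * already replayed and flushed to disk (min_safe_start_point): data
 * pages on disk could carry LSNs from beyond the new target. */
static int
safe_to_stop_recovery_at(uint64_t target_lsn, uint64_t min_safe_start_point)
{
    return target_lsn >= min_safe_start_point;
}
```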

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 17:20:20
Message-ID: 4981E554.9030907@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> It looks like if you issue a fast shutdown during recovery, postmaster
> doesn't kill bgwriter.

Hmm, seems like we haven't thought through how shutdown during
consistent recovery is supposed to behave in general. Right now, smart
shutdown doesn't do anything during consistent recovery, because the
startup process will just keep going. And fast shutdown will simply
ExitPostmaster(1), which is clearly not right.

I'm thinking that in both smart and fast shutdown, the startup process
should exit in a controlled way as soon as it's finished with the
current WAL record, and set minSafeStartPoint to the current point in
the replay.

I wonder if bgwriter should perform a restartpoint before exiting?
You'll have to start with recovery on the next startup anyway, but at
least we could minimize the amount of WAL that needs to be replayed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-29 18:35:12
Message-ID: 4981F6E0.2040503@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Thu, 2009-01-29 at 15:31 +0200, Heikki Linnakangas wrote:
>>
>>> Now when we restart the recovery, we will never reach
>>> minSafeStartPoint, which is now 0/4000000, and we'll fail with the
>>> error that Fujii-san pointed out. We're already way past the min
>>> recovery point of base backup by then.
>>
>> The problem was that we reported this error
>>
>> FATAL: WAL ends before end time of backup dump
>>
>> and this is inappropriate because, as you say, we are way past the min
>> recovery point of base backup.
>>
>> If you look again at my proposal you will see that the proposal avoids
>> the above error by keeping track of whether we are past the point of
>> base backup or not. If we are still in base backup we get the error and
>> if we are past it we do not.
>
> Oh, we would simply ignore the fact that we haven't reached the
> minSafeStartPoint at the end of recovery, and start up anyway. Ok, that
> would avoid the problem Fujii-san described. It's like my suggestion of
> ignoring the message if we're at minSafeStartPoint - 1 segment, just
> more lenient. I don't understand why you'd need a new control file
> state, though.
>
> You'd lose the extra protection minSafeStartPoint gives, though. For
> example, if you interrupt recovery and move recovery point backwards, we
> could refuse to start up when it's not safe to do so. It's currently a
> "don't do that!" case, but we could protect against that with
> minSafeStartPoint.

Hmm, another point of consideration is how this interacts with the
pause/continue. In particular, it was suggested earlier that you could
put an option into recovery.conf to start in paused mode. If you pause
recovery, and then stop and restart the server, and have that option in
recovery.conf, I would expect that when you enter consistent recovery
you're at the exact same paused location as before stopping the server.
The upshot of that is that we need to set minSafeStartPoint to that
exact location, at least when you pause & stop in a controlled fashion.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 09:33:19
Message-ID: 4982C95F.5060306@enterprisedb.com
Lists: pgsql-hackers

I just realized that the new minSafeStartPoint is actually exactly the
same concept as the existing minRecoveryPoint. As the recovery
progresses, we could advance minRecoveryPoint just as well as the new
minSafeStartPoint.

Perhaps it's a good idea to keep them separate anyway though, the
original minRecoveryPoint might be a useful debugging aid. Or what do
you think?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 10:34:21
Message-ID: 1233311661.3983.29.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 20:35 +0200, Heikki Linnakangas wrote:
> Hmm, another point of consideration is how this interacts with the
> pause/continue. In particular, it was suggested earlier that you could
> put an option into recovery.conf to start in paused mode. If you pause
> recovery, and then stop and restart the server, and have that option in
> recovery.conf, I would expect that when you enter consistent recovery
> you're at the exact same paused location as before stopping the server.
> The upshot of that is that we need to set minSafeStartPoint to that
> exact location, at least when you pause & stop in a controlled fashion.

OK, makes sense.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 10:47:55
Message-ID: 1233312475.8859.7.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > It looks like if you issue a fast shutdown during recovery, postmaster
> > doesn't kill bgwriter.
>
> Hmm, seems like we haven't thought through how shutdown during
> consistent recovery is supposed to behave in general. Right now, smart
> shutdown doesn't do anything during consistent recovery, because the
> startup process will just keep going. And fast shutdown will simply
> ExitPostmaster(1), which is clearly not right.

That whole area was something I was leaving until last, since immediate
shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
this before Christmas, briefly).

> I'm thinking that in both smart and fast shutdown, the startup process
> should exit in a controlled way as soon as it's finished with the
> current WAL record, and set minSafeStartPoint to the current point in
> the replay.

That makes sense, though isn't required.

> I wonder if bgwriter should perform a restartpoint before exiting?
> You'll have to start with recovery on the next startup anyway, but at
> least we could minimize the amount of WAL that needs to be replayed.

That seems like extra work for no additional benefit.

I think we're beginning to blur the lines between review and you just
adding some additional stuff in this area. There's nothing to stop you
doing further changes after this has been committed. We can also commit
what we have with some caveats, i.e. commit in pieces.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 11:06:25
Message-ID: 1233313585.8859.15.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Fri, 2009-01-30 at 11:33 +0200, Heikki Linnakangas wrote:
> I just realized that the new minSafeStartPoint is actually exactly the
> same concept as the existing minRecoveryPoint. As the recovery
> progresses, we could advance minRecoveryPoint just as well as the new
> minSafeStartPoint.
>
> Perhaps it's a good idea to keep them separate anyway though, the
> original minRecoveryPoint might be a useful debugging aid. Or what do
> you think?

I think we've been confusing ourselves substantially. The patch already
has everything it needs, but there is a one-line-fixable bug at the place
Fujii-san pointed out.

The code comments already explain how this works

* There are two points in the log that we must pass. The first
* is minRecoveryPoint, which is the LSN at the time the
* base backup was taken that we are about to roll forward from.
* If recovery has ever crashed or been stopped, there is also
* another point: minSafeStartPoint, which is the latest LSN
* that recovery could have reached prior to the crash.

The later message

FATAL: WAL ends before end time of backup dump

was originally triggered if

if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))

and I changed that. Now I look at it again, I see that the original if
test, shown above, is correct and should not have been changed.

Other than that, I don't see the need for further change. Heikki's
suggestions to write a new minSafeStartPoint are good ones and fit
within the existing mechanisms and meanings of these variables.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 11:11:17
Message-ID: 1233313877.8859.20.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-01-29 at 14:21 +0200, Heikki Linnakangas wrote:
> It looks like if you issue a fast shutdown during recovery, postmaster
> doesn't kill bgwriter.

Thanks for the report.

I'm thinking of adding a new function to make crash testing easier.

pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
which when replayed will just throw a FATAL error and crash Startup
process. We won't be adding that to the user docs...

This will allow us to produce tests that crash the server at specific
places, rather than trying to trap those points manually.

> Seems that reaper() needs to be taught that bgwriter can be active
> during consistent recovery. I'll take a look at how to do that.
>
>
> BTW, the message "terminating connection ..." is a bit misleading. It's
> referring to the startup process, which is hardly a connection. We have
> that in CVS HEAD too, so it's not something introduced by the patch, but
> seems worth changing in HS, since we then let real connections in while
> startup process is still running.
>
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 11:15:47
Message-ID: 4982E163.7030805@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> I'm thinking of adding a new function to make crash testing easier.
>
> pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
> which when replayed will just throw a FATAL error and crash Startup
> process. We won't be adding that to the user docs...
>
> This will allow us to produce tests that crash the server at specific
> places, rather than trying to trap those points manually.

Heh, talk about a footgun ;-). I don't think including that in CVS is a
good idea.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 11:25:23
Message-ID: 4982E3A3.7070809@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
>> Hmm, seems like we haven't thought through how shutdown during
>> consistent recovery is supposed to behave in general. Right now, smart
>> shutdown doesn't do anything during consistent recovery, because the
>> startup process will just keep going. And fast shutdown will simply
>> ExitPostmaster(1), which is clearly not right.
>
> That whole area was something I was leaving until last, since immediate
> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> this before Christmas, briefly).

We must handle shutdown gracefully, can't just leave bgwriter running
after postmaster exit.

Hmm, why does pg_standby catch SIGQUIT? Seems it could just let it kill
the process.

>> I wonder if bgwriter should perform a restartpoint before exiting?
>> You'll have to start with recovery on the next startup anyway, but at
>> least we could minimize the amount of WAL that needs to be replayed.
>
> That seems like extra work for no additional benefit.
>
> I think we're beginning to blur the lines between review and you just
> adding some additional stuff in this area. There's nothing to stop you
> doing further changes after this has been committed.

Sure. I think the "shutdown restartpoint" might actually fall out of the
way the code is structured anyway: bgwriter normally performs a
checkpoint before exiting.

> We can also commit
> what we have with some caveats, i.e. commit in pieces.

This late in the release cycle, I don't want to commit anything that we
would have to rip out if we run out of time. There is no difference from
review or testing point of view whether the code is in CVS or not.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-30 14:55:46
Message-ID: 498314F2.8030005@enterprisedb.com
Lists: pgsql-hackers

Ok, here's an attempt to make shutdown work gracefully.

Startup process now signals postmaster three times during startup: first
when it has done all the initialization and starts redo. At that point,
postmaster launches bgwriter, which starts to perform restartpoints when
it deems appropriate. The 2nd time signals when we've reached consistent
recovery state. As the patch stands, that's not significant, but it will
be with all the rest of the hot standby stuff. The 3rd signal is sent
when startup process has finished recovery. Postmaster used to wait for
the startup process to exit, and check the return code to determine
that, but now that we support shutdown, startup process also returns
with 0 exit code when it has been requested to terminate.
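The three-phase handshake could be sketched like this. The enum and function names are this sketch's invention (the real code signals the postmaster and reacts in its signal handlers); it only models which capabilities each signal unlocks.

```c
#include <assert.h>

enum startup_phase
{
    PHASE_REDO_STARTED,      /* 1st signal: initialization done, redo begins */
    PHASE_CONSISTENT,        /* 2nd signal: consistent recovery state reached */
    PHASE_RECOVERY_FINISHED  /* 3rd signal: recovery complete */
};

static int bgwriter_running = 0;
static int accepting_connections = 0;

/* Postmaster's reaction to each signal from the startup process. */
static void
postmaster_handle(enum startup_phase p)
{
    switch (p)
    {
        case PHASE_REDO_STARTED:
            bgwriter_running = 1;   /* bgwriter may now run restartpoints */
            break;
        case PHASE_CONSISTENT:
            /* not significant yet; with the rest of hot standby this is
             * where read-only backends could be let in */
            break;
        case PHASE_RECOVERY_FINISHED:
            accepting_connections = 1;  /* normal operation */
            break;
    }
}
```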

The startup process now catches SIGTERM, and calls proc_exit() at the
next WAL record. That's what will happen in a fast shutdown. Unexpected
death of the startup process is treated the same as a backend/auxiliary
process crash.

InitXLogAccess is now called in IsRecoveryProcessingMode(), as you suggested.

Simon Riggs wrote:
> On Thu, 2009-01-29 at 19:20 +0200, Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> It looks like if you issue a fast shutdown during recovery, postmaster
>>> doesn't kill bgwriter.
>> Hmm, seems like we haven't thought through how shutdown during
>> consistent recovery is supposed to behave in general. Right now, smart
>> shutdown doesn't do anything during consistent recovery, because the
>> startup process will just keep going. And fast shutdown will simply
>> ExitPostmaster(1), which is clearly not right.
>
> That whole area was something I was leaving until last, since immediate
> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> this before Christmas, briefly).
>
>> I'm thinking that in both smart and fast shutdown, the startup process
>> should exit in a controlled way as soon as it's finished with the
>> current WAL record, and set minSafeStartPoint to the current point in
>> the replay.
>
> That makes sense, though isn't required.
>
>> I wonder if bgwriter should perform a restartpoint before exiting?
>> You'll have to start with recovery on the next startup anyway, but at
>> least we could minimize the amount of WAL that needs to be replayed.
>
> That seems like extra work for no additional benefit.
>
> I think we're beginning to blur the lines between review and you just
> adding some additional stuff in this area. There's nothing to stop you
> doing further changes after this has been committed. We can also commit
> what we have with some caveats, i.e. commit in pieces.
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
recovery-infra-531534c.patch text/x-diff 57.7 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-31 09:25:49
Message-ID: 1233393949.4500.11.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I'm thinking of adding a new function to make crash testing easier.
> >
> > pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
> > which when replayed will just throw a FATAL error and crash Startup
> > process. We won't be adding that to the user docs...
> >
> > This will allow us to produce tests that crash the server at specific
> > places, rather than trying to trap those points manually.
>
> Heh, talk about a footgun ;-). I don't think including that in CVS is a
> good idea.

Thought you'd like it. I'd have preferred it in a plugin... :-(

Not really sure why it's a problem though. We support
pg_ctl stop -m immediate, which is the equivalent, yet we don't regard
that as a footgun.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-31 09:27:08
Message-ID: 1233394028.4500.13.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Fri, 2009-01-30 at 13:25 +0200, Heikki Linnakangas wrote:
> > That whole area was something I was leaving until last, since immediate
> > shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> > this before Christmas, briefly).
>
> We must handle shutdown gracefully, can't just leave bgwriter running
> after postmaster exit.

Nobody was suggesting we should.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-31 09:57:27
Message-ID: 1233395847.4500.27.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote:
> Ok, here's an attempt to make shutdown work gracefully.
>
> Startup process now signals postmaster three times during startup: first
> when it has done all the initialization, and starts redo. At that point,
> postmaster launches bgwriter, which starts to perform restartpoints when
> it deems appropriate. The 2nd time signals when we've reached consistent
> recovery state. As the patch stands, that's not significant, but it will
> be with all the rest of the hot standby stuff. The 3rd signal is sent
> when startup process has finished recovery. Postmaster used to wait for
> the startup process to exit, and check the return code to determine
> that, but now that we support shutdown, startup process also returns
> with 0 exit code when it has been requested to terminate.

Yeh, seems much cleaner.

Slightly bizarre though cos now we're pretty much back to my originally
proposed design. C'est la vie.

I like this way because it means we might in the future get Startup
process to perform post-recovery actions also.
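
The three-signal handshake quoted above can be modelled as a tiny state
machine. This is only a sketch with illustrative names; the real code
notifies the postmaster through pmsignal.c reasons rather than calling a
function like this:

```c
#include <assert.h>

/* Simplified postmaster states during recovery (names illustrative). */
typedef enum { PM_STARTUP, PM_RECOVERY, PM_RECOVERY_CONSISTENT, PM_RUN } PMState;

/* The three notifications the startup process sends, in order. */
typedef enum
{
    SIG_RECOVERY_STARTED,       /* initialization done, redo begins */
    SIG_RECOVERY_CONSISTENT,    /* consistent recovery point reached */
    SIG_RECOVERY_COMPLETED      /* recovery finished */
} StartupSignal;

static int bgwriter_running = 0;

/* Postmaster's reaction to each signal: launch bgwriter on the first,
 * note consistency on the second (where hot standby could admit
 * read-only backends), open for normal connections on the third. */
static PMState
react(PMState state, StartupSignal sig)
{
    switch (sig)
    {
        case SIG_RECOVERY_STARTED:
            bgwriter_running = 1;   /* bgwriter may now take restartpoints */
            return PM_RECOVERY;
        case SIG_RECOVERY_CONSISTENT:
            return PM_RECOVERY_CONSISTENT;
        case SIG_RECOVERY_COMPLETED:
            return PM_RUN;
    }
    return state;
}
```

The point of modelling it this way is that the postmaster no longer needs
to infer anything from the startup process's exit code; each transition is
driven by an explicit signal.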

> The startup process now catches SIGTERM, and calls proc_exit() at the
> next WAL record. That's what will happen in a fast shutdown. Unexpected
> death of the startup process is treated the same as a backend/auxiliary
> process crash.

Good. Like your re-arrangement of StartupProcessMain also.

Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be if
(IsUnderPostmaster), or at least a comment to explain why not or perhaps
an Assert.

Think you need to just throw away this chunk

@@ -5253,7 +5386,7 @@ StartupXLOG(void)
 	 * Complain if we did not roll forward far enough to render the backup
 	 * dump consistent.
 	 */
-	if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
+	if (InRecovery && !reachedSafeStartPoint)
 	{
 		if (reachedStopPoint)	/* stopped because of stop request */
 			ereport(FATAL,

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-31 20:32:43
Message-ID: 4984B56B.4050202@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2009-01-30 at 13:15 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I'm thinking to add a new function that will make crash testing easier.
>>>
>>> pg_crash_standby() will issue a new xlog record, XLOG_CRASH_STANDBY,
>>> which when replayed will just throw a FATAL error and crash Startup
>>> process. We won't be adding that to the user docs...
>>>
>>> This will allow us to produce tests that crash the server at specific
>>> places, rather than trying to trap those points manually.
>> Heh, talk about a footgun ;-). I don't think including that in CVS is a
>> good idea.
>
> Thought you'd like it. I'd have preferred it in a plugin... :-(
>
> Not really sure why its a problem though. We support
> pg_ctl stop -m immediate, which is the equivalent, yet we don't regard
> that as a footgun.

If you poison your WAL archive with an XLOG_CRASH_STANDBY record,
recovery will never be able to proceed past that point. There would have
to be a switch to ignore those records, at the very least.

pg_ctl stop -m immediate has some use in a production system, while this
would be a pure development aid. For that purpose, you might as well use
a patched version.

Presumably you want to test different kinds of crashes at different
points: FATAL, PANIC, segfault etc. A single crash mechanism like that
will only test one path.

You don't really need to do it with a new WAL record. You could just add
a GUC or recovery.conf option along the lines of recovery_target:
crash_target=0/123456, and check for that in ReadRecord or wherever you
want the crash to occur.
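
A minimal sketch of that suggestion, with a hypothetical crash_target LSN
checked in the record-reading loop (the option name comes from the mail
above; the type and function names are illustrative, and the real code
would elog(FATAL) at this point instead of returning a flag):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 64-bit LSN; the real XLogRecPtr of this era was a
 * {xlogid, xrecoff} struct. */
typedef uint64_t XLogRecPtr;

/* Hypothetical recovery.conf option, e.g. crash_target = 0/123456;
 * zero means the option is not set. */
static XLogRecPtr crash_target = 0;

/* Called from the record-reading loop: returns 1 when replay has
 * reached the requested crash point and should deliberately fail. */
static int
check_crash_target(XLogRecPtr read_ptr)
{
    return crash_target != 0 && read_ptr >= crash_target;
}
```

Because the trigger lives in the replaying server rather than in a WAL
record, it never poisons the archive; removing the setting lets recovery
proceed normally past the same point.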

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-01-31 20:41:26
Message-ID: 4984B776.4050807@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2009-01-30 at 16:55 +0200, Heikki Linnakangas wrote:
>> Ok, here's an attempt to make shutdown work gracefully.
>>
>> Startup process now signals postmaster three times during startup: first
>> when it has done all the initialization, and starts redo. At that point,
>> postmaster launches bgwriter, which starts to perform restartpoints when
>> it deems appropriate. The 2nd time signals when we've reached consistent
>> recovery state. As the patch stands, that's not significant, but it will
>> be with all the rest of the hot standby stuff. The 3rd signal is sent
>> when startup process has finished recovery. Postmaster used to wait for
>> the startup process to exit, and check the return code to determine
>> that, but now that we support shutdown, startup process also returns
>> with 0 exit code when it has been requested to terminate.
>
> Yeh, seems much cleaner.
>
> Slightly bizarre though cos now we're pretty much back to my originally
> proposed design. C'est la vie.

Yep. I didn't see any objections to that approach in the archives. There
were other problems in the early versions of the patch, but nothing
related to this arrangement.

> I like this way because it means we might in the future get Startup
> process to perform post-recovery actions also.

Yeah, it does. Do you have something in mind already?

> Your call to PMSIGNAL_RECOVERY_COMPLETED needs to be if
> (IsUnderPostmaster), or at least a comment to explain why not or perhaps
> an Assert.

Nah, StartupProcessMain is only run under postmaster; you don't want to
install signal handlers in a stand-alone backend. A stand-alone backend
calls StartupXLOG directly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-01 08:28:57
Message-ID: 1233476937.4500.48.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Sat, 2009-01-31 at 22:32 +0200, Heikki Linnakangas wrote:

> If you poison your WAL archive with an XLOG_CRASH_STANDBY record,
> recovery will never be able to proceed past that point. There would have
> to be a switch to ignore those records, at the very least.

Definitely in assert mode only.

I'll do it as a test patch and keep it separate from main line.

> You don't really need to do it with a new WAL record. You could just add
> a GUC or recovery.conf option along the lines of recovery_target:
> crash_target=0/123456, and check for that in ReadRecord or wherever you
> want the crash to occur.

Knowing that LSN is somewhat harder.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-01 08:32:01
Message-ID: 1233477121.4500.51.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Sat, 2009-01-31 at 22:41 +0200, Heikki Linnakangas wrote:

> > I like this way because it means we might in the future get Startup
> > process to perform post-recovery actions also.
>
> Yeah, it does. Do you have something in mind already?

Yes, but nothing that needs to be discussed yet.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-04 05:19:05
Message-ID: 3f0b79eb0902032119l1f07b293ha4c26bf485c63397@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> The startup process now catches SIGTERM, and calls proc_exit() at the next
> WAL record. That's what will happen in a fast shutdown. Unexpected death of
> the startup process is treated the same as a backend/auxiliary process
> crash.

If unexpected death of the startup process happens during automatic
recovery after a crash, the postmaster and bgwriter may get stuck,
because HandleChildCrash() can be called again before the FatalError flag
has been reset. While FatalError is still set, HandleChildCrash() doesn't
kill any auxiliary processes. So, bgwriter survives the crash and the
postmaster waits forever for the death of bgwriter, still in recovery
status (which means that new connections cannot be started). Is this a bug?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-04 11:35:19
Message-ID: 49897D77.503@enterprisedb.com
Lists: pgsql-hackers

Fujii Masao wrote:
> On Fri, Jan 30, 2009 at 11:55 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> The startup process now catches SIGTERM, and calls proc_exit() at the next
>> WAL record. That's what will happen in a fast shutdown. Unexpected death of
>> the startup process is treated the same as a backend/auxiliary process
>> crash.
>
> If unexpected death of the startup process happens in automatic recovery
> after a crash, postmaster and bgwriter may get stuck. Because HandleChildCrash()
> can be called before FatalError flag is reset. When FatalError is false,
> HandleChildCrash() doesn't kill any auxiliary processes. So, bgwriter survives
> the crash and postmaster waits for the death of bgwriter forever with recovery
> status (which means that new connection cannot be started). Is this bug?

Yes, and in fact I ran into it myself yesterday while testing. It seems
that we should reset FatalError earlier, ie. when the recovery starts
and bgwriter is launched. I'm not sure why in CVS HEAD we don't reset
FatalError until after the startup process is finished. Resetting it as
soon as all the processes have been terminated and the startup process
is launched again would seem like a more obvious place to do it. The
only difference that I can see is that if someone tries to connect while
the startup process is running, you now get a "the database system is in
recovery mode" message instead of "the database system is starting up"
if we're reinitializing after a crash. We can keep that behavior; we just
need to add another flag to mean "reinitializing after crash" that isn't
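
The two-flag scheme discussed here could behave roughly like this sketch
(FatalError is a real postmaster.c flag; the second flag and both
function names are hypothetical):

```c
#include <assert.h>
#include <string.h>

static int FatalError;        /* a child crashed; cleanup in progress */
static int Reinitializing;    /* reinitializing after a crash; cleared
                               * only when recovery is over */

/* Message shown to clients that try to connect too early. */
static const char *
connection_refusal_message(void)
{
    return Reinitializing
        ? "the database system is in recovery mode"
        : "the database system is starting up";
}

/* Reset FatalError as soon as all children are gone and the startup
 * process is relaunched, so a second startup-process crash is handled
 * like any other child crash; keep Reinitializing set so connection
 * attempts still see the "in recovery mode" message. */
static void
relaunch_startup_after_crash(void)
{
    FatalError = 0;
    Reinitializing = 1;
}

static void
recovery_finished(void)
{
    Reinitializing = 0;
}
```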

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-04 17:03:49
Message-ID: 4989CA75.4030206@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> * I think we are now renaming the recovery.conf file too early. The
> comment says "We have already restored all the WAL segments we need from
> the archive, and we trust that they are not going to go away even if we
> crash." We have, but the files overwrite each other as they arrive, so
> if the last restartpoint is not in the last restored WAL file then it
> will only exist in the archive. The recovery.conf is the only place
> where we store the information on where the archive is and how to access
> it, so by renaming it out of the way we will be unable to crash recover
> until the first checkpoint is complete. So the way this was in the
> original patch is the correct way to go, AFAICS.

I can see what you mean now. In fact we're not safe even when the last
restartpoint is in the last restored WAL file, because we always restore
segments from the archive to a temporary filename.

Yes, renaming recovery.conf at the first checkpoint avoids that problem.
However, it means that we'll re-enter archive recovery if we crash
before that checkpoint is finished. Won't that cause havoc if more files
have appeared in the archive since the crash, and we restore those even
though we already started up from an earlier point? How do the timelines
work in that case?

We could avoid that by performing a good old startup checkpoint, but I
quite like the fast failover time we get without it.

> * my original intention was to deprecate log_restartpoints and would
> still like to do so. log_checkpoints does just as well for that. Even
> less code than before...

Ok, done.

> * comment on BgWriterShmemInit() refers to CHECKPOINT_IS_STARTUP, but
> the actual define is CHECKPOINT_STARTUP. Would prefer the "is" version
> because it matches the IS_SHUTDOWN naming.

Fixed.

> * In CreateCheckpoint() the if test on TruncateSubtrans() has been
> removed, but the comment has not been changed (to explain why).

Thanks, comment updated. (we now call CreateCheckPoint() only after
recovery is finished)

> We should continue to measure performance of recovery in the light of
> these changes. I still feel that fsyncing the control file on each
> XLogFileRead() will give a noticeable performance penalty, mostly
> because we know doing exactly the same thing in normal running caused a
> performance penalty. But that is easily changed and cannot be done with
> any certainty without wider feedback, so no reason to delay code commit.

I've changed the way minRecoveryPoint is updated now anyway, so it no
longer happens every XLogFileRead().

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-04 18:25:31
Message-ID: 1233771931.4500.325.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Wed, 2009-02-04 at 19:03 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > * I think we are now renaming the recovery.conf file too early. The
> > comment says "We have already restored all the WAL segments we need from
> > the archive, and we trust that they are not going to go away even if we
> > crash." We have, but the files overwrite each other as they arrive, so
> > if the last restartpoint is not in the last restored WAL file then it
> > will only exist in the archive. The recovery.conf is the only place
> > where we store the information on where the archive is and how to access
> > it, so by renaming it out of the way we will be unable to crash recover
> > until the first checkpoint is complete. So the way this was in the
> > original patch is the correct way to go, AFAICS.
>
> I can see what you mean now. In fact we're not safe even when the last
> restartpoint is in the last restored WAL file, because we always restore
> segments from the archive to a temporary filename.
>
> Yes, renaming recovery.conf at the first checkpoint avoids that problem.
> However, it means that we'll re-enter archive recovery if we crash
> before that checkpoint is finished. Won't that cause havoc if more files
> have appeared to the archive since the crash, and we restore those even
> though we already started up from an earlier point? How do the timelines
> work in that case?

More files being added to the archive is possible, but unlikely.
What *is* a definite problem is that if we ended recovery by manual
command (possible with the HS patch) or by a recovery target, then we
would have no record of the point at which we finished. It is then
possible that recovery.conf has been re-edited, causing re-recovery to
end at a different place. That seems like a bad thing.

> We could avoid that by performing a good old startup checkpoint, but I
> quite like the fast failover time we get without it.

ISTM it's either slow failover or (fast failover, but restart archive
recovery if we crash).

I would suggest that at end of recovery we write the last LSN to the
control file, so if we crash recover then we will always end archive
recovery at the same place again should we re-enter it. So we would have
a recovery_target_lsn that overrides recovery_target_xid etc..
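
As a sketch of that override, assuming a hypothetical recovery_target_lsn
option (no such option existed at the time) and a simplified 64-bit LSN;
the function name echoes xlog.c's recoveryStopsHere but the body is
illustrative:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;      /* simplified 64-bit LSN */
typedef uint32_t TransactionId;

/* Hypothetical recovery.conf settings; 0 means "not set". */
static XLogRecPtr    recovery_target_lsn;
static TransactionId recovery_target_xid;

/* Decide whether replay should stop at this record.  An LSN target,
 * written at the end of a previous recovery, overrides any xid target,
 * so a re-entered archive recovery ends at exactly the same place. */
static int
recoveryStopsHere(XLogRecPtr record_lsn, TransactionId record_xid)
{
    if (recovery_target_lsn != 0)
        return record_lsn >= recovery_target_lsn;
    if (recovery_target_xid != 0)
        return record_xid == recovery_target_xid;
    return 0;   /* no target: replay to end of archive */
}
```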

Given where we are, I would suggest we go for the slow failover option
in this release. Doing otherwise is a risky decision with little gain.
BGwriter-in-recovery is a good thing in itself, we have other code to
review yet with a higher importance level, and the rest of HS is code
I'm actually much happier with. I've spent weeks trying to get the
transition between recovery and normal running clean, but it seems like
time to say I didn't get it right *enough* to be absolutely certain.
(Just the fast failover part of the patch!) Thanks for the review.

> > We should continue to measure performance of recovery in the light of
> > these changes. I still feel that fsyncing the control file on each
> > XLogFileRead() will give a noticeable performance penalty, mostly
> > because we know doing exactly the same thing in normal running caused a
> > performance penalty. But that is easily changed and cannot be done with
> > any certainty without wider feedback, so no reason to delay code commit.
>
> I've changed the way minRecoveryPoint is updated now anyway, so it no
> longer happens every XLogFileRead().

Care to elucidate?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 01:35:38
Message-ID: 3f0b79eb0902041735w15f2536dufd7ba327eff076db@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Yes, and in fact I ran into it myself yesterday while testing. It seems that
> we should reset FatalError earlier, ie. when the recovery starts and
> bgwriter is launched. I'm not sure why we in CVS HEAD we don't reset
> FatalError until after the startup process is finished. Resetting it as soon
> all the processes have been terminated and startup process is launched again
> would seem like a more obvious place to do it.

Which may repeat the recovery crash and reinitialization forever. To
prevent this problem, should unexpected death of the startup process not
cause reinitialization?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 06:10:29
Message-ID: 13365.1233814229@sss.pgh.pa.us
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> ... I'm not sure why we in CVS HEAD we don't reset
>> FatalError until after the startup process is finished.

> Which may repeat the recovery crash and reinitializing forever. To prevent
> this problem, unexpected death of startup process should not cause
> reinitializing?

Fujii-san has it in one.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 07:25:04
Message-ID: 498A9450.3030001@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
>> On Wed, Feb 4, 2009 at 8:35 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> ... I'm not sure why we in CVS HEAD we don't reset
>>> FatalError until after the startup process is finished.
>
>> Which may repeat the recovery crash and reinitializing forever. To prevent
>> this problem, unexpected death of startup process should not cause
>> reinitializing?
>
> Fujii-san has it in one.

In CVS HEAD, we always do ExitPostmaster(1) if the startup process dies
unexpectedly, regardless of FatalError. So it serves no such purpose there.

But yeah, we do have that problem with the patch. What do we want to do
if the startup process dies with FATAL? It seems reasonable to assume
that the problem isn't going to just go away if we restart: we'd do
exactly the same sequence of actions after restart.

But if it happens after reaching the consistent recovery point, the
system should still be in a consistent state, and we could stay open for
read-only transactions. That would be useful for recovering a corrupted
database/WAL: you could let WAL replay run as far as it can, then
connect and investigate the situation, or run pg_dump.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 07:28:15
Message-ID: 498A950F.6000009@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
>> We could avoid that by performing a good old startup checkpoint, but I
>> quite like the fast failover time we get without it.
>
> ISTM it's either slow failover or (fast failover, but restart archive
> recovery if crashes).
>
> I would suggest that at end of recovery we write the last LSN to the
> control file, so if we crash recover then we will always end archive
> recovery at the same place again should we re-enter it. So we would have
> a recovery_target_lsn that overrides recovery_target_xid etc..

Hmm, we don't actually want to end recovery at the same point again: if
there's some updates right after the database came up, but before the
first checkpoint and crash, those actions need to be replayed too.

> Given where we are, I would suggest we go for the slow failover option
> in this release.

Agreed. We could do it for crash recovery, but I'd rather not have two
different ways to do it. It's not as important for crash recovery, either.

>>> We should continue to measure performance of recovery in the light of
>>> these changes. I still feel that fsyncing the control file on each
>>> XLogFileRead() will give a noticeable performance penalty, mostly
>>> because we know doing exactly the same thing in normal running caused a
>>> performance penalty. But that is easily changed and cannot be done with
>>> any certainty without wider feedback, so no reason to delay code commit.
>>
>> I've changed the way minRecoveryPoint is updated now anyway, so it no
>> longer happens every XLogFileRead().
>
> Care to elucidate?

I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And
it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated
patch soon.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 07:42:21
Message-ID: 1233819741.4500.351.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:

> >> I've changed the way minRecoveryPoint is updated now anyway, so it no
> >> longer happens every XLogFileRead().
> >
> > Care to elucidate?
>
> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And
> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated
> patch soon.

Why do you think XLogFlush is called less frequently than XLogFileRead?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 07:46:52
Message-ID: 1233820012.4500.356.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> >> We could avoid that by performing a good old startup checkpoint, but I
> >> quite like the fast failover time we get without it.
> >
> > ISTM it's either slow failover or (fast failover, but restart archive
> > recovery if crashes).
> >
> > I would suggest that at end of recovery we write the last LSN to the
> > control file, so if we crash recover then we will always end archive
> > recovery at the same place again should we re-enter it. So we would have
> > a recovery_target_lsn that overrides recovery_target_xid etc..
>
> Hmm, we don't actually want to end recovery at the same point again: if
> there's some updates right after the database came up, but before the
> first checkpoint and crash, those actions need to be replayed too.

I think we do need to. Crash recovery is supposed to return us to the
same state. Where we ended ArchiveRecovery is part of that state.

It isn't for crash recovery to decide that it saw a few more
transactions and decided to apply them anyway. The user may have
specifically ended recovery to avoid those additional changes from
taking place.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 08:07:36
Message-ID: 498A9E48.6050504@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> I would suggest that at end of recovery we write the last LSN to the
>>> control file, so if we crash recover then we will always end archive
>>> recovery at the same place again should we re-enter it. So we would have
>>> a recovery_target_lsn that overrides recovery_target_xid etc..
>> Hmm, we don't actually want to end recovery at the same point again: if
>> there's some updates right after the database came up, but before the
>> first checkpoint and crash, those actions need to be replayed too.
>
> I think we do need to. Crash recovery is supposed to return us to the
> same state. Where we ended ArchiveRecovery is part of that state.
>
> It isn't for crash recovery to decide that it saw a few more
> transactions and decided to apply them anyway. The user may have
> specifically ended recovery to avoid those additional changes from
> taking place.

Those additional changes were made in the standby, after ending
recovery. So the sequence of events is:

1. Standby performs archive recovery
2. End of archive (or stop point) reached. Open for connections,
read-write. Request an online checkpoint. Online checkpoint begins.
3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits,
commit returns.
4. "pg_ctl stop -m immediate". The checkpoint started in step 2 hasn't
finished yet.
5. Restart the server.

The server will now re-enter archive recovery. But this time, we want to
replay the INSERT too.

(let's do the startup checkpoint for now, as discussed elsewhere in this
thread)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 08:31:03
Message-ID: 498AA3C7.706@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
>
>>>> I've changed the way minRecoveryPoint is updated now anyway, so it no
>>>> longer happens every XLogFileRead().
>>> Care to elucidate?
>> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And
>> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated
>> patch soon.
>
> Why do you think XLogFlush is called less frequently than XLogFileRead?

It's not, but we only need to update the control file when we're
"flushing" an LSN that's greater than current minRecoveryPoint. And when
we do update minRecoveryPoint, we can update it to the LSN of the last
record we've read from the archive.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 08:59:00
Message-ID: 1233824340.4500.367.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 10:07 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> I would suggest that at end of recovery we write the last LSN to the
> >>> control file, so if we crash recover then we will always end archive
> >>> recovery at the same place again should we re-enter it. So we would have
> >>> a recovery_target_lsn that overrides recovery_target_xid etc..
> >> Hmm, we don't actually want to end recovery at the same point again: if
> >> there's some updates right after the database came up, but before the
> >> first checkpoint and crash, those actions need to be replayed too.
> >
> > I think we do need to. Crash recovery is supposed to return us to the
> > same state. Where we ended ArchiveRecovery is part of that state.
> >
> > It isn't for crash recovery to decide that it saw a few more
> > transactions and decided to apply them anyway. The user may have
> > specifically ended recovery to avoid those additional changes from
> > taking place.
>
> Those additional changes were made in the standby, after ending
> recovery. So the sequence of events is:
>
> 1. Standby performs archive recovery
> 2. End of archive (or stop point) reached. Open for connections,
> read-write. Request an online checkpoint. Online checkpoint begins.
> 3. A user connects, and does "INSERT INTO foo VALUES (123)". Commits,
> commit returns.
> 4. "pg_ctl stop -m immediate". The checkpoint started in step 2 hasn't
> finished yet
> 5. Restart the server.
>
> The server will now re-enter archive recovery. But this time, we want to
> replay the INSERT too.

I agree with this, so let me restate both comments together.

When the server starts, it begins a new timeline.

When recovering, we must switch to that timeline at the same point we did
previously when we ended archive recovery. We currently don't record
when that is, and we need to.

Yes, we must also replay the records in the new timeline once we have
switched to it, as you say. But we must not replay any further in the
older timeline(s).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 09:26:25
Message-ID: 1233825985.4500.388.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 10:31 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
> >
> >>>> I've changed the way minRecoveryPoint is updated now anyway, so it no
> >>>> longer happens every XLogFileRead().
> >>> Care to elucidate?
> >> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And
> >> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated
> >> patch soon.
> >
> > Why do you think XLogFlush is called less frequently than XLogFileRead?
>
> It's not, but we only need to update the control file when we're
> "flushing" an LSN that's greater than current minRecoveryPoint. And when
> we do update minRecoveryPoint, we can update it to the LSN of the last
> record we've read from the archive.

So we might end up flushing more often *and* we will be doing it
potentially in the code path of other users.

This change seems speculative and also against what has previously been
agreed with Tom. If he chooses not to comment on your changes, that's up
to him, but I don't think you should remove things quietly that have
been put there through the community process, as if they caused
problems. I feel like I'm in the middle here.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 09:34:34
Message-ID: 1233826474.4500.394.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 09:26 +0000, Simon Riggs wrote:

> This change seems speculative and also against what has previously been
> agreed with Tom. If he chooses not to comment on your changes, that's up
> to him, but I don't think you should remove things quietly that have
> been put there through the community process, as if they caused
> problems. I feel like I'm in the middle here.

This is only a problem because of two independent reviewers.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 09:46:05
Message-ID: 498AB55D.50408@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-02-05 at 10:31 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 09:28 +0200, Heikki Linnakangas wrote:
>>>> I got rid of minSafeStartPoint, advancing minRecoveryPoint instead. And
>>>> it's advanced in XLogFlush instead of XLogFileRead. I'll post an updated
>>>> patch soon.
>>> Why do you think XLogFlush is called less frequently than XLogFileRead?
>> It's not, but we only need to update the control file when we're
>> "flushing" an LSN that's greater than current minRecoveryPoint. And when
>> we do update minRecoveryPoint, we can update it to the LSN of the last
>> record we've read from the archive.
>
> So we might end up flushing more often *and* we will be doing it
> potentially in the code path of other users.

For example, imagine a database that fits completely in shared buffers.
If we update at every XLogFileRead, we have to fsync every 16 MB of WAL.
If we update in XLogFlush the way I described, we only need to update
when we flush a page from the buffer cache, which will only happen at
restartpoints. That's far fewer updates.

Expanding that example to a database that doesn't fit in cache, you're
still replacing the pages from the buffer cache that have been untouched
the longest. Such pages will have an old LSN, too, so we shouldn't need
to update very often.

I'm sure you can come up with an example of where we end up fsyncing
more often, but it doesn't seem like the common case to me.

> This change seems speculative and also against what has previously been
> agreed with Tom. If he chooses not to comment on your changes, that's up
> to him, but I don't think you should remove things quietly that have
> been put there through the community process, as if they caused
> problems. I feel like I'm in the middle here.

I'd like to have the extra protection that this approach gives. If we
let safeStartPoint be ahead of the actual WAL we've replayed, we have
to just assume we're fine if we reach the end of WAL before reaching that
point. That assumption falls down if, e.g., recovery is stopped and you
remove the last few WAL segments from the archive before restarting
it, or signal pg_standby to trigger failover too early. Tracking the
real safe starting point and enforcing it always protects you from that.

(we did discuss this a week ago:
http://archives.postgresql.org/message-id/4981F6E0.2040503@enterprisedb.com)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 10:33:56
Message-ID: 1233830036.4500.422.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:

> > So we might end up flushing more often *and* we will be doing it
> > potentially in the code path of other users.
>
> For example, imagine a database that fits completely in shared buffers.
> If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
> If we update in XLogFlush the way I described, you only need to update
> when we flush a page from the buffer cache, which will only happen at
> restartpoints. That's far less updates.

Oh, did you change the bgwriter so it doesn't do normal page cleaning?

General thoughts: Latest HS patch has a CPU profile within 1-2% of
current and the use of ProcArrayLock is fairly minimal now. The
additional CPU is recoveryStopsHere(), which enables the manual control
of recovery, so the trade off seems worth it. The major CPU hog remains
RecordIsValid, which is the CRC checks. Startup is still I/O bound.
Specific avoidable I/O hogs are (1) checkpoints and (2) page cleaning, so
I hope we can avoid both of those.

I also hope to minimise the I/O cost of keeping track of the consistency
point. If that was done as infrequently as each restartpoint then I
would certainly be very happy, but that won't happen in the proposed
scheme if we do page cleaning.

> Expanding that example to a database that doesn't fit in cache, you're
> still replacing pages from the buffer cache that have been untouched for
> longest. Such pages will have an old LSN, too, so we shouldn't need to
> update very often.

They will tend to be written in ascending LSN order, which will mean we
continually update the control file. Anything out of order does skip a
write. The better the cache is at finding the LRU blocks, the more writes
we will make.

> I'm sure you can come up with an example of where we end up fsyncing
> more often, but it doesn't seem like the common case to me.

I'm not trying to come up with counterexamples...

> > This change seems speculative and also against what has previously been
> > agreed with Tom. If he chooses not to comment on your changes, that's up
> > to him, but I don't think you should remove things quietly that have
> > been put there through the community process, as if they caused
> > problems. I feel like I'm in the middle here.
>
> I'd like to have the extra protection that this approach gives. If we
> let safeStartPoint to be ahead of the actual WAL we've replayed, we have
> to just assume we're fine if we reach end of WAL before reaching that
> point. That assumption falls down if e.g recovery is stopped, and you go
> and remove the last few WAL segments from the archive before restarting
> it, or signal pg_standby to trigger failover too early. Tracking the
> real safe starting point and enforcing it always protects you from that.

Doing it this way will require you to remove the existing specific error
messages about ending before the backup's end time, replacing them with
more general ones that say "consistency not reached", which makes it
harder to figure out what to do.

> (we did discuss this a week ago:
> http://archives.postgresql.org/message-id/4981F6E0.2040503@enterprisedb.com)

Yes, we need to update it in that case. But that is in no way agreement
to the other changes, nor does it require them.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 11:18:37
Message-ID: 498ACB0D.9010307@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>
>>> So we might end up flushing more often *and* we will be doing it
>>> potentially in the code path of other users.
>> For example, imagine a database that fits completely in shared buffers.
>> If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
>> If we update in XLogFlush the way I described, you only need to update
>> when we flush a page from the buffer cache, which will only happen at
>> restartpoints. That's far less updates.
>
> Oh, did you change the bgwriter so it doesn't do normal page cleaning?

No. Ok, that wasn't completely accurate. The page cleaning by bgwriter
will perform XLogFlushes, but that should be pretty insignificant. When
there's little page replacement going on, bgwriter will do a small
trickle of page cleaning, which won't matter much. If there's more page
replacement going on, bgwriter is cleaning up pages that will soon be
replaced, so it's just offsetting work from other backends (or the
startup process in this case).

>> Expanding that example to a database that doesn't fit in cache, you're
>> still replacing pages from the buffer cache that have been untouched for
>> longest. Such pages will have an old LSN, too, so we shouldn't need to
>> update very often.
>
> They will tend to be written in ascending LSN order which will mean we
> continually update the control file. Anything out of order does skip a
> write. The better the cache is at finding LRU blocks out the more writes
> we will make.

When minRecoveryPoint is updated, it's not updated to just the LSN that's
being flushed. It's updated to the recptr of the most recently read WAL
record. That's an important point to avoid that behavior, just like
XLogFlush normally flushes all of the outstanding WAL, not just up to the
requested LSN.

>> I'd like to have the extra protection that this approach gives. If we
>> let safeStartPoint to be ahead of the actual WAL we've replayed, we have
>> to just assume we're fine if we reach end of WAL before reaching that
>> point. That assumption falls down if e.g recovery is stopped, and you go
>> and remove the last few WAL segments from the archive before restarting
>> it, or signal pg_standby to trigger failover too early. Tracking the
>> real safe starting point and enforcing it always protects you from that.
>
> Doing it this way will require you to remove existing specific error
> messages about ending before end time of backup, to be replaced by more
> general ones that say "consistency not reached" which is harder to
> figure out what to do about it.

Yeah. If that's an important distinction, we could still save the
original backup stop location somewhere, just so that we can give the
old error message when we've not passed that location. But perhaps a
message like "WAL ends before reaching a consistent state", with a hint
like "Make sure you archive all the WAL created during the backup",
would suffice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 11:52:54
Message-ID: 1233834774.4500.440.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >
> >>> So we might end up flushing more often *and* we will be doing it
> >>> potentially in the code path of other users.
> >> For example, imagine a database that fits completely in shared buffers.
> >> If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
> >> If we update in XLogFlush the way I described, you only need to update
> >> when we flush a page from the buffer cache, which will only happen at
> >> restartpoints. That's far less updates.
> >
> > Oh, did you change the bgwriter so it doesn't do normal page cleaning?
>
> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter
> will perform XLogFlushes, but that should be pretty insignificant. When
> there's little page replacement going on, bgwriter will do a small
> trickle of page cleaning, which won't matter much.

Yes, that case is good, but it wasn't the use case we're trying to speed
up by having the bgwriter active during recovery. We're worried about
I/O bound recoveries.

> If there's more page
> replacement going on, bgwriter is cleaning up pages that will soon be
> replaced, so it's just offsetting work from other backends (or the
> startup process in this case).

Which only needs to be done if we follow this route, so is not an
argument in favour.

Using more I/O in I/O bound cases makes the worst case even worse.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 12:18:14
Message-ID: 498AD906.1030507@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-02-05 at 13:18 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 11:46 +0200, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> So we might end up flushing more often *and* we will be doing it
>>>>> potentially in the code path of other users.
>>>> For example, imagine a database that fits completely in shared buffers.
>>>> If we update at every XLogFileRead, we have to fsync every 16MB of WAL.
>>>> If we update in XLogFlush the way I described, you only need to update
>>>> when we flush a page from the buffer cache, which will only happen at
>>>> restartpoints. That's far less updates.
>>> Oh, did you change the bgwriter so it doesn't do normal page cleaning?
>> No. Ok, that wasn't completely accurate. The page cleaning by bgwriter
>> will perform XLogFlushes, but that should be pretty insignificant. When
>> there's little page replacement going on, bgwriter will do a small
>> trickle of page cleaning, which won't matter much.
>
> Yes, that case is good, but it wasn't the use case we're trying to speed
> up by having the bgwriter active during recovery. We're worried about
> I/O bound recoveries.

Ok, let's do the math:

By updating minRecoveryPoint in XLogFileRead, you're fsyncing the
control file once every 16MB of WAL.

By updating in XLogFlush, the frequency depends on the amount of
shared_buffers available to buffer the modified pages, the average WAL
record size, and the cache hit ratio. Let's determine the worst case:

The smallest WAL record that dirties a page is a heap deletion record.
That contains just enough information to locate the tuple. If I'm
reading the headers right, that record is 48 bytes long (28 bytes of
xlog header + 18 bytes of payload + padding). Assuming that the WAL is
full of just those records, that there are no full-page images, and that
the cache hit ratio is 0%, we will need (16 MB / 48 B) * 8 kB = 2730 MB
of shared_buffers to achieve the same rate of one fsync per 16 MB of WAL.

So if you have a shared_buffers setting lower than 2.7 GB, you can get
more frequent fsyncs this way in the worst case. In the typical case,
you're probably not doing all deletes and you have a non-zero cache hit
ratio, so you achieve the same frequency with a much lower shared_buffers
setting. And if you're really that I/O bound, I doubt the few extra
fsyncs matter much.

Also note that when the control file is updated in XLogFlush, it's
typically the bgwriter doing it as it cleans buffers ahead of the clock
hand, not the startup process.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 14:19:51
Message-ID: 1233843591.4500.459.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 14:18 +0200, Heikki Linnakangas wrote:

> when the control file is updated in XLogFlush, it's
> typically the bgwriter doing it as it cleans buffers ahead of the
> clock hand, not the startup process

That is the key point. Let's do it your way.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 19:54:50
Message-ID: 498B440A.1030101@enterprisedb.com
Lists: pgsql-hackers

Ok, here's another version. Major changes since last patch:

- Startup checkpoint is now again performed after the recovery is
finished, before allowing (read-write) connections. This is because we
couldn't solve the problem of re-entering recovery after a crash before
the first online checkpoint.

- minSafeStartPoint is gone, and its functionality has been folded into
minRecoveryPoint. It was really the same semantics. There might have
been some debugging value in keeping the backup stop time around, but
it's in the backup label file in the base backup anyway.

- minRecoveryPoint is now updated in XLogFlush, instead of when a file
is restored from archive.

- log_restartpoints is gone. Use log_checkpoints in postgresql.conf
instead.

Outstanding issues:

- If bgwriter is performing a restartpoint when recovery ends, the
startup checkpoint will be queued up behind the restartpoint. And since
it uses the same smoothing logic as checkpoints, it can take quite some
time for that to finish. The original patch had some code to hurry up
the restartpoint by signaling the bgwriter if
LWLockConditionalAcquire(CheckPointLock) fails, but there's a race
condition with that if a restartpoint starts right after that check. We
could let the bgwriter do the checkpoint too, and wait for it, but
bgwriter might not be running yet, and we'd have to allow bgwriter to
write WAL while disallowing it for all other processes, which seems
quite complex. It seems like we need something like the
LWLockConditionalAcquire approach, but built into CreateCheckPoint to
eliminate the race condition.

- If you perform a fast shutdown while the startup process is waiting for
the restore command, the startup process sometimes throws a FATAL error,
which then escalates into an immediate shutdown. That leads to
different messages in the logs, and to skipping the shutdown
restartpoint that we now otherwise perform.

- It's not clear to me if the rest of the xlog flushing related
functions, XLogBackgroundFlush, XLogNeedsFlush and XLogAsyncCommitFlush,
need to work during recovery, and what they should do.

I'll continue working on those outstanding items.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
recovery-infra-41d3bcb.patch text/x-diff 63.0 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-05 20:16:09
Message-ID: 1233864969.4500.546.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:

> - If bgwriter is performing a restartpoint when recovery ends, the
> startup checkpoint will be queued up behind the restartpoint. And since
> it uses the same smoothing logic as checkpoints, it can take quite some
> time for that to finish. The original patch had some code to hurry up
> the restartpoint by signaling the bgwriter if
> LWLockConditionalAcquire(CheckPointLock) fails, but there's a race
> condition with that if a restartpoint starts right after that check. We
> could let the bgwriter do the checkpoint too, and wait for it, but
> bgwriter might not be running yet, and we'd have to allow bgwriter to
> write WAL while disallowing it for all other processes, which seems
> quite complex. Seems like we need something like the
> LWLockConditionalAcquire approach, but built into CreateCheckPoint to
> eliminate the race condition

Seems straightforward? Hold the lock longer.

> - If you perform a fast shutdown while startup process is waiting for
> the restore command, startup process sometimes throws a FATAL error
> which leads escalates into an immediate shutdown. That leads to
> different messages in the logs, and skipping of the shutdown
> restartpoint that we now otherwise perform.

Sometimes?

> - It's not clear to me if the rest of the xlog flushing related
> functions, XLogBackgroundFlush, XLogNeedsFlush and XLogAsyncCommitFlush,
> need to work during recovery, and what they should do.

XLogNeedsFlush should always return false when InRecoveryProcessingMode()
is true. The WAL is already in the WAL files, not in wal_buffers anymore.

XLogAsyncCommitFlush should contain Assert(!InRecoveryProcessingMode())
since it is called during a VACUUM FULL only.

XLogBackgroundFlush should never be called during recovery because the
WALWriter is never active in recovery. That should just be documented.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-06 08:06:15
Message-ID: 498BEF77.2070100@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
>> - If you perform a fast shutdown while startup process is waiting for
>> the restore command, startup process sometimes throws a FATAL error
>> which leads escalates into an immediate shutdown. That leads to
>> different messages in the logs, and skipping of the shutdown
>> restartpoint that we now otherwise perform.
>
> Sometimes?

I think what happens is that if the restore command receives the SIGTERM
and dies before the startup process that's waiting for the restore
command receives the SIGTERM, the startup process throws a FATAL error
because the restore command died unexpectedly. I put this

> if (shutdown_requested && InRedo)
> {
> /* XXX: Is EndRecPtr always the right value here? */
> UpdateMinRecoveryPoint(EndRecPtr);
> proc_exit(0);
> }

right after the "system(xlogRestoreCmd)" call, to exit gracefully if we
were requested to shut down while restore command was running, but it
seems that that's not enough because of the race condition.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-06 08:17:14
Message-ID: 1233908234.4500.653.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Fri, 2009-02-06 at 10:06 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
> >> - If you perform a fast shutdown while startup process is waiting for
> >> the restore command, startup process sometimes throws a FATAL error
> >> which leads escalates into an immediate shutdown. That leads to
> >> different messages in the logs, and skipping of the shutdown
> >> restartpoint that we now otherwise perform.
> >
> > Sometimes?
>
> I think what happens is that if the restore command receives the SIGTERM
> and dies before the startup process that's waiting for the restore
> command receives the SIGTERM, the startup process throws a FATAL error
> because the restore command died unexpectedly. I put this
>
> > if (shutdown_requested && InRedo)
> > {
> > /* XXX: Is EndRecPtr always the right value here? */
> > UpdateMinRecoveryPoint(EndRecPtr);
> > proc_exit(0);
> > }
>
> right after the "system(xlogRestoreCmd)" call, to exit gracefully if we
> were requested to shut down while restore command was running, but it
> seems that that's not enough because of the race condition.

Can we trap the death of the restorecmd and handle it differently from
the death of the startup process?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-09 15:13:02
Message-ID: 499047FE.9090407@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2009-02-06 at 10:06 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Thu, 2009-02-05 at 21:54 +0200, Heikki Linnakangas wrote:
>>>> - If you perform a fast shutdown while startup process is waiting for
>>>> the restore command, startup process sometimes throws a FATAL error
>>>> which leads escalates into an immediate shutdown. That leads to
>>>> different messages in the logs, and skipping of the shutdown
>>>> restartpoint that we now otherwise perform.
>>> Sometimes?
>> I think what happens is that if the restore command receives the SIGTERM
>> and dies before the startup process that's waiting for the restore
>> command receives the SIGTERM, the startup process throws a FATAL error
>> because the restore command died unexpectedly. I put this
>>
>>> if (shutdown_requested && InRedo)
>>> {
>>> /* XXX: Is EndRecPtr always the right value here? */
>>> UpdateMinRecoveryPoint(EndRecPtr);
>>> proc_exit(0);
>>> }
>> right after the "system(xlogRestoreCmd)" call, to exit gracefully if we
>> were requested to shut down while restore command was running, but it
>> seems that that's not enough because of the race condition.
>
> Can we trap the death of the restorecmd and handle it differently from
> the death of the startup process?

The startup process launches the restore command, so it's the startup
process that needs to handle its death.

Anyway, I think I've found a solution. While we're executing the restore
command, we're in a state where it's safe to proc_exit(0). We can set a
flag to tell the signal handler when we're executing the restore
command, so that the signal handler can proc_exit(0) on SIGTERM. So
if the startup process receives the SIGTERM first, it will proc_exit(0)
immediately, and if the restore command dies first due to the SIGTERM,
the startup process exits with proc_exit(0) when it sees that the restore
command exited because of the SIGTERM. If either process receives
SIGTERM for some reason other than a fast shutdown request, the
postmaster will see that the startup process exited unexpectedly, and
handle that like a child process crash.
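
In outline, the scheme looks like this (a simplified, self-contained
sketch, not the actual patch; run_restore_command and its return
convention are illustrative):

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Flag marking the window in which the startup process may simply
 * exit on SIGTERM, because no WAL record is partially applied. */
static volatile sig_atomic_t in_restore_command = 0;

static void
startup_sigterm_handler(int signo)
{
    (void) signo;
    if (in_restore_command)
        _exit(0);               /* stands in for proc_exit(0) */
    /* otherwise just remember the shutdown request, as before */
}

/* Returns true if the restore command was killed by SIGTERM, in which
 * case the caller should also exit(0): a fast shutdown is in progress. */
static bool
run_restore_command(const char *cmd)
{
    int rc;

    signal(SIGTERM, startup_sigterm_handler);

    in_restore_command = 1;
    rc = system(cmd);           /* stand-in for the real restore_command */
    in_restore_command = 0;

    return WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM;
}
```

Either way the race is closed: whichever of the two processes sees the
SIGTERM first, the startup process ends up exiting cleanly.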

Attached is an updated patch that does that, and I've fixed all the
other outstanding issues I listed earlier as well. Now I'm feeling again
that this is in pretty good shape.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
recovery-infra-2ffabdc.patch text/x-diff 68.9 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-18 12:20:07
Message-ID: 1234959607.3973.58.camel@ebony.2ndQuadrant


On Mon, 2009-02-09 at 17:13 +0200, Heikki Linnakangas wrote:

> Attached is an updated patch that does that, and I've fixed all the
> other outstanding issues I listed earlier as well. Now I'm feeling
> again that this is in pretty good shape.

UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
updated the control file, leading to log-filling behaviour on an idle
system.

DEBUG: updated min recovery point to ...

We should just tuck the message into the "if" section above it.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-18 12:26:12
Message-ID: 499BFE64.6070300@enterprisedb.com

Simon Riggs wrote:
> On Mon, 2009-02-09 at 17:13 +0200, Heikki Linnakangas wrote:
>
>> Attached is an updated patch that does that, and I've fixed all the
>> other outstanding issues I listed earlier as well. Now I'm feeling
>> again that this is in pretty good shape.
>
> UpdateMinRecoveryPoint() issues a DEBUG2 message even when we have not
> updated the control file, leading to log filling behaviour on an idle
> system.
>
> DEBUG: updated min recovery point to ...
>
> We should just tuck the message into the "if" section above it.

The outer "if" should ensure that it isn't printed repeatedly on an idle
system. But I agree it belongs inside the inner if section.
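
To illustrate the fix in a self-contained way (this is a simplified
stand-in, not the actual xlog.c code; the counter stands in for the
DEBUG2 ereport):

```c
#include <assert.h>

/* Simplified stand-in for UpdateMinRecoveryPoint(): the message
 * (modelled here as a counter) is emitted only in the branch that
 * actually advances minRecoveryPoint and updates the control file. */
static long minRecoveryPoint = 0;
static int debug_messages = 0;

static void
update_min_recovery_point(long lsn)
{
    if (minRecoveryPoint < lsn)         /* cheap unlocked test */
    {
        /* in the real code: take the lock, re-check, write control file */
        if (minRecoveryPoint < lsn)
        {
            minRecoveryPoint = lsn;
            debug_messages++;           /* the DEBUG2 ereport lives here */
        }
    }
}
```

Repeated calls with the same LSN then emit nothing, which is the point
of tucking the message inside the inner branch.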

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-18 12:40:08
Message-ID: 1234960808.3973.60.camel@ebony.2ndQuadrant


On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:

> The outer "if" should ensure that it isn't printed repeatedly on an idle
> system.

Regrettably not.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-18 16:01:53
Message-ID: 499C30F1.8010800@enterprisedb.com

Simon Riggs wrote:
> On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:
>
>> The outer "if" should ensure that it isn't printed repeatedly on an idle
>> system.
>
> Regrettably not.

Ok, committed. I fixed that and some comment changes. I also renamed
IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid confusion
with the "real" processing modes defined in miscadmin.h. That will
probably cause you merge conflicts in the hot standby patch, but it
should be a matter of search-replace to fix.

The changes need to be documented. At least the removal of
log_restartpoints is a clear user-visible change.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-18 16:21:27
Message-ID: 1234974087.3973.83.camel@ebony.2ndQuadrant


On Wed, 2009-02-18 at 18:01 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2009-02-18 at 14:26 +0200, Heikki Linnakangas wrote:
> >
> >> The outer "if" should ensure that it isn't printed repeatedly on an idle
> >> system.
> >
> > Regrettably not.
>
> Ok, committed.

Cool.

> I fixed that and some comment changes. I also renamed
> IsRecoveryProcessingMode() to RecoveryInProgress(), to avoid confusion
> with the "real" processing modes defined in miscadmin.h. That will
> probably cause you merge conflicts in the hot standby patch, but it
> should be a matter of search-replace to fix.

Yep, good change, agree with reasons.

> The changes need to be documented. At least the removal of
> log_restartpoints is a clear user-visible change.

Yep.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-26 17:19:47
Message-ID: 3f0b79eb0902260919l2675aaafq10e5b2d49ebfa3a1@mail.gmail.com

Hi,

On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> That whole area was something I was leaving until last, since immediate
> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
> this before Christmas, briefly).

This problem remains in current HEAD: immediate shutdown may be
unable to kill the startup process, because the system() call that
executes restore_command ignores SIGQUIT while waiting.
When I tried an immediate shutdown during recovery, only the startup
process survived. This is undesirable behavior, I think.

Should something like the following be added to RestoreArchivedFile()?

----
if (WTERMSIG(rc) == SIGQUIT)
    exit(2);
----

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-26 18:38:33
Message-ID: 49A6E1A9.5020901@enterprisedb.com

Fujii Masao wrote:
> On Fri, Jan 30, 2009 at 7:47 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> That whole area was something I was leaving until last, since immediate
>> shutdown doesn't work either, even in HEAD. (Fujii-san and I discussed
>> this before Christmas, briefly).
>
> This problem remains in current HEAD. I mean, immediate shutdown
> may be unable to kill the startup process because system() which
> executes restore_command ignores SIGQUIT while waiting.
> When I tried immediate shutdown during recovery, only the startup
> process survived. This is undesirable behavior, I think.

Yeah, we need to fix that.

> The following code should be added into RestoreArchivedFile()?
>
> ----
> if (WTERMSIG(rc) == SIGQUIT)
>     exit(2);
> ----

I don't see how that helps, as we already have this in there:

signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

ereport(signaled ? FATAL : DEBUG2,
        (errmsg("could not restore file \"%s\" from archive: return code %d",
                xlogfname, rc)));

which means we already ereport(FATAL) if the restore command dies with
SIGQUIT.

I think the real problem here is that pg_standby traps SIGQUIT. The
startup process doesn't receive the SIGQUIT because it's in system(),
and pg_standby doesn't propagate it to the startup process either
because it traps it.

I think we should simply remove the signal handler for SIGQUIT from
pg_standby. Or will that lead to a core dump by default? In that case, we
need pg_standby to exit(128) or similar, so that RestoreArchivedFile
understands that the command was killed by a signal.
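
For instance (a hedged sketch, not committed code; the helper name and
the exit code 128 are illustrative), pg_standby could exit with a status
above 125 so that the server's existing test classifies it as a signal
death:

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* The server-side test quoted above, factored into a helper: a restore
 * command counts as "signaled" if it was killed by a signal or exited
 * with a status above 125. */
static bool
restore_command_signaled(int rc)
{
    return WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
}

/* Hypothetical pg_standby SIGQUIT handler: exiting with a status above
 * 125 makes RestoreArchivedFile treat it like death by signal. */
static void
standby_sigquit_handler(int signo)
{
    (void) signo;
    _exit(128);
}
```

The advantage over dumping the handler entirely is that pg_standby still
gets a chance to clean up, while the server still sees a "signaled"
result and ereports FATAL as it does today.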

Another approach is to check that the postmaster is still alive, like we
do in walwriter and bgwriter:

/*
 * Emergency bailout if postmaster has died.  This is to avoid the
 * necessity for manual cleanup of all postmaster children.
 */
if (!PostmasterIsAlive(true))
    exit(1);

However, I'm afraid there's a race condition with that. If we do that
right after system(), the postmaster might have signaled us but not
exited yet. We could check it in the main loop instead, but if we wrongly
interpret the exit of the restore command as a "file not found - go ahead
and start up", the damage might be done by the time we notice that the
postmaster is gone.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-26 19:32:07
Message-ID: 3f0b79eb0902261132i7c62c6dbw211fe6964139c69e@mail.gmail.com

Hi,

On Fri, Feb 27, 2009 at 3:38 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I think the real problem here is that pg_standby traps SIGQUIT. The startup
> process doesn't receive the SIGQUIT because it's in system(), and pg_standby
> doesn't propagate it to the startup process either because it traps it.

Yes, you are right.

> I think we should simply remove the signal handler for SIGQUIT from
> pg_standby.

+1

> I don't see how that helps, as we already have this in there:
>
> signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;
>
> ereport(signaled ? FATAL : DEBUG2,
>         (errmsg("could not restore file \"%s\" from archive: return code %d",
>                 xlogfname, rc)));
>
> which means we already ereport(FATAL) if the restore command dies with SIGQUIT.

SIGQUIT should kill the process immediately, so I think that the startup
process, as well as the other auxiliary processes, should call exit(2)
instead of ereport(FATAL). Right?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-26 19:49:16
Message-ID: 1235677756.16176.573.camel@ebony.2ndQuadrant


On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:

> I think we should simply remove the signal handler for SIGQUIT from
> pg_standby.

If you do this, please make it release-dependent, so that pg_standby
behaves correctly for the release it is being used with.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot standby, recovery infra
Date: 2009-02-26 19:59:05
Message-ID: 49A6F489.109@enterprisedb.com

Simon Riggs wrote:
> On Thu, 2009-02-26 at 20:38 +0200, Heikki Linnakangas wrote:
>
>> I think we should simply remove the signal handler for SIGQUIT from
>> pg_standby.
>
> If you do this, please make it release dependent so pg_standby behaves
> correctly for the release it is being used with.

Hmm, I don't think there's a way for pg_standby to know which version of
PostgreSQL is calling it. Assuming there is, how would you want it to
behave? If you want no change in behavior in old releases, can't we just
leave it unfixed in back-branches? In fact, it seems more useful not to
detect the server version, so that if you do want the new behavior, you
can use an 8.4 pg_standby against an 8.3 server.

In back-branches, I think we need to decide between fixing this, at the
risk of breaking someone's script that is using "killall -QUIT
pg_standby" or similar to trigger failover, and leaving it as it is
knowing that immediate shutdown doesn't work on a standby server. I'm
not sure which is best.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com