Re: [HACKERS] Restartable Recovery

Lists: pgsql-hackerspgsql-patches
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Restartable Recovery
Date: 2006-07-11 16:55:46
Message-ID: 1152636947.2465.14.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


On Marko Kreen's detailed suggestion, I've implemented a restartable
recovery mode for archive recovery (aka PITR). Restart points are known
as recovery checkpoints and are normally taken every 100 checkpoints in
the log to ensure good recovery performance.

An additional mode
standby_mode = 'true'
can also be specified, which ensures that a recovery checkpoint occurs
for each checkpoint in the logs.

Some other code refactorings, though all changes isolated to xlog.c and
to pg_control.h; code comments welcome.

Applies cleanly to cvstip, passes make check.

Further details testing is very desirable. I've tested restarting a
recovery twice and things work successfully.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
restartableRecovery.patch text/x-patch 14.5 KB

From: Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Marko Kreen <markokr(at)gmail(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Restartable Recovery
Date: 2006-07-16 12:22:31
Message-ID: 87odvpkad4.fsf@gate450.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

> [2. text/x-patch; restartableRecovery.patch]

Hmm, wouldn't you have to reboot the resource managers at each
checkpoint? I'm afraid otherwise things like postponed page splits
could get lost on restart from a later checkpoint.

regards,
andreas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Marko Kreen <markokr(at)gmail(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Restartable Recovery
Date: 2006-07-16 14:51:47
Message-ID: 29986.1153061507@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org> writes:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> [2. text/x-patch; restartableRecovery.patch]

> Hmm, wouldn't you have to reboot the resource managers at each
> checkpoint? I'm afraid otherwise things like postponed page splits
> could get lost on restart from a later checkpoint.

Ouch. That's a bit nasty. You can't just apply a postponed split at
checkpoint time, because the WAL record could easily be somewhere after
the checkpoint, leading to duplicate insertions. Right offhand I don't
see how to make this work :-(

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Restartable Recovery
Date: 2006-07-16 15:57:55
Message-ID: 1153065476.2654.247.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2006-07-16 at 10:51 -0400, Tom Lane wrote:
> Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org> writes:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> >> [2. text/x-patch; restartableRecovery.patch]
>
> > Hmm, wouldn't you have to reboot the resource managers at each
> > checkpoint? I'm afraid otherwise things like postponed page splits
> > could get lost on restart from a later checkpoint.
>
> Ouch. That's a bit nasty. You can't just apply a postponed split at
> checkpoint time, because the WAL record could easily be somewhere after
> the checkpoint, leading to duplicate insertions. Right offhand I don't
> see how to make this work :-(

Yes, ouch. So much for gung-ho code sprints; thanks Andreas.

To do this we would need to have another rmgr specific routine that gets
called at a recovery checkpoint. This would then write to disk the
current state of the incomplete multi-WAL actions, in some manner.
During the startup routines we would check for any pre-existing state
files and use those to initialise the incomplete action cache. Cleanup
would then discard all state files.

That allows us to not-forget actions, but it doesn't help us if there
are problems repeating actions twice. We would at least know that we are
in a potential double-action zone and could give different kinds of
errors or handling.

Or we can simply mark any indexes incomplete-needs-rebuild if they had a
page split during the overlap time between the last known good recovery
checkpoint and the following one. But that does lead to randomly bounded
recovery time, which might be better to have started from scratch
anyway.

Given time available for 8.2, neither one is a quick fix.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [PATCHES] Restartable Recovery
Date: 2006-07-16 16:40:46
Message-ID: 6643.1153068046@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Sun, 2006-07-16 at 10:51 -0400, Tom Lane wrote:
>> Ouch. That's a bit nasty. You can't just apply a postponed split at
>> checkpoint time, because the WAL record could easily be somewhere after
>> the checkpoint, leading to duplicate insertions.

> To do this we would need to have another rmgr specific routine that gets
> called at a recovery checkpoint. This would then write to disk the
> current state of the incomplete multi-WAL actions, in some manner.
> During the startup routines we would check for any pre-existing state
> files and use those to initialise the incomplete action cache. Cleanup
> would then discard all state files.

I thought about that too, but it seems very messy, eg you'd have to
actually fsync the state files to be sure they were safely down to disk.
Another problem is that WAL records between the checkpoint's REDO point
and the physical checkpoint location could get replayed twice, leading
to duplicate entries in the rmgr's state. Consider a split start WAL
entry located in that range, with the split completion entry after the
checkpoint --- on restart, we'd load a pending-split entry from the
state file and then create another one on seeing the split-start record
again.

A compromise that might be good enough is to add an rmgr routine defined
as "bool is_idle(void)" that tests whether the rmgr has any open state
to worry about. Then, recovery checkpoints are done only if all rmgrs
say they are idle. That is, we only checkpoint if there is not a need
for any state files. At least for btree's usage, this should be all
right since the "split pending" state is short-lived and so most of the
time we'd not need to skip checkpoints. I'm not totally sure about GIST
or GIN though (Teodor?).

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [PATCHES] Restartable Recovery
Date: 2006-07-16 18:42:22
Message-ID: 1153075343.2654.263.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:

> A compromise that might be good enough is to add an rmgr routine defined
> as "bool is_idle(void)" that tests whether the rmgr has any open state
> to worry about. Then, recovery checkpoints are done only if all rmgrs
> say they are idle.

Like it.

> That is, we only checkpoint if there is not a need
> for any state files. At least for btree's usage, this should be all
> right since the "split pending" state is short-lived and so most of the
> time we'd not need to skip checkpoints. I'm not totally sure about GIST
> or GIN though (Teodor?).

Considering how infrequently we wanted to do recovery checkpoints, this
is unlikely to cause any issue. But in any case, this is the best we can
give people, rather than a compromise.

Perhaps that should be extended to say whether there are any
non-idempotent changes made in the last checkpoint period. That might
cover a wider set of potential actions.

If index splits in GIST or GIN are *not* short lived, then I would
imagine we'd have some serious contention problems to clear up since an
inconsistent index is unusable and would require portions of it to be
locked throughout such operations to ensure their atomicity.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [PATCHES] Restartable Recovery
Date: 2006-07-16 19:33:33
Message-ID: 20723.1153078413@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:
>> A compromise that might be good enough is to add an rmgr routine defined
>> as "bool is_idle(void)" that tests whether the rmgr has any open state
>> to worry about. Then, recovery checkpoints are done only if all rmgrs
>> say they are idle.

> Perhaps that should be extended to say whether there are any
> non-idempotent changes made in the last checkpoint period. That might
> cover a wider set of potential actions.

Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
reasons the rmgr might have for not wanting to restart here.

If we are only going to do a recovery checkpoint at every Nth checkpoint
record, then occasionally having to skip one seems no big problem ---
just do it at the first subsequent record that is safe.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [PATCHES] Restartable Recovery
Date: 2006-07-16 19:56:24
Message-ID: 1153079785.2654.266.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2006-07-16 at 15:33 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:
> >> A compromise that might be good enough is to add an rmgr routine defined
> >> as "bool is_idle(void)" that tests whether the rmgr has any open state
> >> to worry about. Then, recovery checkpoints are done only if all rmgrs
> >> say they are idle.
>
> > Perhaps that should be extended to say whether there are any
> > non-idempotent changes made in the last checkpoint period. That might
> > cover a wider set of potential actions.
>
> Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
> reasons the rmgr might have for not wanting to restart here.

You read my mind.

> If we are only going to do a recovery checkpoint at every Nth checkpoint
> record, then occasionally having to skip one seems no big problem ---
> just do it at the first subsequent record that is safe.

Got it.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Cc: pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [HACKERS] Restartable Recovery
Date: 2006-08-01 00:04:07
Message-ID: 1154390648.3226.47.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Sun, 2006-07-16 at 20:56 +0100, Simon Riggs wrote:
> On Sun, 2006-07-16 at 15:33 -0400, Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:
> > >> A compromise that might be good enough is to add an rmgr routine defined
> > >> as "bool is_idle(void)" that tests whether the rmgr has any open state
> > >> to worry about. Then, recovery checkpoints are done only if all rmgrs
> > >> say they are idle.
> >
> > > Perhaps that should be extended to say whether there are any
> > > non-idempotent changes made in the last checkpoint period. That might
> > > cover a wider set of potential actions.
> >
> > Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
> > reasons the rmgr might have for not wanting to restart here.
>
> You read my mind.
>
> > If we are only going to do a recovery checkpoint at every Nth checkpoint
> > record, then occasionally having to skip one seems no big problem ---
> > just do it at the first subsequent record that is safe.
>
> Got it.

I've implemented this for BTree, GIN, GIST using an additional rmgr
function bool rm_safe_restartpoint(void)

The functions are actually trivial, assuming I've understood this and
how GIST and GIN work for their xlogging.

"Recovery checkpoints" are now renamed "restartpoints" to avoid
confusion with checkpoints. So checkpoints occur during normal
processing (only) and restartpoints occur during recovery (only).

Updated patch enclosed, which I believe has no conflicts with the other
patches on xlog.c just submitted.

Much additional testing required, but the underlying concepts are very
simple really. Andreas: any further gotchas? :-)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
restartableRecovery2.patch text/x-patch 21.8 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [HACKERS] Restartable Recovery
Date: 2006-08-01 00:09:55
Message-ID: 200608010009.k7109te19208@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Nice. I was going to ask if this could make it into 8.2.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Sun, 2006-07-16 at 20:56 +0100, Simon Riggs wrote:
> > On Sun, 2006-07-16 at 15:33 -0400, Tom Lane wrote:
> > > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > > On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:
> > > >> A compromise that might be good enough is to add an rmgr routine defined
> > > >> as "bool is_idle(void)" that tests whether the rmgr has any open state
> > > >> to worry about. Then, recovery checkpoints are done only if all rmgrs
> > > >> say they are idle.
> > >
> > > > Perhaps that should be extended to say whether there are any
> > > > non-idempotent changes made in the last checkpoint period. That might
> > > > cover a wider set of potential actions.
> > >
> > > Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
> > > reasons the rmgr might have for not wanting to restart here.
> >
> > You read my mind.
> >
> > > If we are only going to do a recovery checkpoint at every Nth checkpoint
> > > record, then occasionally having to skip one seems no big problem ---
> > > just do it at the first subsequent record that is safe.
> >
> > Got it.
>
> I've implemented this for BTree, GIN, GIST using an additional rmgr
> function bool rm_safe_restartpoint(void)
>
> The functions are actually trivial, assuming I've understood this and
> how GIST and GIN work for their xlogging.
>
> "Recovery checkpoints" are now renamed "restartpoints" to avoid
> confusion with checkpoints. So checkpoints occur during normal
> processing (only) and restartpoints occur during recovery (only).
>
> Updated patch enclosed, which I believe has no conflicts with the other
> patches on xlog.c just submitted.
>
> Much additional testing required, but the underlying concepts are very
> simple really. Andreas: any further gotchas? :-)
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [HACKERS] Restartable Recovery
Date: 2006-08-07 17:05:38
Message-ID: 671.1154970338@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> I've implemented this for BTree, GIN, GIST using an additional rmgr
> function bool rm_safe_restartpoint(void)
> ...
> "Recovery checkpoints" are now renamed "restartpoints" to avoid
> confusion with checkpoints. So checkpoints occur during normal
> processing (only) and restartpoints occur during recovery (only).

Applied with revisions. As submitted the patch pushed backup_label out
of the way immediately upon reading it, which is no good: you need to be
sure that the starting checkpoint location is written to pg_control
first, else an immediate crash would allow the thing to try to start
from whatever checkpoint is listed in the backed-up pg_control. Also,
the minimum recovery stopping point that's obtained using the label file
still has to be enforced if there's a crash during the replay sequence.
I felt the best way to do that was to copy the minimum stopping point
into pg_control, so that's what the code does now.

Also, as I mentioned earlier, I think that doing restartpoints on the
basis of elapsed time is simpler and more useful than having an explicit
distinction between "normal" and "standby" modes. We can always invent
a standby_mode flag later if we need one, but we don't need it for this.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-patches(at)postgresql(dot)org, Andreas Seltenreich <andreas+pg(at)gate450(dot)dyndns(dot)org>, Marko Kreen <markokr(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [HACKERS] Restartable Recovery
Date: 2006-08-09 11:48:08
Message-ID: 1155124089.2368.87.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Mon, 2006-08-07 at 13:05 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I've implemented this for BTree, GIN, GIST using an additional rmgr
> > function bool rm_safe_restartpoint(void)
> > ...
> > "Recovery checkpoints" are now renamed "restartpoints" to avoid
> > confusion with checkpoints. So checkpoints occur during normal
> > processing (only) and restartpoints occur during recovery (only).
>
> Applied with revisions.

err....CheckPointGuts() :-) I guess patch reviews need some spicing up.

> As submitted the patch pushed backup_label out
> of the way immediately upon reading it, which is no good: you need to be
> sure that the starting checkpoint location is written to pg_control
> first, else an immediate crash would allow the thing to try to start
> from whatever checkpoint is listed in the backed-up pg_control. Also,
> the minimum recovery stopping point that's obtained using the label file
> still has to be enforced if there's a crash during the replay sequence.
> I felt the best way to do that was to copy the minimum stopping point
> into pg_control, so that's what the code does now.

Thanks for checking that.

> Also, as I mentioned earlier, I think that doing restartpoints on the
> basis of elapsed time is simpler and more useful than having an explicit
> distinction between "normal" and "standby" modes. We can always invent
> a standby_mode flag later if we need one, but we don't need it for this.

OK, agreed.

The original thinking was that writing a restartpoint was more crucial
when in standby mode; but this way we've better performance and have a
low ceiling on the restart time if that should ever occur at the worst
moment.

Thanks again to Marko for the concept.

I'll work on the docs for backup.sgml also.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com