Recovery of Multi-stage WAL actions

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Recovery of Multi-stage WAL actions
Date: 2007-10-29 23:03:46
Message-ID: 1193699026.4260.81.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We've had two hard to diagnose errors in recovery in recent months. ISTM
that the core issue is the way we allow Resource Managers to have
multi-stage WAL actions that persist for long periods of time. This
means we have no way of telling whether the answer
rm_safe_restartpoint() == false is a momentary, valid state or a
progressively worsening indicator of a subtle RM bug.

An example of a multi-stage WAL action would be an index split inside
one of the Resource Managers. Now that kind of action shouldn't take
very long, though theoretically it could for various reasons.

Right now we have a log message to cope with this:
http://archives.postgresql.org/pgsql-committers/2007-08/msg00374.php
but its not nearly as helpful as we'd like it to be. We could back-patch
this to 8.2, but I have a potentially better proposal.

I very much want to encourage authors of new Resource Managers and it
looks like we may be getting at least 3 new RMs that produce WAL
records: hash indexes (currently not WAL-logged), bitmap indexes and
clustered indexes for 8.4. We should be realistic that new bugs probably
will occur in recovery code for existing and new RMs.

What I'd like to do is force all of the RMs to record the lsn of any WAL
record that starts an incomplete action. Then, if an incomplete action
lives for more than a certain period of time it will be possible to
produce a log message saying "incomplete split has survived for X
seconds, in xlog time". That way we'll see log messages if any of the
RMs start to push an incomplete action onto their list and then not
consume it again.

We might trust each RM to implement code to LOG messages if their code
goes a little awry, but I'd prefer some mechanism that allows the main
server to check what's happening in each RM. That way we'd have a
cross-check on whether the RM is well-behaved, plus we'd only need to
implement the checking code once. Right now very similar, yet different
code runs inside each RM.

So my proposal is to have an incomplete split remember/forget API that
forces each RM to expose its incomplete split List. Currently each RM
has a hook on rm_saferestartpoint(), so that each RM manages its own
List. My new thought is to have one function safeRestartPoint() that
inspects each of the incomplete split Lists to see if they are empty. If
the lists are non-empty then inspect the age of each list entry to see
if it is worth reporting as a possible issue. Each RM would then store
incomplete splits using a ResourceManagerRememberIncompleteEvent(lsn,
id_data, payload??) and ResourceManagerforgetIncompleteEvent(id_data).
Implementation is a a bit hazy on that last part, but I think the
overall idea is clear.

That should mean that any incomplete split that lasts for the length of
one restartpoint, which is *at least* one checkpoint duration, should
cause a LOG message to be produced. We might even go as far as to ignore
super long-lived and therefore spurious incomplete splits when we issue
rm_cleanup() for fear of allowing RM bugs to kill recovery.

I'd like to suggest that those changes be performed now for 8.3 *and*
back-patched for 8.2. I want to make sure that all users are able to
diagnose server errors and report them. I'm guessing that might raise a
few eyebrows, but I think its justifiable. Bugs in complex code are
inevitable and should not be seen to reflect badly upon RM authors.
However, our inability to recognise RM bugs that do occur doesn't seem
acceptable to me, especially since they may save themselves up for the
moment of PITR fail-over. You might persuade me I'm being over-zealous
here, but High Availability is something we have to be zealous about.

It should also be possible to allow the server to stay up even if one of
the RM's fails to recover properly. That would need to be settable, so I
really only mean that for "optional" RMs, i.e. index RMs only. For those
cases we should be able to mark effected indexes by marking them
corrupt. Automatic rebuild of corrupt indexes could also be possible,
should it occur. That would be an 8.4 action... :-)

Comments appreciated, as ever.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Fetter 2007-10-29 23:42:54 Re: Proposal: real procedures again (8.4)
Previous Message James Mansion 2007-10-29 22:32:44 Re: Proposal: real procedures again (8.4)