Re: [HACKERS] Point in Time Recovery

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Point in Time Recovery
Date:	2004-07-05 20:45:00
Message-ID:	1089060299.17493.58.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Taking advantage of the freeze bubble allowed us... there are some last
minute features to add.

Summarising earlier thoughts, with some detailed digging and design from
me over the last few days - we're now in a position to add Point-in-Time
Recovery, on top of what's been achieved.

The target for the last record to recover to can be specified in 2 ways:
- by transactionId - not that useful, unless you have a means of
identifying what has happened from the log, then using that info to
specify how to recover - coming later - not in next few days :(
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
If we did, would that be portable?
Suggestions welcome, because I know very little of the details of
various *nix systems and win* on that topic.

Only COMMIT and ABORT records have timestamps, allowing us to circumvent
any discussion about partial transaction recovery and nested
transactions.

When we do recover, stopping at the timestamp is just half the battle.
We need to leave the xlog in which we stop in a state from which we can
enter production smoothly and cleanly. To do this, we could:
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
I'm tempted by the first plan, because it is more straightforward and
stands much less chance of me introducing 50 weird bugs just before
close.

Also, I think it is straightforward to introduce control file duplexing,
with a second copy stored and maintained in the pg_xlog directory. This
would provide additional protection for pg_control, which takes on more
importance now that archive recovery is working. pg_xlog is a natural
home, since on busy systems it's on its own disk away from everything
else, ensuring that at least one copy survives. I can't see a downside
to that, but others might... We can introduce user specifiable
duplexing, in later releases.

For later, I envisage an off-line utility that can be used to inspect
xlog records. This could provide a number of features:
- validate archived xlogs, to check they are sound.
- produce summary reports, to allow identification of transactionIds and
the effects of particular transactions
- performance analysis to allow decisions to be made about whether group
commit features could be utilised to good effect
(Not now...)

Best regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-05 21:46:56
Message-ID:	4263.1089064016@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Should we use a different datatype than time_t for the commit timestamp,
> one that offers more fine grained differentiation between checkpoints?

Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.
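
A minimal sketch of the selection problem, assuming a hypothetical list of
commit records whose timestamps are whole seconds and never decrease in log
order. The names `commits` and `stop_index` are illustrative only. It shows
why the interface has to say whether a target time is inclusive or
exclusive once several commits share a timestamp:

```python
from bisect import bisect_left, bisect_right

# Hypothetical commit log: (commit_time_in_whole_seconds, xid), in log order.
commits = [(100, 7), (101, 9), (101, 8), (101, 12), (102, 10)]

def stop_index(commits, target_time, inclusive):
    """Return how many leading commit records to replay.

    inclusive=True  -> replay every commit stamped <= target_time
    inclusive=False -> replay only commits stamped <  target_time
    Assumes timestamps are non-decreasing in log order.
    """
    times = [t for t, _ in commits]
    if inclusive:
        return bisect_right(times, target_time)
    return bisect_left(times, target_time)

replayed_incl = [xid for _, xid in commits[:stop_index(commits, 101, True)]]
replayed_excl = [xid for _, xid in commits[:stop_index(commits, 101, False)]]
```

With three commits stamped 101, a time target alone cannot pick out just one
of them; only the inclusive/exclusive choice (or an xid) can.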

> - when we stop, keep reading records until EOF, just don't apply them.
> When we write a checkpoint at end of recovery, the unapplied
> transactions are buried alive, never to return.
> - stop where we stop, then force zeros to EOF, so that no possible
> record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?

Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations. I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-05 23:11:45
Message-ID:	1089069104.17493.132.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Should we use a different datatype than time_t for the commit timestamp,
> > one that offers more fine grained differentiation between checkpoints?
>
> Pretty much everybody supports gettimeofday() (time_t and separate
> integer microseconds); you might as well use that. Note that the actual
> resolution is not necessarily microseconds, and it'd still not be
> certain that successive commits have distinct timestamps --- so maybe
> this refinement would be pointless. You'll still have to design a user
> interface that allows selection without the assumption of distinct
> timestamps.

Well, I agree, though without the desired UI for now, I think some
finer-grained mechanism would be good. This means extending the xlog
commit record by a couple of bytes...OK, let's live a little.

Eh? Which way round? The second plan was the one where I would destroy
data by overwriting it; that's exactly why I preferred the first.

Actually, the files are always copied from archive, so re-recovery is
always an available option in the design that's been implemented.

No matter...

> Actually this now reminds me of a discussion I had with Patrick
> Macdonald some time ago. The DB2 practice in this connection is that
> you *never* overwrite existing logfile data when recovering. Instead
> you start a brand new xlog segment file,

Now that's a much better plan...I suppose I just have to rack up the
recovery pointer to the first record on the first page of a new xlog
file, similar to the first plan, but just fast-forwarding rather than
forwarding.

My only issue was to do with the secondary Checkpoint marker, which is
always reset to the place you just restored FROM, when you complete a
recovery. That could lead to a situation where you recover, then before
next checkpoint, fail and lose last checkpoint marker, then crash
recover from previous checkpoint (again), but this time replay the
records you were careful to avoid.

> which is given a new "branch
> number" so it can be distinguished from the future-time xlog segments
> that you chose not to apply. I don't recall what the DB2 terminology
> was exactly --- not "branch number" I don't think --- but anyway the
> idea is that when you restart the database after an incomplete recovery,
> you are now in a sort of parallel universe that has its own history
> after the branch point (PITR stop point). You need to be able to
> distinguish archived log segments of this parallel universe from those
> of previous and subsequent incarnations.

That's a good idea, if only because you can so easily screw up your test
data during multiple recovery situations. But if it's good during
testing, it must be good in production too...since you may well perform
recovery...run for a while, then discover that you got it wrong first
time, then need to re-recover again. I already added that to my list of
gotchas and that would solve it.

I was going to say hats off to the Blue-hued ones, when I remembered
this little gem from last year
http://www.danskebank.com/link/ITreport20030403uk/$file/ITreport20030403uk.pdf

> I'm not sure whether Vadim
> intended our StartUpID to serve this purpose, but it could perhaps be
> used that way, if we reflected it in the WAL file names.
>

Well, I'm not sure about StartUpId....but certainly the high 2 bytes of
LogId looks pretty certain never to be anything but zeros. You have 2.4
x 10^14...which is 9,000 years at 1000 log file/sec
We could use the scheme you describe:
add 0x10000 to the logid every time you complete an archive recovery...so
the log files look like 0001000000000CE3 after you've recovered a load of
files that look like 0000000000000CE3
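
A minimal sketch of that naming scheme, assuming a file name made of 8 hex
digits of log id followed by 8 hex digits of segment number, as the two
example names suggest. The helper names and `TIMELINE_STEP` are hypothetical:

```python
TIMELINE_STEP = 0x10000  # assumption: bump the high 2 bytes of a 32-bit log id

def xlog_file_name(log_id, seg):
    """Format a name as 8 hex digits of log id + 8 hex digits of segment."""
    return "%08X%08X" % (log_id, seg)

def bump_timeline(log_id):
    """Return the log id to use after completing an archive recovery."""
    new_id = log_id + TIMELINE_STEP
    if new_id > 0xFFFFFFFF:
        raise OverflowError("no timeline numbers left in a 32-bit log id")
    return new_id

old = xlog_file_name(0x00000000, 0xCE3)
new = xlog_file_name(bump_timeline(0x00000000), 0xCE3)

# Names left within one timeline if the low 48 bits were all usable,
# and how long they would last at 1000 files per second.
names_per_timeline = 2 ** 48
years = names_per_timeline / 1000 / (365.25 * 24 * 3600)
```

Here `old` and `new` come out as the two names quoted above, and `years`
lands at roughly nine thousand, under the assumption that every one of the
low 48 bits is available for numbering.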

If you used StartUpID directly, you might just run out....but it's very
unlikely you would ever perform 65000 recovery situations - unless
you've run the <expletive> code as often as I have :(.

Doing that also means we don't have to work out how to do that with
StartUpID. Of course, altering the length and makeup of the xlog files
is possible too, but that will cause other stuff to stop working....

[We'll have to give this a no-SciFi name, unless we want to make
in-roads into the Dr.Who fanbase :) Don't get them started. Better
still, don't give it a name at all.]

I'll sleep on that lot.

Best regards, Simon Riggs

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-06 19:00:51
Message-ID:	40EAF6E3.9090508@archonet.com
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs wrote:
> On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
>
>>Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>>
>>>Should we use a different datatype than time_t for the commit timestamp,
>>>one that offers more fine grained differentiation between checkpoints?
>>
>>Pretty much everybody supports gettimeofday() (time_t and separate
>>integer microseconds); you might as well use that. Note that the actual
>>resolution is not necessarily microseconds, and it'd still not be
>>certain that successive commits have distinct timestamps --- so maybe
>>this refinement would be pointless. You'll still have to design a user
>>interface that allows selection without the assumption of distinct
>>timestamps.
>
>
> Well, I agree, though without the desired-for UI now, I think some finer
> grained mechanism would be good. This means extending the xlog commit
> record by a couple of bytes...OK, lets live a little.

At the risk of irritating people, I'll repeat what I suggested a few
weeks ago...

Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz,
pitr_comment text)
Let the user insert rows in transactions as desired. Let them stop the
restore when a specific (pitr_ts,pitr_comment) gets inserted (or on
pitr_id if they record it).
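
A rough simulation of the idea, not tied to any real recovery code: the
replay stream, the `marker:` convention and `replay_until_marker` are all
hypothetical stand-ins for "a transaction that inserted a row into
pg_pitr_checkpt with a given comment":

```python
# Hypothetical replay stream: one entry per committed transaction.
# A "marker:<comment>" change stands in for an INSERT into pg_pitr_checkpt.
stream = [
    {"xid": 501, "changes": ["update accounts"]},
    {"xid": 502, "changes": ["marker:before nightly batch"]},
    {"xid": 503, "changes": ["delete from orders"]},
]

def replay_until_marker(stream, comment):
    """Apply transactions in order, stopping after the one holding the marker.

    Returns (applied_xids, found), where found says whether the marker
    was seen before the stream ran out.
    """
    applied = []
    for txn in stream:
        applied.append(txn["xid"])
        if "marker:" + comment in txn["changes"]:
            return applied, True
    return applied, False
```

The stop condition here is an event in the log rather than a clock reading,
which is the point of the suggestion.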

IMHO time is seldom relevant, event boundaries are.

If you want to add special syntax for this, fine. If not, an INSERT
statement is a convenient way to do this anyway.

--
Richard Huxton
Archonet Ltd

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Richard Huxton <dev(at)archonet(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-06 21:39:44
Message-ID:	1089149984.17493.268.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Tue, 2004-07-06 at 20:00, Richard Huxton wrote:
> Simon Riggs wrote:
> > On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> >
> >>Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> >>
> >>>Should we use a different datatype than time_t for the commit timestamp,
> >>>one that offers more fine grained differentiation between checkpoints?
> >>
> >>Pretty much everybody supports gettimeofday() (time_t and separate
> >>integer microseconds); you might as well use that. Note that the actual
> >>resolution is not necessarily microseconds, and it'd still not be
> >>certain that successive commits have distinct timestamps --- so maybe
> >>this refinement would be pointless. You'll still have to design a user
> >>interface that allows selection without the assumption of distinct
> >>timestamps.
> >
> >
> > Well, I agree, though without the desired-for UI now, I think some finer
> > grained mechanism would be good. This means extending the xlog commit
> > record by a couple of bytes...OK, lets live a little.
>
> At the risk of irritating people, I'll repeat what I suggested a few
> weeks ago...
>

All feedback is good. Thanks.

> Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz,
> pitr_comment text)
> Let the user insert rows in transactions as desired. Let them stop the
> restore when a specific (pitr_ts,pitr_comment) gets inserted (or on
> pitr_id if they record it).
>

It's a good plan, but the recovery is currently offline recovery and no
SQL is possible. So no way to insert, no way to access tables until
recovery completes. I like that plan and probably would have used it if
it was viable.

> IMHO time is seldom relevant, event boundaries are.
>

Agreed, but time is the universally agreed way of describing two events
as being simultaneous. No other way to say "recover to the point when
the message queue went wild".

As of last post to Andreas, I've said I'll not bother changing the
granularity of the timestamp.

> If you want to add special syntax for this, fine. If not, an INSERT
> statement is a convenient way to do this anyway.

The special syntax isn't hugely important - I did suggest a kind of
SQL-like syntax previously, but that's gone now. Invoking recovery via a
command file IS important, so that we can tell the system it's not in
crash recovery AND that, once recovery has finished, it should respond to
crashes without re-entering archive recovery.

Thanks for your comments. I'm not making this more complex than it needs
to be; in fact much of the code is very simple - it's just the planning
that's complex.

Best regards, Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, mascarm(at)mascari(dot)com, ZeugswetterA(at)spardat(dot)at
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-06 21:40:06
Message-ID:	1089150005.17493.270.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:

> > - when we stop, keep reading records until EOF, just don't apply them.
> > When we write a checkpoint at end of recovery, the unapplied
> > transactions are buried alive, never to return.
> > - stop where we stop, then force zeros to EOF, so that no possible
> > record remains of previous transactions.
>
> Go with plan B; it's best not to destroy data (what if you chose the
> wrong restart point the first time)?
>
> Actually this now reminds me of a discussion I had with Patrick
> Macdonald some time ago. The DB2 practice in this connection is that
> you *never* overwrite existing logfile data when recovering. Instead
> you start a brand new xlog segment file, which is given a new "branch
> number" so it can be distinguished from the future-time xlog segments
> that you chose not to apply. I don't recall what the DB2 terminology
> was exactly --- not "branch number" I don't think --- but anyway the
> idea is that when you restart the database after an incomplete recovery,
> you are now in a sort of parallel universe that has its own history
> after the branch point (PITR stop point). You need to be able to
> distinguish archived log segments of this parallel universe from those
> of previous and subsequent incarnations. I'm not sure whether Vadim
> intended our StartUpID to serve this purpose, but it could perhaps be
> used that way, if we reflected it in the WAL file names.
>

Some more thoughts...focusing on what we do after we've finished
recovering. The objectives, as I see them, are to put the system into a
state that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at the admin's request
(in PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problems

Tom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]

Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach the PITR stop point, or we hit the end of the archived logs. If we
hit the end of the logs on archive, then we switch to a local copy, if one
exists that is higher than those, and carry on rolling forward until
either we reach the PITR stop point, or we hit the end of that log.
(Hopefully, there isn't more than one local xlog higher than the archive,
but it's possible).
If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.
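
The ordering described above can be sketched as follows, assuming segments
are identified by plain integers and ignoring gaps in the sequence. The
function names and the `"<n>.ready"` marker names are hypothetical; only
the `.ready` suffix comes from the text:

```python
def plan_replay(archived, local):
    """Return (segment, source) pairs in replay order.

    archived, local: collections of segment numbers (hypothetical integers).
    Archived copies are always preferred; a local copy is used only when it
    is higher than everything found in the archive.
    """
    plan = [(seg, "archive") for seg in sorted(archived)]
    last_archived = max(archived) if archived else -1
    plan += [(seg, "local") for seg in sorted(local) if seg > last_archived]
    return plan

def ready_markers(plan):
    """archive_status entries to write for segments that exist only locally."""
    return ["%d.ready" % seg for seg, source in plan if source == "local"]

plan = plan_replay({1, 2, 3}, {3, 4})
markers = ready_markers(plan)
```

Segment 3 is taken from the archive even though a local copy exists; only
segment 4, which the archive never received, is read locally and flagged so
that a later checkpoint does not remove it before it is archived.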

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.

At this point, we also reset the secondary checkpoint record, so that
should recovery be required again before next checkpoint AND the
shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might have made, such as deciding you need to go into
recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).

How do we implement timelines?
The main presumption in the code is that xlogs are sequential. That has
two effects:
1. during recovery, we try to open the "next" xlog by adding one to the
numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current
checkpoint marker
Creating a timeline by adding a larger number to LogId allows us to
prevent (1) from working, yet without breaking (2).
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.
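
A small sketch of why a large jump in LogId defeats effect (1) while leaving
effect (2) intact. It assumes segments are (log_id, seg) pairs compared in
that order, and it ignores the rollover of seg into log_id; all names here
are illustrative, not taken from the server code:

```python
TIMELINE_STEP = 0x10000  # assumed jump applied to the log id after archive recovery

def next_segment(log_id, seg, existing):
    """Effect (1): recovery looks for the segment numbered one higher."""
    candidate = (log_id, seg + 1)
    return candidate if candidate in existing else None

def removable(existing, checkpoint):
    """Effect (2): checkpoint removes segments ordered before the marker."""
    return sorted(s for s in existing if s < checkpoint)

# The old timeline had segments 5..7; recovery stopped inside segment 5.
old_timeline = {(0, 5), (0, 6), (0, 7)}
new_start = (0 + TIMELINE_STEP, 5)
on_disk = old_timeline | {new_start}

found_from_old = next_segment(0, 5, on_disk)           # old segment 6 is found
found_from_new = next_segment(*new_start, on_disk)     # nothing follows yet
cleanup = removable(on_disk, new_start)
```

Starting from the new timeline, the "add one" lookup can no longer stumble
onto the skipped segments, yet a "less than the checkpoint marker" scan
still sees every old segment as older.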

Should we implement timelines?
Yes, I think we should. I've already hit the problems that timelines
solve in my testing, which means others will hit them too, just when they
don't need the hassle.

Comments much appreciated, assuming you read this far...

Best regards, Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	mascarm(at)mascari(dot)com, ZeugswetterA(at)spardat(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 11:38:23
Message-ID:	1089718702.17493.2527.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Tue, 2004-07-06 at 22:40, Simon Riggs wrote:
> On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>
> > > - when we stop, keep reading records until EOF, just don't apply them.
> > > When we write a checkpoint at end of recovery, the unapplied
> > > transactions are buried alive, never to return.
> > > - stop where we stop, then force zeros to EOF, so that no possible
> > > record remains of previous transactions.
> >
> > Go with plan B; it's best not to destroy data (what if you chose the
> > wrong restart point the first time)?
> >
> > Actually this now reminds me of a discussion I had with Patrick
> > Macdonald some time ago. The DB2 practice in this connection is that
> > you *never* overwrite existing logfile data when recovering. Instead
> > you start a brand new xlog segment file, which is given a new "branch
> > number" so it can be distinguished from the future-time xlog segments
> > that you chose not to apply. I don't recall what the DB2 terminology
> > was exactly --- not "branch number" I don't think --- but anyway the
> > idea is that when you restart the database after an incomplete recovery,
> > you are now in a sort of parallel universe that has its own history
> > after the branch point (PITR stop point). You need to be able to
> > distinguish archived log segments of this parallel universe from those
> > of previous and subsequent incarnations. I'm not sure whether Vadim
> > intended our StartUpID to serve this purpose, but it could perhaps be
> > used that way, if we reflected it in the WAL file names.
> >
>
> Some more thoughts...focusing on the what do we do after we've finished
> recovering. The objectives, as I see them, are to put the system into a
> state, that preserves these features:
> 1. we never overwrite files, in case we want to re-run recovery
> 2. we never write files that MIGHT have been written previously
> 3. we need to ensure that any xlog records skipped at admins request (in
> PITR mode) are never in a position to be re-applied to this timeline.
> 4. ensure we can re-recover, if we need to, without further problems
>
> Tom's concept above, I'm going to call timelines. A timeline is the
> sequence of logs created by the execution of a server. If you recover
> the database, you create a new timeline. [This is because, if you've
> invoked PITR you absolutely definitely want log records written to, say,
> xlog15 to be different to those that were written to xlog15 in a
> previous timeline that you have chosen not to reapply.]
>
> Objective (1) is complex.
> When we are restoring, we always start with archived copies of the xlog,
> to make sure we don't finish too soon. We roll forward until we either
> reach PITR stop point, or we hit end of archived logs. If we hit end of
> logs on archive, then we switch to a local copy, if one exists that is
> higher than those, we carry on rolling forward until either we reach
> PITR stop point, or we hit end of that log. (Hopefully, there isn't more
> than one local xlog higher than the archive, but its possible).
> If we are rolling forward on local copies, then they are our only
> copies. We'd really like to archive them ASAP, but the archiver's not
> running yet - we don't want to force that situation in case the archive
> device (say a tape) is the one being used to recover right now. So we
> write an archive_status of .ready for that file, ensuring that the
> checkpoint won't remove it until it gets copied to archive, whenever
> that starts working again. Objective (1) met.
>
> When we have finished recovering we:
> - create a new xlog at the start of a new ++timeline
> - copy the last applied xlog record to it as the first record
> - set the record pointer so that it matches
> That way, when we come up and begin running, we never overwrite files
> that might have been written previously. Objective (2) met.
> We do the other stuff because recovery finishes up by pointing to the
> last applied record...which is what was causing all of this extra work
> in the first place.
>
> At this point, we also reset the secondary checkpoint record, so that
> should recovery be required again before next checkpoint AND the
> shutdown checkpoint record written after recovery completes is
> wrong/damaged, the recovery will not autorewind back past the PITR stop
> point and attempt to recover the records we have just tried so hard to
> reverse/ignore. Objective (3) met. (Clearly, that situation seems
> unlikely, but I feel we must deal with it...a newly restored system is
> actually very fragile, so a crash again within 3 minutes or so is very
> commonplace, as far as these things go).
>
> Should we need to re-recover, we can do so because the new timeline
> xlogs are further forward than the old timeline, so never get seen by
> any processes (all of which look backwards). Re-recovery is possible
> without problems, if required. This means you're a lot safer from some
> of the mistakes you might of made, such as deciding you need to go into
> recovery, then realising it wasn't required (or some other painful
> flapping as goes on in computer rooms at 3am).
>
> How do we implement timelines?
> The main presumption in the code is that xlogs are sequential. That has
> two effects:
> 1. during recovery, we try to open the "next" xlog by adding one to the
> numbers and then looking for that file
> 2. during checkpoint, we look for filenames less than the current
> checkpoint marker
> Creating a timeline by adding a larger number to LogId allows us to
> prevent (1) from working, yet without breaking (2).
> Well, Tom does seem to have something with regard to StartUpIds. I feel
> it is easier to force a new timeline by adding a very large number to
> the LogId IF, and only if, we have performed an archive recovery. That
> way, we do not change at all the behaviour of the system for people that
> choose not to implement archive_mode.
>
> Should we implement timelines?
> Yes, I think we should. I've already hit the problems that timelines
> solve in my testing and so that means they'll be hit when you don't need
> the hassle.
>

I'm still wrestling with the cleanup-after-stopping-at-point-in-time
code and have some important conclusions.

Moving forward on a timeline is somewhat tricky for xlogs, as shown
above,...but...

My earlier treatment seems to have neglected to include the clog also.
If we stop before end of log, then we also have potentially many (though
presumably at least one) committed transactions that we do not want to
be told about ever again.

The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (This is because transactions are ordered in
the clog by xid, and xids are assigned sequentially at txn start, whereas
in the xlog they are recorded in the order the txns complete.)
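
A toy illustration of that ordering mismatch, using a made-up commit
history (the xids and their order are invented for the example):

```python
# Hypothetical history: xids are handed out at transaction start (1, 2, 3, ...),
# but commit records reach the xlog in completion order.
xlog_commits = [2, 5, 1, 4, 3]   # order in which commit records were written

def replayed(xlog_commits, stop_after):
    """xids whose commit records are replayed when stopping after N records."""
    return set(xlog_commits[:stop_after])

def skipped(xlog_commits, stop_after):
    """xids whose commit records lie beyond the stop point."""
    return set(xlog_commits[stop_after:])

kept = replayed(xlog_commits, 2)
dropped = skipped(xlog_commits, 2)

# True when some single xid cut-off would separate kept from dropped.
separable = max(kept) < min(dropped)
```

Stopping after two commit records keeps xids 2 and 5 and drops 1, 3 and 4,
so the dropped transactions sit both below and between the kept ones in xid
order, and no single cut-off xid divides the two groups.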

Please tell me that we can ignore the state of the clog, but I think we
can't - if a new xid re-used a previous xid that had committed AND then
we crashed...we would have inconsistent data. Unless we physically write
zeros to clog for every begin transaction after a recovery...err, no...

The only recourse that I can see is to "truncate the future" of the
clog, which would mean:
- keeping track of the highest xid provided by any record from the xlog,
in xact.c, xact_redo
- using that xid to write zeros to the clog after this point until EOF
- drop any clog segment files past the new "high" segment
- no idea whether that affects NT or not...
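
A minimal in-memory sketch of the zeroing step, assuming the clog holds two
status bits per transaction packed four to a byte, with status 0 meaning
"no outcome recorded". It models one flat byte array rather than segment
files, and every function name here is hypothetical:

```python
BITS_PER_XACT = 2                    # assumption: two status bits per transaction
XACTS_PER_BYTE = 8 // BITS_PER_XACT  # so four transactions per byte

def set_status(clog, xid, status):
    """Store a 2-bit status for xid in the packed byte array."""
    byte, slot = divmod(xid, XACTS_PER_BYTE)
    shift = slot * BITS_PER_XACT
    cleared = clog[byte] & ~(0b11 << shift) & 0xFF
    clog[byte] = cleared | (status << shift)

def get_status(clog, xid):
    """Read back the 2-bit status stored for xid."""
    byte, slot = divmod(xid, XACTS_PER_BYTE)
    return (clog[byte] >> (slot * BITS_PER_XACT)) & 0b11

def truncate_future(clog, highest_replayed_xid):
    """Zero the status of every xid above highest_replayed_xid, in place."""
    for xid in range(highest_replayed_xid + 1, len(clog) * XACTS_PER_BYTE):
        set_status(clog, xid, 0)

clog = bytearray(4)                  # room for 16 transactions
for xid in (3, 6, 9):
    set_status(clog, xid, 1)         # 1 stands for "committed" in this sketch
truncate_future(clog, 6)             # 6 = highest xid seen during replay
```

After the call, xids 3 and 6 still read as committed while xid 9 reads as
zero, which is the "truncate the future" effect described in the list.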

The timeline idea works for xlog because once we've applied the xlog
records and checkpointed, we can discard the xlog records. We can't do
that with clog records (unless we followed recovery with a vacuum full -
which is possible, but not hugely desirable). Even that doesn't solve
the issue that xlog records don't have any prescribed position in the
file, whereas clog records do.

Right now, I don't know where to start with the clog code and the
opportunity for code-overlap with NT seems way high. These problems can
be conquered, given time and "given enough eyeballs".

I'm all ears for some bright ideas...but I'm getting pretty wary that we
may introduce some unintended features if we try to get this stabilised
within two weeks. My current conclusion is: let's commit archive recovery
in this release, then wait until the next dot release for full recovery
target features. We've hit all the features which were a priority and
the fundamental architecture is there, so I think it is time to be happy
with what we've got, for now.

Comments, please....remembering that I'd love it if I've missed
something that simplifies the task. Fire away.

Best regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, mascarm(at)mascari(dot)com, ZeugswetterA(at)spardat(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 14:29:44
Message-ID:	15780.1089728984@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Please tell me that we can ignore the state of the clog,

We can.

The reason that keeping track of timelines is interesting for xlog is
simply to take pity on the poor DBA who needs to distinguish the various
archived xlog files he's got laying about, and so that we can detect
errors like supplying inconsistent sets of xlog segments during restore.

This does not apply to clog because it's not archived. It's no more
than a data file. If you think you have trouble recreating clog then
you have the same issues recreating data files.

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 20:25:50
Message-ID:	1089750349.17493.3092.camel@stromboli

On Tue, 2004-07-13 at 15:29, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Please tell me that we can ignore the state of the clog,
>
> We can.
>

In general, you are of course correct.

> The reason that keeping track of timelines is interesting for xlog is
> simply to take pity on the poor DBA who needs to distinguish the various
> archived xlog files he's got laying about, and so that we can detect
> errors like supplying inconsistent sets of xlog segments during restore.
>
> This does not apply to clog because it's not archived. It's no more
> than a data file. If you think you have trouble recreating clog then
> you have the same issues recreating data files.

I'm getting carried away with the improbable....but this is the rather
strange, but possible scenario I foresee:

A sequence of times...
1. We start archiving xlogs
2. We take a checkpoint
3. We commit an important transaction
4. We take a backup
5. We take a checkpoint

As it stands currently, when we restore the backup, the controlfile says
that the last checkpoint was at 2, so we rollforward from 2 THRU 4 and
continue on past 5 until end of logs. Normally, end of logs isn't until
after 4...

When we specify a recovery target, it is possible to specify the
rollforward to complete just before point 3. So we use the backup taken
at 4 to rollforward to a point in the past (from the backup's
perspective). The backup taken at 4 may now have data and clog records
written by bgwriter.

Given that the time between checkpoints is likely to be longer than
previously was the case... this becomes a situation with non-zero
probability.

I was trying to solve this problem head on, but the best way is to make
sure we never allow ourselves such a muddled situation:

ISTM the way to avoid this is to insist that we always rollforward
through at least one checkpoint to guarantee that this will not occur.

...then I can forget I ever mentioned the ****** clog again.

I'm ignoring this issue for now....whether it exists or not!

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 21:19:56
Message-ID:	18739.1089753596@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> I'm getting carried away with the improbable....but this is the rather
> strange, but possible scenario I foresee:

> A sequence of times...
> 1. We start archiving xlogs
> 2. We take a checkpoint
> 3. We commit an important transaction
> 4. We take a backup
> 5. We take a checkpoint

> When we specify a recovery target, it is possible to specify the
> rollforward to complete just before point 3.

No, it isn't possible. The recovery *must* proceed at least as far as
wherever the end of the log was at the time the backup was completed.
Otherwise everything is broken, not only clog, because you may have disk
blocks in your backup that postdate where you stopped log replay.

To have a consistent recovery at all, you must replay the log starting
from a checkpoint before the backup began and extending to the time that
the backup finished. You only get to decide where to stop after that
point.

regards, tom lane
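
The rule stated above (replay must reach at least the point where the
backup finished, and only then may a stopping point be chosen) can be
sketched as a simple check. This is a minimal illustration, not code from
the patch: the function name and the idea of comparing wall-clock times
are assumptions made here for the example.

```python
from datetime import datetime


def check_recovery_target(backup_start: datetime,
                          backup_stop: datetime,
                          target: datetime) -> datetime:
    """Return the target if it is safe to stop there, else raise ValueError.

    Hypothetical helper: a target earlier than the end of the backup would
    leave disk blocks in the restored files that postdate the stop point.
    """
    if backup_stop < backup_start:
        raise ValueError("backup stop time precedes its start time")
    if target < backup_stop:
        raise ValueError(
            "recovery target is earlier than the end of the backup; "
            "use an older backup or a later target"
        )
    return target
```

In Simon's numbered scenario, a target just before point 3 combined with
the backup taken at point 4 would be rejected by such a check.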

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 22:25:34
Message-ID:	1089757534.17493.3217.camel@stromboli

On Tue, 2004-07-13 at 22:19, Tom Lane wrote:

> To have a consistent recovery at all, you must replay the log starting
> from a checkpoint before the backup began and extending to the time that
> the backup finished. You only get to decide where to stop after that
> point.
>

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

Thanks for your help,

Best regards, Simon Riggs

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 22:42:42
Message-ID:	200407132242.i6DMgg612564@candle.pha.pa.us

Simon Riggs wrote:
> On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
>
> > To have a consistent recovery at all, you must replay the log starting
> > from a checkpoint before the backup began and extending to the time that
> > the backup finished. You only get to decide where to stop after that
> > point.
> >
>
> So the situation is:
> - You must only stop recovery at a point in time (in the logs) after the
> backup had completed.
>
> No way to enforce that currently, apart from procedurally. Not exactly
> frequent, so I think I just document that and move on, eh?

If it happens, could you use your previous full backup and the PITR logs
from before stop stopped logging, and then after? Is there a period
where they could not restore reliably?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 22:52:34
Message-ID:	1089759153.17493.3270.camel@stromboli

On Tue, 2004-07-13 at 23:42, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Tue, 2004-07-13 at 22:19, Tom Lane wrote:
> >
> > > To have a consistent recovery at all, you must replay the log starting
> > > from a checkpoint before the backup began and extending to the time that
> > > the backup finished. You only get to decide where to stop after that
> > > point.
> > >
> >
> > So the situation is:
> > - You must only stop recovery at a point in time (in the logs) after the
> > backup had completed.
> >
> > No way to enforce that currently, apart from procedurally. Not exactly
> > frequent, so I think I just document that and move on, eh?
>
> If it happens, could you use your previous full backup and the PITR logs
> from before stop stopped logging, and then after?

Yes.

> Is there a period
> where they could not restore reliably?

Good question. No is the answer.

The situation is that the backup isn't timestamped with respect to the
logs, so it's possible to attempt to use the wrong backup for recovery.

The solution is procedural - make sure you timestamp your backup files,
so you know which ones to recover with...

Best Regards, Simon Riggs
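
As an illustration of the procedural solution above, here is a minimal
sketch of choosing which timestamped backup to recover with. The `Backup`
record and `pick_backup` function are hypothetical names introduced for
this example; the only rule taken from the thread is that the backup must
have finished before the point you want to recover to.

```python
from datetime import datetime
from typing import NamedTuple, Optional, Sequence


class Backup(NamedTuple):
    """Hypothetical catalogue entry for one base backup."""
    label: str
    started: datetime
    finished: datetime


def pick_backup(backups: Sequence[Backup],
                target: datetime) -> Optional[Backup]:
    """Return the most recent backup that finished at or before the target."""
    usable = [b for b in backups if b.finished <= target]
    # The latest usable backup needs the least log replay.
    return max(usable, key=lambda b: b.finished, default=None)
```

If no backup finished before the target, the function returns `None`,
which corresponds to the case where recovery to that point is not possible
with the backups at hand.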

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 22:56:31
Message-ID:	19894.1089759391@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> So the situation is:
> - You must only stop recovery at a point in time (in the logs) after the
> backup had completed.

Right.

> No way to enforce that currently, apart from procedurally. Not exactly
> frequent, so I think I just document that and move on, eh?

The procedure that generates a backup has got to be responsible for
recording both the start and stop times. If it does not do so then
it's fatally flawed. (Note also that you had better be careful to get
the time as seen on the server machine's clock ... this could be a nasty
gotcha if the backup is run on a different machine, such as an NFS
server.)

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 23:01:41
Message-ID:	200407132301.i6DN1fb16205@candle.pha.pa.us

Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > So the situation is:
> > - You must only stop recovery at a point in time (in the logs) after the
> > backup had completed.
>
> Right.
>
> > No way to enforce that currently, apart from procedurally. Not exactly
> > frequent, so I think I just document that and move on, eh?
>
> The procedure that generates a backup has got to be responsible for
> recording both the start and stop times. If it does not do so then
> it's fatally flawed. (Note also that you had better be careful to get
> the time as seen on the server machine's clock ... this could be a nasty
> gotcha if the backup is run on a different machine, such as an NFS
> server.)

OK, but procedurally, how do you correlate the start/stop time of the
tar backup with the WAL numeric file names?

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-13 23:13:37
Message-ID:	1089760416.17493.3327.camel@stromboli

PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

Many thanks,

Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 23:21:28
Message-ID:	1089760888.17493.3347.camel@stromboli

On Wed, 2004-07-14 at 00:01, Bruce Momjian wrote:
> Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > So the situation is:
> > > - You must only stop recovery at a point in time (in the logs) after the
> > > backup had completed.
> >
> > Right.
> >
> > > No way to enforce that currently, apart from procedurally. Not exactly
> > > frequent, so I think I just document that and move on, eh?
> >
> > The procedure that generates a backup has got to be responsible for
> > recording both the start and stop times. If it does not do so then
> > it's fatally flawed. (Note also that you had better be careful to get
> > the time as seen on the server machine's clock ... this could be a nasty
> > gotcha if the backup is run on a different machine, such as an NFS
> > server.)
>
> OK, but procedurally, how do you correlate the start/stop time of the
> tar backup with the WAL numeric file names?

No need. You just correlate the recovery target with the backup file
times. Mostly, you'll only ever use your last backup and won't need to
fuss with the times.

Backup should begin with a CHECKPOINT...then wait for that to complete,
just to make the backup as current as possible.

If you want to start purging your archives of old archived xlogs, you
can use the filedate (assuming you preserved that on your copy to
archive - but even if not, they'll be fairly close).

Best regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 23:28:21
Message-ID:	20205.1089761301@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> OK, but procedurally, how do you correlate the start/stop time of the
> tar backup with the WAL numeric file names?

Ideally the procedure for making a backup would go something like:

1. Inquire of the server its current time and the WAL position of the
most recent checkpoint record (which is what you really need).

2. Make the backup.

3. Inquire of the server its current time and the current end-of-WAL
position.

4. Record items 1 and 3 along with the backup itself.

I think the current theory was you could fake #1 by copying pg_control
before everything else, but this doesn't really help for step #3, so
it would probably be better to add some server functions to get this
info.

regards, tom lane
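
The four-step procedure above can be sketched as a small wrapper. This is
only an outline under stated assumptions: the server functions that Tom
suggests adding did not exist at the time of this message, so the two
query callables here are hypothetical stand-ins supplied by the caller,
and `make_backup` is whatever copies the data directory (tar, for
example).

```python
from typing import Callable, Dict, Tuple

# Hypothetical server query: returns (server_time, wal_position) as strings.
ServerQuery = Callable[[], Tuple[str, str]]


def run_backup(query_checkpoint: ServerQuery,
               make_backup: Callable[[], None],
               query_end_of_wal: ServerQuery) -> Dict[str, str]:
    """Run a backup and return the facts that must be stored with it."""
    # Step 1: server time and position of the most recent checkpoint.
    start_time, checkpoint_pos = query_checkpoint()
    # Step 2: make the backup itself.
    make_backup()
    # Step 3: server time and current end-of-WAL position.
    stop_time, end_of_wal_pos = query_end_of_wal()
    # Step 4: the caller records this label alongside the backup files.
    return {
        "start_time": start_time,
        "checkpoint_position": checkpoint_pos,
        "stop_time": stop_time,
        "end_of_wal_position": end_of_wal_pos,
    }
```

Both times come from the server rather than from the machine running the
copy, which avoids the clock mismatch Tom warns about when the backup runs
on a different host.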

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-13 23:54:05
Message-ID:	1089762845.17493.3417.camel@stromboli

On Wed, 2004-07-14 at 00:28, Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > OK, but procedurally, how do you correlate the start/stop time of the
> > tar backup with the WAL numeric file names?
>
> Ideally the procedure for making a backup would go something like:
>
> 1. Inquire of the server its current time and the WAL position of the
> most recent checkpoint record (which is what you really need).
>
> 2. Make the backup.
>
> 3. Inquire of the server its current time and the current end-of-WAL
> position.
>
> 4. Record items 1 and 3 along with the backup itself.
>
> I think the current theory was you could fake #1 by copying pg_control
> before everything else, but this doesn't really help for step #3, so
> it would probably be better to add some server functions to get this
> info.
>

err...I think at this point we should review the PITR patch....

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts rollforward
from there. Effectively, restore assumes it has access to an infinite
timeline of logs....which clearly isn't the case, but it's up to *you* to
check that you have the logs that go with the backups. (Or put another
way, if this sounds hard, buy some software that administers the
procedure for you). That's the mechanism that allows "infinite
recovery".

In brief, the code path is as identical as possible to the current crash
recovery situation...archive recovery restores the files from archive
when they are needed, just as if they had always been in pg_xlog, in a
way that ensures pg_xlog never runs out of space.

Recovery ends when: it reaches the recovery target you specified, or it
runs out of xlogs (first it runs out of archived xlogs, then tries to
find a more recent local copy if there is one).
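
To make the two ending conditions concrete, here is a toy model of the
rollforward loop. It is an illustration only and not the code in the
patch: segments are modelled as dictionary entries keyed by an integer
segment number, each holding `(commit_time, payload)` records, and the
target is treated as inclusive.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, str]            # (commit_time, payload), toy model
Segments = Dict[int, List[Record]]  # segment number -> records in order


def replay(archive: Segments, local: Segments, first_segment: int,
           target_time: Optional[int] = None) -> Tuple[List[str], str]:
    """Apply records in order; return (applied payloads, reason for stopping)."""
    applied: List[str] = []
    segment = first_segment
    while True:
        # Prefer the archived copy, then fall back to a local one.
        records = archive.get(segment)
        if records is None:
            records = local.get(segment)
        if records is None:
            return applied, "ran out of xlogs"
        for commit_time, payload in records:
            if target_time is not None and commit_time > target_time:
                return applied, "reached recovery target"
            applied.append(payload)
        segment += 1
```

With no target, the loop simply continues until neither the archive nor
the local directory has the next segment, which matches the "infinite
recovery" behaviour described above.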

> I think the current theory was you could fake #1 by copying pg_control
> before everything else, but this doesn't really help for step #3, so
> it would probably be better to add some server functions to get this
> info.

Not sure what you mean by "fake"....

Best Regards, Simon Riggs

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-14 02:31:05
Message-ID:	40F49AE9.2060805@familyhealth.com.au

Can you give us some suggestions of what kind of stuff to test? Is
there a way we can artificially kill the backend in all sorts of nasty
spots to see if recovery works? Does kill -9 simulate a 'power off'?

Chris

Simon Riggs wrote:

> PITR Patch v5_1 just posted has Point in Time Recovery working....
>
> Still some rough edges....but we really need some testers now to give
> this a try and let me know what you think.
>
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> to run the code (and let me know about it), so please have a look at
> [PATCHES] and try it out.
>
> Many thanks,
>
> Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-14 07:21:22
Message-ID:	1089789682.17493.3890.camel@stromboli

On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote:
> Can you give us some suggestions of what kind of stuff to test? Is
> there a way we can artificially kill the backend in all sorts of nasty
> spots to see if recovery works? Does kill -9 simulate a 'power off'?
>

I was hoping some fiendish plans would be presented to me...

But please start with "this feels like typical usage" and we'll go from
there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

I'd love some help assembling a test document with numbered tests...

Best regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-14 13:20:08
Message-ID:	25903.1089811208@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> I've not done power off tests, yet. They need to be done just to
> check...actually you don't need to do this to test PITR...

I agree, power off is not really the point here. What we need to check
into is (a) the mechanics of archiving WAL segments and (b) the
process of restoring given a backup and a bunch of WAL segments.

regards, tom lane

From:	markw(at)osdl(dot)org
To:	simon(at)2ndquadrant(dot)com
Cc:	pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-14 15:55:52
Message-ID:	200407141555.i6EFtnk22984@mail.osdl.org

On 14 Jul, Simon Riggs wrote:
> PITR Patch v5_1 just posted has Point in Time Recovery working....
>
> Still some rough edges....but we really need some testers now to give
> this a try and let me know what you think.
>
> Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> to run the code (and let me know about it), so please have a look at
> [PATCHES] and try it out.
>
> Many thanks,
>
> Simon Riggs

Simon,

I just tried applying the v5_1 patch against the cvs tip today and got a
couple of rejections. I'll copy the patch output here. Let me know if
you want to see the reject files or anything else:

$ patch -p0 < ../../../pitr-v5_1.diff
patching file backend/access/nbtree/nbtsort.c
Hunk #2 FAILED at 221.
1 out of 2 hunks FAILED -- saving rejects to file backend/access/nbtree/nbtsort.c.rej
patching file backend/access/transam/xlog.c
Hunk #11 FAILED at 1802.
Hunk #15 FAILED at 2152.
Hunk #16 FAILED at 2202.
Hunk #21 FAILED at 3450.
Hunk #23 FAILED at 3539.
Hunk #25 FAILED at 3582.
Hunk #26 FAILED at 3833.
Hunk #27 succeeded at 3883 with fuzz 2.
Hunk #28 FAILED at 4446.
Hunk #29 succeeded at 4470 with fuzz 2.
8 out of 29 hunks FAILED -- saving rejects to file backend/access/transam/xlog.c.rej
patching file backend/postmaster/Makefile
patching file backend/postmaster/postmaster.c
Hunk #3 succeeded at 1218 with fuzz 2 (offset 70 lines).
Hunk #4 succeeded at 1827 (offset 70 lines).
Hunk #5 succeeded at 1874 (offset 70 lines).
Hunk #6 succeeded at 1894 (offset 70 lines).
Hunk #7 FAILED at 1985.
Hunk #8 succeeded at 2039 (offset 70 lines).
Hunk #9 succeeded at 2236 (offset 70 lines).
Hunk #10 succeeded at 2996 with fuzz 2 (offset 70 lines).
1 out of 10 hunks FAILED -- saving rejects to file backend/postmaster/postmaster.c.rej
patching file backend/storage/smgr/md.c
Hunk #1 succeeded at 162 with fuzz 2.
patching file backend/utils/misc/guc.c
Hunk #1 succeeded at 342 (offset 9 lines).
Hunk #2 succeeded at 1387 (offset 9 lines).
patching file backend/utils/misc/postgresql.conf.sample
Hunk #1 succeeded at 113 (offset 10 lines).
patching file bin/initdb/initdb.c
patching file include/access/xlog.h
patching file include/storage/pmsignal.h

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	markw(at)osdl(dot)org
Cc:	pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-14 19:33:40
Message-ID:	1089833620.17493.4624.camel@stromboli

On Wed, 2004-07-14 at 16:55, markw(at)osdl(dot)org wrote:
> On 14 Jul, Simon Riggs wrote:
> > PITR Patch v5_1 just posted has Point in Time Recovery working....
> >
> > Still some rough edges....but we really need some testers now to give
> > this a try and let me know what you think.
> >
> > Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> > to run the code (and let me know about it), so please have a look at
> > [PATCHES] and try it out.
> >

> I just tried applying the v5_1 patch against the cvs tip today and got a
> couple of rejections. I'll copy the patch output here. Let me know if
> you want to see the reject files or anything else:
>

I'm on it. Sorry 'bout that all - midnight fingers.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	markw(at)osdl(dot)org
Cc:	pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-14 22:56:39
Message-ID:	1089845797.17493.4987.camel@stromboli

On Wed, 2004-07-14 at 20:33, Simon Riggs wrote:
> On Wed, 2004-07-14 at 16:55, markw(at)osdl(dot)org wrote:
> > On 14 Jul, Simon Riggs wrote:
> > > PITR Patch v5_1 just posted has Point in Time Recovery working....
> > >
> > > Still some rough edges....but we really need some testers now to give
> > > this a try and let me know what you think.
> > >
> > > Klaus Naumann and Mark Wong are the only [non-committers] to have tried
> > > to run the code (and let me know about it), so please have a look at
> > > [PATCHES] and try it out.
> > >
>
> > I just tried applying the v5_1 patch against the cvs tip today and got a
> > couple of rejections. I'll copy the patch output here. Let me know if
> > you want to see the reject files or anything else:
> >
>
> I'm on it. Sorry 'bout that all - midnight fingers.

Latest version, pitr_v5_2.patch...

- Updated to cvs tip
- Additional tip changes located and patched
- Full re-test of both recover to point in time and recover to xid
- 2 additional bug fixes
- corrected recovery.conf sample
- Patch test
- Patch manually inspected

(pgarch.c, pgarch.h and README identical to previous post)

Go for it...

Best regards, Simon

Attachment	Content-Type	Size
pitr_v5_2.patch	text/x-patch	57.9 KB
recovery.conf.sample	text/plain	2.3 KB
pgarch.c	text/x-c	13.7 KB
pgarch.h	text/x-c-header	481 bytes
README	text/html	11.6 KB

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-15 01:43:38
Message-ID:	40F5E14A.4090003@coretech.co.nz

I noticed that compiling with the 5_1 patch applied fails because
XLOG_archive_dir has been removed from xlog.c, but
src/backend/commands/tablecmds.c still uses it.

I did the following to tablecmds.c:

5408c5408
< extern char XLOG_archive_dir[];
---
> extern char *XLogArchiveDest;
5410c5410
< use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
---
> use_wal = XLogArchiveDest[0] && !rel->rd_istemp;

Now I have to see if I have broken it with this change :-)

regards

Mark

Simon Riggs wrote:

>On Wed, 2004-07-14 at 16:55, markw(at)osdl(dot)org wrote:
>
>
>>On 14 Jul, Simon Riggs wrote:
>>
>>
>>>PITR Patch v5_1 just posted has Point in Time Recovery working....
>>>
>>>Still some rough edges....but we really need some testers now to give
>>>this a try and let me know what you think.
>>>
>>>Klaus Naumann and Mark Wong are the only [non-committers] to have tried
>>>to run the code (and let me know about it), so please have a look at
>>>[PATCHES] and try it out.
>>>
>>>
>>>
>
>
>
>>I just tried applying the v5_1 patch against the cvs tip today and got a
>>couple of rejections. I'll copy the patch output here. Let me know if
>>you want to see the reject files or anything else:
>>
>>
>>
>
>I'm on it. Sorry 'bout that all - midnight fingers.

From:	SAKATA Tetsuo <sakata(dot)tetsuo(at)lab(dot)ntt(dot)co(dot)jp>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 01:49:21
Message-ID:	40F5E2A1.7080908@lab.ntt.co.jp

Hi, folks.

My colleagues and I are planning to test PITR after the 7.5 beta release.
We are now designing the test items, but some parts of the specification
are not clear enough (to us).

For example, it is not clear to us which resource managers are required
to store log records:

- whether some access methods (like B-tree) are required to log their data.
- whether create/drop actions on tablespaces are stored in the log.

We would be grateful if someone could clarify these points.

The test set we plan to run has the following items:

- PITR can recover an ordinary committed transaction's data.
  - the tuple data themselves
  - the index data associated with them
- PITR can recover the data of some special committed transactions.
  - DDL: create database, table, index and so on
  - maintenance commands (handling large amounts of data):
    truncate, vacuum, reindex and so on.

The items above are the 'data aspects' of the test. Other aspects are as
follows:

- Location of the archive log's drive:
  PITR can recover a database from archived log data
  - stored on the same drive as xlog.
  - stored on a different drive on the same machine
    on which PostgreSQL runs.
  - stored on a different drive on a different machine.

- Duration between a checkpoint and recovery:
  PITR can recover a database a sufficiently long time after a checkpoint.

- Time to recover:
  - to the end of the log.
  - to some specified time.

- Type of failure:
  - system down --- kill the PostgreSQL process (as a simulation).
  - media lost --- delete database files (as a simulation).
  - These two cases will be tested in a simulated situation first,
    and we would try some 'real' failures afterwards
    (a real power-down of the test machine for the first case,
    and unplugging the disk drive for the second one;
    these actions could damage the test machine, which is why
    we plan them after the 'ordinary' test items).
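
One way to turn the aspects above into a numbered test list is to take
the cross product of the independent dimensions. The sketch below is an
illustration only: the dimension values are paraphrased from the plan
above, and the function and variable names are made up for this example.

```python
from itertools import product
from typing import List

# Dimensions paraphrased from the test plan (labels are illustrative).
ARCHIVE_LOCATIONS = [
    "same drive as xlog",
    "different drive, same machine",
    "different drive, different machine",
]
RECOVERY_ENDPOINTS = ["end of log", "specified time"]
FAILURE_TYPES = ["system down (kill process)", "media lost (delete files)"]


def numbered_tests() -> List[str]:
    """Return one numbered line per combination of the three dimensions."""
    combos = product(ARCHIVE_LOCATIONS, RECOVERY_ENDPOINTS, FAILURE_TYPES)
    return [
        f"{n}. archive on {loc}; recover to {end}; failure: {fail}"
        for n, (loc, end, fail) in enumerate(combos, start=1)
    ]


if __name__ == "__main__":
    print("\n".join(numbered_tests()))
```

This yields 3 x 2 x 2 = 12 combinations; the 'data aspects' (DDL,
maintenance commands, index data) could be added as a further dimension at
the cost of a longer list.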

The test set is under construction and we'll test the 7.5 beta
for some weeks, and report the result of the test here.

Sincerely yours.
Tetsuo SAKATA.

--
sakata.tetsuo _at_ lab.ntt.co.jp
SAKATA, Tetsuo. Yokosuka JAPAN.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-15 07:15:03
Message-ID:	1089875702.17493.5539.camel@stromboli

On Thu, 2004-07-15 at 02:43, Mark Kirkwood wrote:
> I noticed that compiling with 5_1 patch applied fails due to
> XLOG_archive_dir being removed from xlog.c , but
> src/backend/commands/tablecmds.c still uses it.
>
> I did the following to tablecmds.c :
>
> 5408c5408
> < extern char XLOG_archive_dir[];
> ---
> > extern char *XLogArchiveDest;
> 5410c5410
> < use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
> ---
> > use_wal = XLogArchiveDest[0] && !rel->rd_istemp;
>
>

Yes, I discovered that myself.

The fix is included in pitr_v5_2.patch...

Your patch follows the right thinking and looks like it would have
worked...
- XLogArchiveMode carries the main bool value for mode on/off
- XLogArchiveDest might also be used, though best to use the mode

Thanks for looking through the code...

Best Regards, Simon Riggs

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 09:47:50
Message-ID:	40F652C6.4070007@coretech.co.nz

I tried what I thought was a straightforward scenario, and seem to have
broken it :-(

Here is the little tale

1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9) recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup

This is what I get :

LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A48054
LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/A48094
LOG: restored log file "0000000000000001" from archive
LOG: record with zero length at 0/1FFFFE0
LOG: redo done at 0/1FFFF30
LOG: restored log file "0000000000000001" from archive
LOG: restored log file "0000000000000001" from archive
PANIC: concurrent transaction log activity while database system is
shutting down
LOG: startup process (PID 13492) was terminated by signal 6
LOG: aborting startup due to startup process failure

The concurrent access is a bit of a puzzle, as this is my home machine
(i.e. I am *sure* no one else is connected!)

Mark

P.s : CVS HEAD from about 1 hour ago, PITR 5.2, FreeBSD 4.10 on x86

From:	HISADAMasaki <hisada(dot)masaki(at)lab(dot)ntt(dot)co(dot)jp>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 12:16:58
Message-ID:	20040715210816.E8A3.HISADA.MASAKI@lab.ntt.co.jp
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Dear Simon,

I've just tested pitr_v5_2.patch and got an error message
during archiving process as follows.

-- begin
LOG: archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp",return code=-1
-- end

The command run by system(3) works, but system(3) itself returns -1.
system(3) cannot get the right exit code from its child process
when SIGCHLD is set to SIG_IGN.

So I made the following change to pgarch_Main() in pgarch.c

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);

-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

After that,
the error message doesn't appear and it seems to be working properly.

Regards,
Hisada, Masaki

On Wed, 14 Jul 2004 00:13:37 +0100
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

--
HISADA, Masaki <hisada(dot)masaki(at)lab(dot)ntt(dot)co(dot)jp>

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 19:35:27
Message-ID:	1089920126.17493.6322.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
> I tried what I thought was a straightforward scenario, and seem to have
> broken it :-(
>
> Here is the little tale
>
> 1) initdb
> 2) set archive_mode and archive_dest in postgresql.conf
> 3) startup
> 4) create database called 'test'
> 5) connect to 'test' and type 'checkpoint'
> 6) backup PGDATA using 'tar -zcvf'
> 7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
> 8) shutdown and remove PGDATA
> 9) recover using 'tar -zxvf'
> 10) copy recovery.conf into PGDATA
> 11) startup
>
> This is what I get :
>
> LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
> LOG: recovery command file found...
> LOG: restore_program = cp %s/%s %s
> LOG: recovery_target_inclusive = true
> LOG: recovery_debug_log = true
> LOG: starting archive recovery
> LOG: restored log file "0000000000000000" from archive
> LOG: checkpoint record is at 0/A48054
> LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
> LOG: next transaction ID: 496; next OID: 25419
> LOG: database system was not properly shut down; automatic recovery in
> progress
> LOG: redo starts at 0/A48094
> LOG: restored log file "0000000000000001" from archive
> LOG: record with zero length at 0/1FFFFE0
> LOG: redo done at 0/1FFFF30
> LOG: restored log file "0000000000000001" from archive
> LOG: restored log file "0000000000000001" from archive
> PANIC: concurrent transaction log activity while database system is
> shutting down
> LOG: startup process (PID 13492) was terminated by signal 6
> LOG: aborting startup due to startup process failure
>
> The concurrent access is a bit of a puzzle, as this is my home machine
> (i.e. I am *sure* no one else is connected!)

First, thanks for sticking with it to test this.

I've not received such a message myself - this is interesting.

Is it possible to copy that directory to one side and re-run the test?
Add another parameter in postgresql.conf called "archive_debug = true"
Does it happen identically the second time?

What time difference was there between steps 5 and 6? I think I can hear
Andreas saying "told you".... I'm thinking the backup might be somehow
corrupted because the checkpoint occurred during the backup. Hmmm...

Could you also post me the recovery.log file? (don't post to list)

Thanks, Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	HISADAMasaki <hisada(dot)masaki(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 22:44:02
Message-ID:	1089931442.17493.6749.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
> Dear Simon,
>
> I've just tested pitr_v5_2.patch and got an error message
> during archiving process as follows.
>
> -- begin
> LOG: archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp",return code=-1
> -- end
>
> The command run by system(3) works, but system(3) itself returns -1.
> system(3) cannot get the right exit code from its child process
> when SIGCHLD is set to SIG_IGN.
>
> So I made the following change to pgarch_Main() in pgarch.c
>
> -- line 236 ---
> - pgsignal(SIGCHLD, SIG_IGN);
>
> -- line 236 ---
> + pgsignal(SIGCHLD, SIG_DFL);
>

Thank you for testing the patch. Very much appreciated.

I was aware of the potential issues of incorrect return codes, and that
exact part of the code is the part I'm least happy with.

I'm not sure I understand why it's returned -1, though I'll take your
recommendation. I've not witnessed such an issue. What system are you
running, or is it a default shell issue?

Do people think that the change is appropriate for all systems, or just
the one you're using?

Best Regards, Simon Riggs

From:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	HISADAMasaki <hisada(dot)masaki(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 23:01:33
Message-ID:	20040715230133.GB8005@dcc.uchile.cl
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
> On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:

> > -- line 236 ---
> > - pgsignal(SIGCHLD, SIG_IGN);
> >
> > -- line 236 ---
> > + pgsignal(SIGCHLD, SIG_DFL);
>
> I'm not sure I understand why it's returned -1, though I'll take your
> recommendation. I've not witnessed such an issue. What system are you
> running, or is it a default shell issue?
>
> Do people think that the change is appropriate for all systems, or just
> the one you're using?

My manpage for signal(2) says that you shouldn't assign SIG_IGN to
SIGCHLD, according to POSIX. It goes on to say that BSD and SysV
behaviors differ on this aspect.

(This is on linux BTW)

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 23:13:20
Message-ID:	40F70F90.4050607@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs wrote:

>
>First, thanks for sticking with it to test this.
>
>I've not received such a message myself - this is interesting.
>
>Is it possible to copy that directory to one side and re-run the test?
>Add another parameter in postgresql.conf called "archive_debug = true"
>Does it happen identically the second time?
>
>
>
Yes, identical results - I re-initdb'ed and ran the process again,
rather than reuse the files.

>What time difference was there between steps 5 and 6? I think I can hear
>Andreas saying "told you".... I'm thinking the backup might be somehow
>corrupted because the checkpoint occurred during the backup. Hmmm...
>
>
>
I was wondering about this, so left a bit more time in between, and
forced a sync as well for good measure.

5) $ psql -d test -c "checkpoint"; sleep 30;sync;sleep 30
6) $ tar -zcvf /data1/dump/pgdata-7.5.tar.gz *

>
>Thanks, Simon Riggs
>
>
>

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 23:46:54
Message-ID:	40F7176E.4000001@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs wrote:

>
>So far:
>
>I've tried to re-create the problem as exactly as I can, but it works
>for me.
>
>This is clearly an important case to chase down.
>
>I assume that this is the very first time you tried recovery? Second and
>subsequent recoveries using the same set have a potential loophole,
>which we have been discussing.
>
>Right now, I'm thinking that the "exactly 2 logs worth" of data has
>brought you very close to the end of the log file (FFFFE0) ending with 1
>and the shutdown checkpoint that is then subsequently written is
>failing.
>
>Can you repeat this your end?
>
>
>
It is repeatable at my end. It is actually fairly easy to recreate the
example I am using: download

http://sourceforge.net/projects/benchw

and generate the dataset for Pg - but trim the large "fact0.dat" dump
file using head -100000.
Thus step 7 consists of creating the 4 tables and COPYing in the data
for them.

>The nearest I can get to the exact record pointers you show are to start
>recovery at A4807C and to end at FFFF88.
>
>Overall, PITR changes the recovery process very little, if at all. The
>main areas of effect are to do with sequencing of actions and matching
>up the right logs with the right backup. I'm not looking for bugs in the
>code but in subtle side-effects and "edge" cases. Everything you can
>tell me will help me greatly in chasing that down.
>
>
>
I agree - I will try this sort of example again, but will change the
number of rows I am COPYing (currently 100000) and see if that helps.

>Best Regards, Simon Riggs
>
>
>

By way of contrast, using the *same* procedure (1-11), but generating 2
logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* -
e.g :

LOG: database system was interrupted at 2004-07-16 11:17:52 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A4803C
LOG: redo record is at 0/A4803C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/A4807C
postmaster starting
[postgres(at)shroudeater 7.5]$ LOG: restored log file "0000000000000001"
from archive
cp: cannot stat `/data1/pgdata/7.5-archive/0000000000000002': No such
file or directory
LOG: could not restore "0000000000000002" from archive
LOG: could not open file "/data1/pgdata/7.5/pg_xlog/0000000000000002"
(log file 0, segment 2): No such file or directory
LOG: redo done at 0/1FFFFD4
LOG: archive recovery complete
LOG: database system is ready
LOG: archiver started

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc:	HISADAMasaki <hisada(dot)masaki(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-15 23:51:53
Message-ID:	1089935513.17493.6923.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
> On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:
> > On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:
>
> > > -- line 236 ---
> > > - pgsignal(SIGCHLD, SIG_IGN);
> > >
> > > -- line 236 ---
> > > + pgsignal(SIGCHLD, SIG_DFL);
> >
> > I'm not sure I understand why it's returned -1, though I'll take your
> > recommendation. I've not witnessed such an issue. What system are you
> > running, or is it a default shell issue?
> >
> > Do people think that the change is appropriate for all systems, or just
> > the one you're using?
>
> My manpage for signal(2) says that you shouldn't assign SIG_IGN to
> SIGCHLD, according to POSIX. It goes on to say that BSD and SysV
> behaviors differ on this aspect.
>

POSIX rules OK!

So - I should be setting this to SIG_DFL and that's good for everyone?

OK. Will do.

Best regards, Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-16 00:02:38
Message-ID:	1089936157.17493.6958.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Fri, 2004-07-16 at 00:46, Mark Kirkwood wrote:

>
> By way of contrast, using the *same* procedure (1-11), but generating 2
> logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* -
> e.g :
>

Great...at least we have shown that something works (or can work) and
have begun to isolate the problem whatever it is.

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, HISADAMasaki <hisada(dot)masaki(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-16 03:49:04
Message-ID:	15402.1089949744@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
>> My manpage for signal(2) says that you shouldn't assign SIG_IGN to
>> SIGCHLD, according to POSIX.

> So - I should be setting this to SIG_DFL and that's good for everyone?

Yeah, we learned the same lesson in the backend not too many releases
back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
but not others.

You could do worse than to look at the existing handling of signals in
the postmaster and its children; that code has been beat on pretty
heavily ...
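
As a rough illustration of the point about SIGCHLD (a sketch only, not the
pgarch.c code): on platforms where ignoring SIGCHLD makes the kernel discard
child status, the wait inside system(3) has nothing to collect and the call
reports -1 even though the command ran. The helper below puts SIGCHLD back to
SIG_DFL around the call and decodes the result. The name run_archive_command
is hypothetical and does not appear in the patch.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Run cmd through system(3) with default SIGCHLD handling.
 * Returns the command's exit code (0..255), or -1 if the command could
 * not be run or did not exit normally (e.g. it was killed by a signal).
 * Hypothetical helper for illustration only.
 */
int run_archive_command(const char *cmd)
{
    void (*old_handler)(int);
    int status;

    /* With SIG_IGN in place the child's status may be thrown away. */
    old_handler = signal(SIGCHLD, SIG_DFL);
    if (old_handler == SIG_ERR)
        return -1;

    status = system(cmd);

    /* Put back whatever disposition the caller had installed. */
    signal(SIGCHLD, old_handler);

    if (status == -1)
        return -1;              /* fork or wait failed */
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;                  /* terminated abnormally */
}
```

A caller would treat any nonzero return as "segment not archived" and retry
later, rather than trusting the raw value returned by system(3).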

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, HISADAMasaki <hisada(dot)masaki(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Point in Time Recovery
Date:	2004-07-16 07:40:44
Message-ID:	1089963643.17493.7882.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Fri, 2004-07-16 at 04:49, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:
> >> My manpage for signal(2) says that you shouldn't assign SIG_IGN to
> >> SIGCHLD, according to POSIX.
>
> > So - I should be setting this to SIG_DFL and that's good for everyone?
>
> Yeah, we learned the same lesson in the backend not too many releases
> back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
> but not others.

Many thanks all, Best Regards Simon Riggs

From:	Gaetano Mendola <mendola(at)bigfoot(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-17 09:39:07
Message-ID:	40F8F3BB.8030002@bigfoot.com
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs wrote:

> On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote:
>
>>Can you give us some suggestions of what kind of stuff to test? Is
>>there a way we can artificially kill the backend in all sorts of nasty
>>spots to see if recovery works? Does kill -9 simulate a 'power off'?
>>
>
>
> I was hoping some fiendish plans would be presented to me...
>
> But please start with "this feels like typical usage" and we'll go from
> there...the important thing is to try the first one.
>
> I've not done power off tests, yet. They need to be done just to
> check...actually you don't need to do this to test PITR...
>
> We need exhaustive tests of...
> - power off
> - scp and cross network copies
> - all the permuted recovery options
> - archive_mode = off (i.e. current behaviour)
> - deliberately incorrectly set options (idiot-proof testing)

It would also be good if you wrote up how to perform these tests, to show
which problems PITR addresses. I know it addresses a power-off, but how
would I recover from one?

Regards
Gaetano Mendola

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-17 18:53:46
Message-ID:	16031.1090090426@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

[ ... some desultory reading of PITR patch ... ]

What is the point of having both archive_program and archive_dest as
GUC variables? Wouldn't it be simpler to fold them into one parameter,
viz

archive_command = 'cp %s /archivedir'

For that matter, do we need a separate archive_mode boolean? The one
thing I can positively guarantee about archive_dest (or archive_command)
is that we cannot come up with a useful default for it (no, /tmp isn't
good). Therefore it does not seem very reasonable to let the user turn
on archiving without having explicitly specified an archive destination.

I propose that we fold all three GUC flags into a single archive_command
string whose built-in default is an empty string, and you enable
archiving by setting it to something nonempty.
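
The proposal amounts to treating "command is set" as the on/off switch. A
minimal sketch of that check, assuming the setting is held as a C string (the
function name archiving_enabled is hypothetical, not from the patch):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Archiving is considered enabled only when a nonempty command string
 * has been configured. Hypothetical helper for illustration.
 */
bool archiving_enabled(const char *archive_command)
{
    return archive_command != NULL && archive_command[0] != '\0';
}
```

With this shape there is no way to switch archiving on without also having
said where the files should go.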

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-17 19:08:43
Message-ID:	200407171908.i6HJ8hc06808@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> [ ... some desultory reading of PITR patch ... ]
>
> What is the point of having both archive_program and archive_dest as
> GUC variables? Wouldn't it be simpler to fold them into one parameter,
> viz
>
> archive_command = 'cp %s /archivedir'
>
> For that matter, do we need a separate archive_mode boolean? The one
> thing I can positively guarantee about archive_dest (or archive_command)
> is that we cannot come up with a useful default for it (no, /tmp isn't
> good). Therefore it does not seem very reasonable to let the user turn
> on archiving without having explicitly specified an archive destination.

I assume archive_dest is used for both archive and recovery of archives.

> I propose that we fold all three GUC flags into a single archive_command
> string whose built-in default is an empty string, and you enable
> archiving by setting it to something nonempty.

I think the idea is that you would turn archiving on and off regularly
while you might never change the archive_command value. Also, how would
you disable it? Set it to "", and if you do, you then have no way to
remember your command string when you want to re-enable it.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-17 19:18:26
Message-ID:	16190.1090091906@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> What is the point of having both archive_program and archive_dest as
>> GUC variables?

> I assume archive_dest is used for both archive and recovery of archives.

You assume wrong; it's not used there. There isn't any real good
reason to suppose that the recovery process is going to fetch the files
from exactly where archiving put them, anyhow.

> I think the idea is that you would turn archiving on and off regularly

Why in the world would you do that? People who want PITR at all will
want it 24x7.

> while you might never change the archive_command value. Also, how would
> you disable it? Set it to "", and if you do, you then have no way to
> remember your command string when you want to re-enable it.

Leave the original value in a comment, if you're going to want it again
later.

I don't think any of the above arguments outweigh the risk of people
shooting themselves in the foot by enabling archive_mode without
specifying a proper command/destination.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-17 20:15:21
Message-ID:	200407172015.i6HKFLK17349@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> What is the point of having both archive_program and archive_dest as
> >> GUC variables?
>
> > I assume archive_dest is used for both archive and recovery of archives.
>
> You assume wrong; it's not used there. There isn't any real good
> reason to suppose that the recovery process is going to fetch the files
> from exactly where archiving put them, anyhow.
>
> > I think the idea is that you would turn archiving on and off regularly
>
> Why in the world would you do that? People who want PITR at all will
> want it 24x7.
>
> > while you might never change the archive_command value. Also, how would
> > you disable it? Set it to "", and if you do, you then have no way to
> > remember your command string when you want to re-enable it.
>
> Leave the original value in a comment, if you're going to want it again
> later.
>
> I don't think any of the above arguments outweigh the risk of people
> shooting themselves in the foot by enabling archive_mode without
> specifying a proper command/destination.

So you want to merge them all into a single command string. That does
seem less error-prone. I see a few variables that turn off
when set to '' like unix_socket_*. How would this command string work?
How do you specify the WAL file name to transfer?

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-18 05:04:43
Message-ID:	25995.1090127083@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> So you want to merge them all into a single command string. That does
> seem less error-prone. I see a few variables that turn off
> when set to '' like unix_socket_*. How would this command string work?
> How do you specify the WAL file name to transfer?

No different from before, necessarily. However I did not like the
restriction to a single %s in the submitted implementation. What I
have in my local copy is
%p -> full path of XLOG file to be archived
%f -> base name of XLOG file to be archived
and the suggested example becomes
archive_command = 'cp %p /mnt/server/pgarchive/%f'

Note that this example immediately eliminates one of the failure modes
Simon enumerates in his README, which is to try 'cp %s /foo' where /foo
isn't a directory. More generally, though, *only* a cp-to-directory
solution is likely to be very happy with not being able to get at the
base file name. Yes you can make a shellscript and use basename,
but I don't think you should have to do that if it could otherwise
be a one-liner.
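
To make the %p / %f substitution concrete, here is a small sketch of how such
a template could be expanded. It is written from the description above and is
not the code in the local copy mentioned; the function name
expand_archive_command and its return convention are assumptions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Expand an archive command template into out (capacity outsz):
 *   %p -> full path of the XLOG file
 *   %f -> base name of the XLOG file
 * Any other character, including a '%' followed by something else,
 * is copied through unchanged.
 * Returns 0 on success, -1 if the result does not fit.
 * Hypothetical helper for illustration.
 */
int expand_archive_command(const char *tmpl, const char *path,
                           char *out, size_t outsz)
{
    const char *base = strrchr(path, '/');
    const char *s;
    size_t n = 0;

    if (outsz == 0)
        return -1;
    base = (base != NULL) ? base + 1 : path;

    for (s = tmpl; *s != '\0'; s++) {
        const char *sub = NULL;

        if (s[0] == '%' && s[1] == 'p')
            sub = path;
        else if (s[0] == '%' && s[1] == 'f')
            sub = base;

        if (sub != NULL) {
            size_t len = strlen(sub);

            if (n + len >= outsz)
                return -1;
            memcpy(out + n, sub, len);
            n += len;
            s++;                /* skip the letter after '%' */
        } else {
            if (n + 1 >= outsz)
                return -1;
            out[n++] = *s;
        }
    }
    out[n] = '\0';
    return 0;
}
```

With the suggested setting 'cp %p /mnt/server/pgarchive/%f', the destination
always names a file, so the command cannot silently overwrite a plain file
called /foo the way 'cp %s /foo' could.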

(In case it's not obvious from the above, I am hacking with intent to
commit soon. Maybe tomorrow, if my wife doesn't make me paint the
bathroom instead...)

regards, tom lane

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-18 05:50:11
Message-ID:	40FA0F93.8010607@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

I decided to produce a nice simple example, so that anyone could
hopefully replicate what I am seeing.

The scenario is the same as before (the 11 steps), but the CREATE TABLE
and COPY step has been reduced to:

CREATE TABLE test0 (filler VARCHAR(120));
COPY test0 FROM '/data0/dump/test0.dat' USING DELIMITERS ',';

Now the file 'test0.dat' consists of (128293) identical lines, each of
109 'a' characters (plus end of line)
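
For anyone making their own data file, a sketch of a generator in C, assuming
only the layout described above (each line is 109 'a' characters plus a
newline). The function name write_test_data is hypothetical, and the row
count is left as a parameter so it can be varied around the point where a
second log gets archived.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write `rows` identical lines of 109 'a' characters to `path`.
 * Returns 0 on success, -1 on any I/O error.
 * Hypothetical helper for illustration.
 */
int write_test_data(const char *path, long rows)
{
    char line[111];             /* 109 chars + '\n' + terminating NUL */
    FILE *fp;
    long i;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    memset(line, 'a', 109);
    line[109] = '\n';
    line[110] = '\0';

    for (i = 0; i < rows; i++) {
        if (fputs(line, fp) == EOF) {
            fclose(fp);
            return -1;
        }
    }
    return (fclose(fp) == 0) ? 0 : -1;
}
```

The lines contain no commas, so each one loads as a single value for the
filler column with the COPY command shown.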

A script to run the whole business can be found here :

http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz

(It will need a bit of editing for things like location of Pg, PGDATA,
and you will need to make your own data file)

The main points of interest are:
- anything <=128392 rows in test0.dat results in 1 archived log, and the
recovery succeeds
- anything >=128393 rows in test0.dat results in 2 or more archived
logs, and recovery fails on the second log (and gives the zero length
redo at 0/1FFFFE0 message).

Let me know if I can do any more legwork on this (I am considering
re-compiling with WAL_DEBUG now that example is simpler)

regards

Mark

Simon Riggs wrote:

>On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
>
>
>>I tried what I thought was a straightforward scenario, and seem to have
>>broken it :-(
>>
>>Here is the little tale
>>
>>1) initdb
>>2) set archive_mode and archive_dest in postgresql.conf
>>3) startup
>>4) create database called 'test'
>>5) connect to 'test' and type 'checkpoint'
>>6) backup PGDATA using 'tar -zcvf'
>>7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
>>8) shutdown and remove PGDATA
>>9) recover using 'tar -zxvf'
>>10) copy recovery.conf into PGDATA
>>11) startup
>>
>>This is what I get :
>>
>>LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
>>LOG: recovery command file found...
>>LOG: restore_program = cp %s/%s %s
>>LOG: recovery_target_inclusive = true
>>LOG: recovery_debug_log = true
>>LOG: starting archive recovery
>>LOG: restored log file "0000000000000000" from archive
>>LOG: checkpoint record is at 0/A48054
>>LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
>>LOG: next transaction ID: 496; next OID: 25419
>>LOG: database system was not properly shut down; automatic recovery in
>>progress
>>LOG: redo starts at 0/A48094
>>LOG: restored log file "0000000000000001" from archive
>>LOG: record with zero length at 0/1FFFFE0
>>LOG: redo done at 0/1FFFF30
>>LOG: restored log file "0000000000000001" from archive
>>LOG: restored log file "0000000000000001" from archive
>>PANIC: concurrent transaction log activity while database system is
>>shutting down
>>LOG: startup process (PID 13492) was terminated by signal 6
>>LOG: aborting startup due to startup process failure
>>
>>The concurrent access is a bit of a puzzle, as this is my home machine
>>(i.e. I am *sure* no one else is connected!)
>>
>>
>>
>
>I can see what is wrong now, but you'll have to help me on details your
>end...
>
>The log shows that xlog 1 was restored from archive. It contains a zero
>length record, which indicates that it isn't yet full (or thats what the
>existing recovery code assumes it means). Which also indicates that it
>should never have been archived in the first place, and should not
>therefore be a candidate for a restore from archive.
>
>The double message "restored log file" can only occur after you've
>retrieved a partially full file from archive - which as I say, shouldn't
>be there.
>
>Other messages are essentially spurious in those circumstances.
>
>Either:
>- somehow the files have been mixed up in the archive directory, which
>is possible if the filing discipline is not strict - various ways,
>unfortunately I would guess this to be the most likely, somehow
>- the file that has been restored has been damaged in some way
>- the archiver has archived a file too early (very unlikely, IMHO -
>thats the most robust bit of the code)
>- some aspect of the code has written a zero length record to WAL (which
>is supposed to not be possible, but we mustn't discount an error in
>recent committed work)
>
>- there may also be an effect going on with checkpoints that I don't
>understand...spurious checkpoint warning messages have already been
>observed and reported,
>
>Best regards, Simon Riggs
>
>
>
>
>
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-18 06:15:34
Message-ID:	28105.1090131334@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> - anything >=128393 rows in test0.dat results in 2 or more archived
> logs, and recovery fails on the second log (and gives the zero length
> redo at 0/1FFFFE0 message).

Zero length record is not an error, it's the normal way of detecting
end-of-log.
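
A toy illustration of that convention, using a deliberately simplified record
layout (a 4-byte length in host byte order followed by the payload). This is
an assumption made for the example and is not the real XLOG record format;
the function name find_end_of_log is hypothetical as well.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Walk length-prefixed records in buf and return the offset at which
 * the valid log ends. A zero length means "nothing written here yet",
 * which is treated as the normal end of log, not as an error.
 * The number of records seen is stored in *nrecords if it is non-NULL.
 * Toy format for illustration only.
 */
size_t find_end_of_log(const unsigned char *buf, size_t size,
                       size_t *nrecords)
{
    size_t off = 0;
    size_t count = 0;

    while (off + sizeof(uint32_t) <= size) {
        uint32_t len;

        memcpy(&len, buf + off, sizeof len);
        if (len == 0)
            break;              /* zero length: end of log reached */
        if (len < sizeof len || len > size - off)
            break;              /* torn or corrupt record: stop here too */
        off += len;
        count++;
    }
    if (nrecords != NULL)
        *nrecords = count;
    return off;
}
```

Replay simply stops at the returned offset, which is why a "record with zero
length" line in the recovery log is expected output.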

regards, tom lane

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-18 06:48:09
Message-ID:	40FA1D29.1040100@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

There are some silly bugs in the script:

- forgot to export PGDATA and PATH after changing them
- forgot to mention the need to edit test.sql (COPY line needs path to
dump file)

Apologies - I will submit a fixed version a little later

regards

Mark

Mark Kirkwood wrote:

>
> A script to run the whole business can be found here :
>
> http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz
>
> (It will need a bit of editing for things like location of Pg, PGDATA,
> and you will need to make your own data file)
>

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-18 08:05:42
Message-ID:	40FA2F56.3080306@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

fixed.

Mark Kirkwood wrote:

> There are some silly bugs in the script:
>
> - forgot to export PGDATA and PATH after changing them
> - forgot to mention the need to edit test.sql (COPY line needs path to
> dump file)
>
> Apologies - I will submit a fixed version a little later
>
> regards
>
> Mark
>
> Mark Kirkwood wrote:
>
>>
>> A script to run the whole business can be found here :
>>
>> http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz
>>
>> (It will need a bit of editing for things like location of Pg,
>> PGDATA, and you will need to make your own data file)
>>
>

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-18 20:20:52
Message-ID:	1090182052.17493.18904.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Sun, 2004-07-18 at 06:04, Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > So you want to merge them all into a single command string. That does
> > seem less error-prone. I see a few variables that turn off
> > when set to '' like unix_socket_*. How would this command string work?
> > How do you specify the WAL file name to transfer?
>

GUC-wise, I implemented what we agreed in discussions...

There are many things in need of refactoring, so my focus was on
delivering what we agreed, even knowing it would probably change...

A few notes on the patch (as I submitted it - so as not to confuse with
other versions being worked upon)
- archive_dest is definitely used in both archive and recovery. There
wasn't much need for this GUC apart from that and I think we are better
off without it. Removing it improves recovery flexibility (we cannot
assume the recovery is taking place in anything like the original
configuration).

- archive_mode I would prefer to keep - it is explicit then which mode
you are in, rather than implicit from the command string. In all other
ways I agree with everything Tom has said. It allows us to talk about
"being in archive_mode" without people saying "but I can't work out how
to turn archive mode on".

When archiver starts the FIRST thing it does is run a test to confirm
that the command string works, so setting archive_command to '' would
simply generate an error.

Also, I would suggest this:
- changing archive mode requires a postmaster restart
- changing archive command should just be a SIGHUP...we don't want to
force a restart just to switch to a new kind of archiving

If you can only change archive_program at postmaster start that is
restrictive, but making that SIGHUP would allow people to set it to ''
and turn off archiving while postmaster is up == lurking fault.

> No different from before, necessarily. However I did not like the
> restriction to a single %s in the submitted implementation. What I
> have in my local copy is
> %p -> full path of XLOG file to be archived
> %f -> base name of XLOG file to be archived
> and the suggested example becomes
> archive_command = 'cp %p /mnt/server/pgarchive/%f'
>

I'm happy with those changes and would have done them myself given
time... the 2 or 3 %s parameters wasn't the most user friendly way of
doing it.
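
As a rough illustration of the %p / %f substitution quoted above, here is a minimal sketch. The helper name, the sample segment path and the handling of %% as a literal percent sign are assumptions made for the example, not details taken from the patch.

```python
import os

def expand_archive_command(template, xlog_path):
    """Hypothetical helper: replace %p with the full path of the XLOG
    file and %f with its base name. Treating %% as a literal percent
    sign is an assumption of this sketch."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "p":
                out.append(xlog_path)
                i += 2
                continue
            if nxt == "f":
                out.append(os.path.basename(xlog_path))
                i += 2
                continue
            if nxt == "%":
                out.append("%")
                i += 2
                continue
        # Any other character is copied through unchanged.
        out.append(ch)
        i += 1
    return "".join(out)

# The data directory and segment name below are made up for illustration.
print(expand_archive_command(
    "cp %p /mnt/server/pgarchive/%f",
    "/usr/local/pgsql/data/pg_xlog/0000000000000001",
))
```

Because %f is available separately, the destination can name the file explicitly instead of relying on the target being a directory.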

> Note that this example immediately eliminates one of the failure modes
> Simon enumerates in his README, which is to try 'cp %s /foo' where /foo
> isn't a directory. More generally, though, *only* a cp-to-directory
> solution is likely to be very happy with not being able to get at the
> base file name. Yes you can make a shellscript and use basename,
> but I don't think you should have to do that if it could otherwise
> be a one-liner.
>

Good.

> (In case it's not obvious from the above, I am hacking with intent to
> commit soon. Maybe tomorrow, if my wife doesn't make me paint the
> bathroom instead...)
>
...just returned from there... :)

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 03:03:01
Message-ID:	8063.1090206181@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Latest version, pitr_v5_2.patch...

Reviewed and committed with some adjustments.

I see the following significant loose ends:

* Documentation is, um, lacking. (One point in particular is that I
inserted the recovery.conf.sample file into CVS, but did not fill in
the patch's lack of attempt to install it anywhere.)

* As Bruce has pointed out already, the process of making a backup
needs some improvements for more safety: the starting and ending WAL
offsets have got to be recorded somehow.

* As I have pointed out already, we need to invent "timelines" to
allow incompatible WAL segments to exist side-by-side. I will volunteer
to look into this.

* I think creating a .ready file during XLogFileOpen is completely bogus,
for reasons mentioned in committed comments (look for XXX). Possibly
this can go away with timelines.

* I am wondering if it wouldn't be a good idea to remove the local copy
of any segment we successfully obtain from archive. The existing
comments note that we might get a wrong or corrupted file from archive,
but aren't we in at least as much risk of using an obsolete segment
restored from backup if we leave the local segment in place? (The
archive recovery run itself will know not to do this, but if we crash
shortly thereafter, the ensuing recovery run would NOT know not to
trust such files.)

Perhaps the last point is really a backup-process issue. AFAICS there
is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
all, and some good reasons for it not to. Can we redesign either the
backup process or the disk layout so that that will not happen? Then
we could stop worrying about stale local pg_xlog files.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 03:13:50
Message-ID:	8192.1090206830@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> When archiver starts the FIRST thing it does is run a test to confirm
> that the command string works, so setting archive_command to '' would
> simply generate an error.

No, it would do no such thing; the test cannot really tell anything more
than whether system("foo") returns zero ... and at least on my machine,
system("") returns zero. It certainly does not prove that any data went
to anyplace safe.
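
The point can be reproduced with a short sketch. It assumes a POSIX shell behind `subprocess`; the function name is illustrative only, and the result for the empty string may differ on other platforms.

```python
import subprocess

def command_succeeds(command):
    """Return True when the shell reports exit status 0 for the command."""
    return subprocess.run(command, shell=True).returncode == 0

# An empty command string archives nothing at all, yet a POSIX shell
# still reports success, so a zero exit status alone proves very little.
print(command_succeeds(""))
# A command that genuinely fails is the only case such a test can catch.
print(command_succeeds("exit 1"))
```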

I diked that test out of the committed patch because I felt it cluttered
the archive area without actually proving anything of interest. We can
revisit the point if you like.

> Also, I would suggest this:
> - changing archive mode requires a postmaster restart

Why?

> - changing archive command should just be a SIGHUP...

Check, as committed [and tested to work...]

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 03:14:10
Message-ID:	200407190314.i6J3EAJ07392@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

What is the process of logging to tape? Ideally we could just do 'dd'
to the tape drive in append mode; however we need a way of signalling
that we want to change tapes.

The only method I can think of is to have PITR dump the files into a
holding directory, and have a daemon that scans the directory and writes
files to tape when they are completely copied (how do we detect that?
Use 'mv' after the copy? Seems like a good use for our new %
parameters). Then we need a control program to signal the daemon to
stop archiving to tape, have it set a flag file so we know it has
suspended tape writes, report that back to the client, change tapes,
then tell it to restart.
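
The "copy, then mv" idea for the holding directory might look like the sketch below. The function names and the `.tmp` suffix are assumptions for illustration; the rename is only atomic when the temporary and final names live on the same filesystem.

```python
import os
import shutil

def stage_segment(src_path, holding_dir):
    """Copy src_path into holding_dir so that it becomes visible under
    its final name only once the copy is complete."""
    name = os.path.basename(src_path)
    tmp_path = os.path.join(holding_dir, name + ".tmp")
    final_path = os.path.join(holding_dir, name)
    with open(src_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())  # get the bytes onto disk before publishing
    os.replace(tmp_path, final_path)  # the 'mv' step
    return final_path

def complete_segments(holding_dir):
    """List the files a tape-writing daemon may safely pick up,
    skipping any copy that is still in progress."""
    return sorted(n for n in os.listdir(holding_dir) if not n.endswith(".tmp"))
```

A daemon scanning the directory then never sees a partially copied file under its real name.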

I am asking to make sure we don't need a PITR pause mode that prevents
WAL files from being archived but also prevents them from being
recycled. If we did that, we could probably append to tape directly,
but then we need to go into "pause archive" mode in the PITR process,
and such switching seems like a pain and the wrong place to do it.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > So you want to merge them all into a single command string. That does
> > seem less error-prone. I see a few variables that turn off
> > when set to '' like unix_socket_*. How would this command string work?
> > How do you specify the WAL file name to transfer?
>
> No different from before, necessarily. However I did not like the
> restriction to a single %s in the submitted implementation. What I
> have in my local copy is
> %p -> full path of XLOG file to be archived
> %f -> base name of XLOG file to be archived
> and the suggested example becomes
> archive_command = 'cp %p /mnt/server/pgarchive/%f'
>
> Note that this example immediately eliminates one of the failure modes
> Simon enumerates in his README, which is to try 'cp %s /foo' where /foo
> isn't a directory. More generally, though, *only* a cp-to-directory
> solution is likely to be very happy with not being able to get at the
> base file name. Yes you can make a shellscript and use basename,
> but I don't think you should have to do that if it could otherwise
> be a one-liner.
>
> (In case it's not obvious from the above, I am hacking with intent to
> commit soon. Maybe tomorrow, if my wife doesn't make me paint the
> bathroom instead...)
>
> regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 03:21:31
Message-ID:	200407190321.i6J3LVY08349@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Latest version, pitr_v5_2.patch...
>
> Reviewed and committed with some adjustments.
>
> I see the following significant loose ends:
>
> * Documentation is, um, lacking. (One point in particular is that I
> inserted the recovery.conf.sample file into CVS, but did not fill in
> the patch's lack of attempt to install it anywhere.)

I figure it should go in share like the other sample files, and tell
people to copy it to /data and modify it for recovery.

> * As Bruce has pointed out already, the process of making a backup
> needs some improvements for more safety: the starting and ending WAL
> offsets have got to be recorded somehow.

Yep, we need those files in the archive location and the /data directory
tarball.

> * As I have pointed out already, we need to invent "timelines" to
> allow incompatible WAL segments to exist side-by-side. I will volunteer
> to look into this.

Great.

> * I think creating a .ready file during XLogFileOpen is completely bogus,
> for reasons mentioned in committed comments (look for XXX). Possibly
> this can go away with timelines.
>
> * I am wondering if it wouldn't be a good idea to remove the local copy
> of any segment we successfully obtain from archive. The existing
> comments note that we might get a wrong or corrupted file from archive,
> but aren't we in at least as much risk of using an obsolete segment
> restored from backup if we leave the local segment in place? (The
> archive recovery run itself will know not to do this, but if we crash
> shortly thereafter, the ensuing recovery run would NOT know not to
> trust such files.)

> Perhaps the last point is really a backup-process issue. AFAICS there
> is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
> all, and some good reasons for it not to. Can we redesign either the
> backup process or the disk layout so that that will not happen? Then
> we could stop worrying about stale local pg_xlog files.

Seems we should just clear out the /pg_xlog directory before we start
recovery. We are going to rename recovery.conf to recovery.in-progress
or something to prevent us from clearing out the directory after a
crash, right? (I see you rename recovery.conf to recovery.done. Is
that wise? I thought we would disable recovery after a crash, or does
it just keep going? If so, nice.)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 03:49:22
Message-ID:	8453.1090208962@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> * Documentation is, um, lacking. (One point in particular is that I
>> inserted the recovery.conf.sample file into CVS, but did not fill in
>> the patch's lack of attempt to install it anywhere.)

> I figure it should go in share like the other sample files, and tell
> people to copy it to /data and modify it for recovery.

It should certainly go to /share as a .sample file. I was thinking that
initdb should perhaps copy it into $PGDATA (still as .sample, not as
.conf!) so it'd be right there when you need it.

>> Perhaps the last point is really a backup-process issue. AFAICS there
>> is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
>> all, and some good reasons for it not to.

> Seems we should just clear out the /pg_xlog directory before we start
> recovery.

No, that's a horrid idea, because it loses the ability to combine
archival xlog files with recent files in /pg_xlog that are not yet
archived. We need to distinguish old files that were accidentally
captured by backup from very-recent files. I think the cleanest way to
do that is for backup not to capture them in the first place.

> We are going to rename recovery.conf to recovery.in-progress
> or something to prevent us from clearing out the directory after a
> crash, right?

I had second thoughts about that and didn't do it in the committed
patch, though it's certainly still open for debate.

> (I see you rename recovery.conf to recovery.done. Is
> that wise?

Yes. Once you've done with a PITR recovery you definitely do *not* want
a subsequent crash recovery to think it should obey your recovery_target
limit. But if you fail before you've finished the recovery run it
should theoretically be okay to retry, so I didn't add code to rename to
"recovery.inprogress". We can certainly add it later if we decide it's
a good idea.
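
The rename described here can be pictured with a small sketch. The function names are hypothetical and the logic is only an outline of the behaviour described in this message, not the committed code.

```python
import os

def in_archive_recovery(data_dir):
    """Recovery settings apply only while recovery.conf is present."""
    return os.path.exists(os.path.join(data_dir, "recovery.conf"))

def finish_recovery(data_dir):
    """After a successful recovery run, rename recovery.conf so that a
    later crash recovery does not obey the old recovery target."""
    conf = os.path.join(data_dir, "recovery.conf")
    done = os.path.join(data_dir, "recovery.done")
    if os.path.exists(conf):
        os.replace(conf, done)
```

If the run fails before `finish_recovery` is reached, recovery.conf is still in place and the recovery can simply be retried.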

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 04:54:13
Message-ID:	8886.1090212853@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> What is the process of logging to tape? Ideally we could just do 'dd'
> to the tape drive in append mode; however we need a way of signalling
> that we want to change tapes.

The reason we use a user-specifiable shell command for archiving is
so that we do not have to answer the above ;-). It's the user's problem
to write a shell script that does things the way he wants. He can make
it connect to /dev/tty and ask the operator to swap tapes, or whatever.

Personally I am very accustomed to Hewlett-Packard's disk-to-tape backup
program "fbackup", which allows you to provide a shell script to handle
exactly this sort of thing, and it's worked well for me for many years.

> I am asking to make sure we don't need a PITR pause mode that prevents
> WAL files from being archived but also prevents them from being
> recycled.

WAL files will not be recycled until the archiver daemon has set a .done
flag file for them, so I see no problem here. (Note: I took out some
code in Simon's original patch that would start bleating on the basis
of totally unsupportable assumptions about how long archival of a log
segment "ought to" take.)
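
A minimal sketch of the flag-file rule follows. It assumes a status directory holding one `.ready` or `.done` file per segment; the directory layout, the function names and the ready-to-done rename are illustrative assumptions.

```python
import os

def mark_ready(status_dir, segment):
    """Flag a finished segment as waiting to be archived."""
    open(os.path.join(status_dir, segment + ".ready"), "w").close()

def mark_done(status_dir, segment):
    """Flip the flag once the archive command has reported success."""
    os.replace(os.path.join(status_dir, segment + ".ready"),
               os.path.join(status_dir, segment + ".done"))

def can_recycle(status_dir, segment):
    """A segment may be reused only after its .done flag exists."""
    return os.path.exists(os.path.join(status_dir, segment + ".done"))
```

Under this rule a slow or paused archive script delays recycling rather than losing WAL, with no separate pause mode.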

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 06:39:03
Message-ID:	1090219142.17493.22023.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Mon, 2004-07-19 at 04:13, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > When archiver starts the FIRST thing it does is run a test to confirm
> > that the command string works, so setting archive_command to '' would
> > simply generate an error.
>
> No, it would do no such thing; the test cannot really tell anything more
> than whether system("foo") returns zero ... and at least on my machine,
> system("") returns zero. It certainly does not prove that any data went
> to anyplace safe.
>
> I diked that test out of the committed patch because I felt it cluttered
> the archive area without actually proving anything of interest. We can
> revisit the point if you like.
>

If the test doesn't guarantee success, then it needs to go....

Thanks for removing it.

Best Regards, Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 07:35:05
Message-ID:	1090222505.17493.22360.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Mon, 2004-07-19 at 04:03, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Latest version, pitr_v5_2.patch...
>
> Reviewed and committed with some adjustments.
>

Wow! Thanks very much - you work fast.

I'll be re-testing later today.

> I see the following significant loose ends:
>
> * Documentation is, um, lacking. (One point in particular is that I
> inserted the recovery.conf.sample file into CVS, but did not fill in
> the patch's lack of attempt to install it anywhere.)
>

Yes...wasn't sure what to do with that. Is everybody happy to install it
as a sample into the main Data Directory? (i.e. as recovery.conf.sample
rather than recovery.conf which would be a bad thing).

> * As Bruce has pointed out already, the process of making a backup
> needs some improvements for more safety: the starting and ending WAL
> offsets have got to be recorded somehow.
>

Haven't got to that yet, but will do.

> * As I have pointed out already, we need to invent "timelines" to
> allow incompatible WAL segments to exist side-by-side. I will volunteer
> to look into this.

Yes, discussing on the other thread.

>
> * I think creating a .ready file during XLogFileOpen is completely bogus,
> for reasons mentioned in committed comments (look for XXX). Possibly
> this can go away with timelines.

Yes, to some extent it would go away with timelines.

If you have a local copy at the end of a timeline that isn't archived,
then it seems a good idea to archive it, or at least copy it somewhere
safe. If you don't then you will not be able to revert to a full
recovery of that timeline in the future should you choose to do so.

The code and its location may be somewhat more suspect.... :)

>
> * I am wondering if it wouldn't be a good idea to remove the local copy
> of any segment we successfully obtain from archive. The existing
> comments note that we might get a wrong or corrupted file from archive,
> but aren't we in at least as much risk of using an obsolete segment
> restored from backup if we leave the local segment in place? (The
> archive recovery run itself will know not to do this, but if we crash
> shortly thereafter, the ensuing recovery run would NOT know not to
> trust such files.)
>

I agree they're a loose end that needs some thought.

I avoided that decision by going around the files. We originally agreed
that we would keep that data....reason was you can't tell whether the
files have been restored by a backup that forgot to exclude pg_xlog, or
that we are choosing to do a PITR recovery on an otherwise healthy
system (or as the comments explain maybe we lost everything except
pg_xlog).

If we crash during recovery it doesn't crash recover and restart.

If we crash after recovery, then the checkpoint record will have moved
forward and so we don't then accidentally re-use those local copies.

Timelines will solve this...
>
> Perhaps the last point is really a backup-process issue. AFAICS there
> is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
> all, and some good reasons for it not to. Can we redesign either the
> backup process or the disk layout so that that will not happen? Then
> we could stop worrying about stale local pg_xlog files.
>

That's the way I saw it.

Seems fairly easy to say "don't backup pg_xlog", but you can't guarantee
they won't, even if you tell them not to...

What is stale today may be considered to be actually your best option
when testing to see whether a recovery has achieved your objectives.

I'll read the whole patch, your comments and test before I respond
further. Thanks for working so hard on this, so quickly.

Best Regards, Simon Riggs

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 16:35:05
Message-ID:	200407191635.i6JGZ5A21033@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> * Documentation is, um, lacking. (One point in particular is that I
> >> inserted the recovery.conf.sample file into CVS, but did not fill in
> >> the patch's lack of attempt to install it anywhere.)
>
> > I figure it should go in share like the other sample files, and tell
> > people to copy it to /data and modify it for recovery.
>
> It should certainly go to /share as a .sample file. I was thinking that
> initdb should perhaps copy it into $PGDATA (still as .sample, not as
> .conf!) so it'd be right there when you need it.

I think /share is best. I see other *.sample files that aren't used until
you rename them and move them to the right directory, and
recovery.conf.sample seems the same. I think having the sample at the
top of data when for most people it will be unused is strange.

> >> Perhaps the last point is really a backup-process issue. AFAICS there
> >> is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
> >> all, and some good reasons for it not to.
>
> > Seems we should just clear out the /pg_xlog directory before we start
> > recovery.
>
> No, that's a horrid idea, because it loses the ability to combine
> archival xlog files with recent files in /pg_xlog that are not yet
> archived. We need to distinguish old files that were accidentally
> captured by backup from very-recent files. I think the cleanest way to
> do that is for backup not to capture them in the first place.

I am confused. Aren't we always doing a restore from a backup? Are you
saying there are cases where we aren't and need the stuff in pg_xlog?
Are you saying we might have some new WAL files that we want to add to
pg_xlog before we do the restore, like the most recent WAL that wasn't
archived because it wasn't finished? Why would we be doing a recover if
we had such files? I see your point that we wouldn't know which file
to use, the archive version or the pg_xlog version, but actually
wouldn't the archive version always be preferred because we would know
it to be complete?

I don't see any reliable way to prevent people from having pg_xlog in
their backups seeing they might use snapshots, tar, etc.

> > We are going to rename recovery.conf to recovery.in-progress
> > or something to prevent us from clearing out the directory after a
> > crash, right?
>
> I had second thoughts about that and didn't do it in the committed
> patch, though it's certainly still open for debate.

How are we handling a crash during recovery?

> > (I see you rename recovery.conf to recovery.done. Is
> > that wise?
>
> Yes. Once you've done with a PITR recovery you definitely do *not* want
> a subsequent crash recovery to think it should obey your recovery_target
> limit. But if you fail before you've finished the recovery run it
> should theoretically be okay to retry, so I didn't add code to rename to
> "recovery.inprogress". We can certainly add it later if we decide it's
> a good idea.

Ah, OK, so it just keeps going. However, we don't know if what is in
pg_xlog was in the process of being copied from the archive at the time
of the crash, no? In fact I am wondering if we should be transferring
the archive files into temporary names then doing an 'mv' to make them
current so we don't get partial files in pg_xlog. However, we can't do
that because we are using a user-supplied command line. Should we pass
a fake name to the command string then do the 'mv' ourselves? With WAL
now, we do an fsync so we know the contents are crash-proof, but I am
not sure how to do that during recovery. I guess this gets back to how
to handle the contents of pg_xlog during recovery.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 16:56:04
Message-ID:	23306.1090256164@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> It should certainly go to /share as a .sample file. I was thinking that
>> initdb should perhaps copy it into $PGDATA (still as .sample, not as
>> .conf!) so it'd be right there when you need it.

> I think /share is best.

Okay, we agree on that part at least; I'll take care of it. If anyone
wants to argue for further copying during initdb, that can be added
later.

> I am confused. Aren't we always doing a restore from a backup?

No. This code serves two purposes: recovery from archived WAL and
point-in-time recovery. You might want to do a PITR run at a time
where not all your WAL segments have been pushed to archive. Indeed
the latest one can never be so pushed, since it's unfinished. Suppose
you are trying to do PITR recovery to a time just a few minutes ago
that is still in the latest WAL segment --- there is simply not any
legal way to have that come from the archive.

So we can't simply zero out pg_xlog at the start of a PITR run, even
if there weren't a don't-destroy-data argument against it.
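
To make the argument concrete, here is a sketch of a segment lookup that uses both sources. The function name and the rule of trying the archive first are assumptions for illustration only.

```python
import os

def locate_segment(name, archive_dir, xlog_dir):
    """Return the path to use for a WAL segment, or None if it is missing.

    Assumption of this sketch: an archived copy is tried first, and the
    local pg_xlog copy is used for segments that were never archived,
    such as the latest, unfinished one.
    """
    archived = os.path.join(archive_dir, name)
    if os.path.exists(archived):
        return archived
    local = os.path.join(xlog_dir, name)
    if os.path.exists(local):
        return local
    return None
```

Emptying pg_xlog before the run would leave the second branch with nothing to return for the most recent segment.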

>> I had second thoughts about that and didn't do it in the committed
>> patch, though it's certainly still open for debate.

> How are we handling a crash during recovery?

Retry, perhaps. It doesn't seem any different from crash-during-recovery
in the non-archived scenario ...

> Ah, OK, so it just keeps going. However, we don't know if what is in
> pg_xlog was in the process of being copied from the archive at the time
> of the crash, no?

Nonissue. It goes into RECOVERYXLOG and we never assume that that's
initially good. See RestoreArchivedXLog().

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 20:24:52
Message-ID:	200407192024.i6JKOrO16297@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> It should certainly go to /share as a .sample file. I was thinking that
> >> initdb should perhaps copy it into $PGDATA (still as .sample, not as
> >> .conf!) so it'd be right there when you need it.
>
> > I think /share is best.
>
> Okay, we agree on that part at least; I'll take care of it. If anyone
> wants to argue for further copying during initdb, that can be added
> later.
>
> > I am confused. Aren't we always doing a restore from a backup?
>
> No. This code serves two purposes: recovery from archived WAL and
> point-in-time recovery. You might want to do a PITR run at a time
> where not all your WAL segments have been pushed to archive. Indeed
> the latest one can never be so pushed, since it's unfinished. Suppose
> you are trying to do PITR recovery to a time just a few minutes ago
> that is still in the latest WAL segment --- there is simply not any
> legal way to have that come from the archive.
>
> So we can't simply zero out pg_xlog at the start of a PITR run, even
> if there weren't a don't-destroy-data argument against it.

If we had some code that checks pg_xlog on recovery startup, it could
rename each pg_xlog file and then recover the file from the archive. If
it doesn't exist or is truncated, discard it. If it is the right size,
we need to check to see which one has a WAL end-of-segment marker (we
have one of those, right?). This would seem to catch all the cases:

o file brought back by tar, but complete file in archive
o archive in process of writing during crash
o partially full file in pg_xlog

What it doesn't cover are cases where tar gets a partial copy of a
pg_xlog file but the file never made it to archive yet, and a new
pg_xlog file was created and we get some of that file too. In fact, the
backup could get holes in the pg_xlog file where the backup has zeros
but the real file had data added to it after the zeros:

in tar  XXXXX 00000 XXXXX

real    XXXXX XXXXX XXXXX

This could happen when the file has this:

XXXXX 00000 00000

backup reads this:

XXXXX 00000

database writes this:

XXXXX XXXXX XXXXX

backup reads the remainder of the file:

XXXXX 00000 XXXXX

In this case the end-of-segment marker doesn't even help us, and there
might not be an archive copy of this because archiving hasn't happened yet.
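
As a rough illustration of the torn read described above, here is a toy
model in Python. It is only a sketch: the block strings, the function
name torn_copy, and the idea that the writer fills the file between two
reads of the backup are assumptions for illustration, not anything taken
from the PostgreSQL source or from tar.

```python
def torn_copy(initial_blocks, final_blocks, blocks_read_before_write):
    """Copy a file block by block while a writer fills it in midway."""
    live = list(initial_blocks)
    copied = []
    for i in range(len(live)):
        if i == blocks_read_before_write:
            # The database writes the rest of the file between two reads.
            live = list(final_blocks)
        copied.append(live[i])
    return copied


initial = ["XXXXX", "00000", "00000"]   # file when the backup starts
final = ["XXXXX", "XXXXX", "XXXXX"]     # file after the database writes
backup = torn_copy(initial, final, 2)   # backup had read two blocks so far

print("in tar", " ".join(backup))
print("real  ", " ".join(final))
```

The copy ends up with a zeroed middle block that the real file never had
once it was full, which is the hole in question.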

I think I see a solution. We are going to create a file during backup so
we know the wal offsets and xids. If we see that file, we know either
we have a restore of a backup or they are currently running a backup. If we
tell them not to restore while a backup is running (seems pretty
obvious) we can then delete pg_xlog when the backup wal offset file
exists. In other cases, we know the WAL files are valid to use.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 20:45:34
Message-ID:	1090269932.28049.25.camel@stromboli

On Mon, 2004-07-19 at 05:54, Tom Lane wrote:
> code in Simon's original patch that would start bleating

Code that bleats? LOL :) (is that a new log level?)

Some of it was perhaps a little woolly....

You've made my day, Simon Riggs (still laughing)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 21:24:06
Message-ID:	29939.1090272246@sss.pgh.pa.us

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> we need to check to see which one has a WAL end-of-segment marker (we
> have one of those, right?).

No, we don't.

> I think I see a solution. We are going to create a file during backup so
> we know the wal offsets and xids. If we see that file, we know either
> we have a restore of a backup or they are currently running a backup.

... or the last backup attempt failed, but they forgot to remove the
file it left. Or we are doing crash recovery after the system lost
power while a backup was running. Or half a dozen other obvious scenarios.

> If we tell them not to restore while a backup is running (seems pretty
> obvious) we can then delete pg_xlog when the backup wal offset file
> exists. In other cases, we know the WAL files are valid to use.

We're not deleting pg_xlog, period. IMHO it's too dangerous even to
have such a function in the code.

My original suggestion was to *replace* individual xlog files with data
extracted from archive, and only after determining that the archive
indeed has a copy of that particular file (and we can fetch it).
This at least has a fighting chance of not losing information. Wiping
pg_xlog in toto on the basis of a guess about the system status is just
a form of russian roulette. Sooner or later you will wipe some xlog
files that you can't get back from archive.
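
A minimal sketch of the replace-one-file-at-a-time idea, in Python rather
than backend C. Everything here is hypothetical: the function name
replace_from_archive, the ".fetching" scratch suffix, the second segment
name, and the use of a plain directory as the archive are assumptions made
for illustration only. The point is just the ordering: fetch first,
overwrite second, and never touch a local file the archive cannot supply.

```python
import os
import shutil
import tempfile


def replace_from_archive(xlog_dir, archive_dir, segment):
    """Overwrite one local segment, but only after its archive copy is fetched."""
    archived = os.path.join(archive_dir, segment)
    if not os.path.isfile(archived):
        return False                       # not in archive: leave local file alone
    staged = os.path.join(xlog_dir, segment + ".fetching")
    try:
        shutil.copyfile(archived, staged)  # fetch under a scratch name first
    except OSError:
        if os.path.exists(staged):
            os.remove(staged)
        return False                       # fetch failed: local file untouched
    os.replace(staged, os.path.join(xlog_dir, segment))
    return True


def read(path):
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as root:
    xlog = os.path.join(root, "pg_xlog")
    archive = os.path.join(root, "archive")
    os.mkdir(xlog)
    os.mkdir(archive)
    # Two local segments; only the first has made it to the archive.
    for name in ("0000000000000000", "0000000000000001"):
        with open(os.path.join(xlog, name), "wb") as f:
            f.write(b"local copy, possibly torn")
    with open(os.path.join(archive, "0000000000000000"), "wb") as f:
        f.write(b"complete archived copy")

    replaced_old = replace_from_archive(xlog, archive, "0000000000000000")
    replaced_new = replace_from_archive(xlog, archive, "0000000000000001")
    old_contents = read(os.path.join(xlog, "0000000000000000"))
    new_contents = read(os.path.join(xlog, "0000000000000001"))
```

The unarchived segment survives untouched, which is the property a
wholesale wipe of pg_xlog cannot offer.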

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 21:36:38
Message-ID:	200407192136.i6JLacL06105@candle.pha.pa.us

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > we need to check to see which one has a WAL end-of-segment marker (we
> > have one of those, right?).
>
> No, we don't.
>
> > I think I see a solution. We are going to create a file during backup so
> > we know the wal offsets and xids. If we see that file, we know either
> > we have a restore of a backup or they are currently running a backup.
>
> ... or the last backup attempt failed, but they forgot to remove the
> file it left. Or we are doing crash recovery after the system lost
> power while a backup was running. Or half a dozen other obvious scenarios.
>
> > If we tell them not to restore while a backup is running (seems pretty
> > obvious) we can then delete pg_xlog when the backup wal offset file
> > exists. In other cases, we know the WAL files are valid to use.
>
> We're not deleting pg_xlog, period. IMHO it's too dangerous even to
> have such a function in the code.
>
> My original suggestion was to *replace* individual xlog files with data
> extracted from archive, and only after determining that the archive
> indeed has a copy of that particular file (and we can fetch it).
> This at least has a fighting chance of not losing information. Wiping
> pg_xlog in toto on the basis of a guess about the system status is just
> a form of russian roulette. Sooner or later you will wipe some xlog
> files that you can't get back from archive.

OK, if you don't want to place restrictions on recovery, fine, but how
do you handle the case where the WAL file in the tar backup has holes
and there is no archive copy to use, because the file didn't make it to
the archive before the drive died? Can we detect holes in a WAL file
recovered from backup? We might, but I am afraid we might not.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-19 22:14:24
Message-ID:	1090275263.28049.378.camel@stromboli

On Mon, 2004-07-19 at 17:56, Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> I had second thoughts about that and didn't do it in the committed
> >> patch, though it's certainly still open for debate.
>
> > How are we handling a crash during recovery?
>
> Retry, perhaps. It doesn't seem any different from crash-during-recovery
> in the non-archived scenario ...
>

Well, a recovery is just re-applying already written logs at super
speed. We don't need to write WAL because we already wrote it once (and
that would really confuse the timeline issue).

I think if this was an issue, the solution would be to speed up recovery
since that would benefit us more than putting recovery-squared code in.

Just start over...

Best Regards, Simon Riggs

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 00:48:50
Message-ID:	40FC6BF2.5050701@coretech.co.nz

I have been doing some re-testing with CVS HEAD from about 1 hour ago
using the simplified example posted previously.

It is quite interesting:

i) create the table as:

CREATE TABLE test0 (filler TEXT);

and COPY 100 000 rows of length 109, then recovery succeeds.

ii) create the table as:

CREATE TABLE test0 (filler VARCHAR(120));

and COPY as above, then recovery *fails* with the signal 6 error below.

LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/A4807C
LOG: record with zero length at 0/FFFFE0
LOG: redo done at 0/FFFF30
LOG: restored log file "0000000000000000" from archive
LOG: archive recovery complete
PANIC: concurrent transaction log activity while database system is shutting down
LOG: startup process (PID 17546) was terminated by signal 6
LOG: aborting startup due to startup process failure

(I am pretty sure both TEXT and VARCHAR(120) failed using the original
patch)

Any suggestions for the best way to dig a bit deeper?

regards

Mark

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-20 01:22:42
Message-ID:	40FC73E2.4080100@familyhealth.com.au

> Okay, we agree on that part at least; I'll take care of it. If anyone
> wants to argue for further copying during initdb, that can be added
> later.

I reckon it should be copied into $PGDATA :) Otherwise, when I'm in a
panic at recovery time, I'd have to figure out where the heck my package
has installed the share conf file to; conf files usually aren't in
share, etc., etc.

Chris

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 04:14:29
Message-ID:	3607.1090296869@sss.pgh.pa.us

Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> I have been doing some re-testing with CVS HEAD from about 1 hour ago
> using the simplified example posted previously.

> It is quite interesting:

The problem seems to be that the computation of checkPoint.redo at
xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
allowing for the possibility that XLogInsert will decide it doesn't
want to split the checkpoint record across XLOG files, and will then
insert a WASTED_SPACE record to avoid that (see comment and following
code at lines 758-795). This wouldn't really matter except that there
is a safety crosscheck at line 4268 that tries to detect unexpected
insertions of other records during a shutdown checkpoint.

I think the code in CreateCheckPoint was correct when it was written,
because we only recently changed XLogInsert to not split records
across files. But it's got a boundary-case bug now, which your test
scenario is able to exercise by making the recovery run try to write
a shutdown checkpoint exactly at the end of a WAL file segment.
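
To make the boundary case concrete, here is a toy model in Python. It is
a sketch under loud assumptions: SEG_SIZE, the record lengths, and the
function names insert, shutdown_checkpoint and safety_check_passes are
all invented for illustration and do not correspond to the actual code in
xlog.c. It only shows why a redo pointer predicted before insertion stops
matching once the inserter pads out to the next file.

```python
SEG_SIZE = 64   # toy segment size in bytes (hypothetical)


def insert(pos, length):
    """Place a record at pos, skipping to the next segment if it would straddle one."""
    room = SEG_SIZE - pos % SEG_SIZE
    if length > room:
        pos += room            # wasted space up to the segment boundary
    return pos, pos + length


def shutdown_checkpoint(pos, ckpt_len):
    """Return (predicted redo pointer, where the checkpoint record really landed)."""
    predicted_redo = pos       # computed before the insert, ignoring padding
    start, _end = insert(pos, ckpt_len)
    return predicted_redo, start


def safety_check_passes(predicted_redo, start):
    # The crosscheck assumes nothing was inserted ahead of the checkpoint.
    return predicted_redo == start


mid_segment = shutdown_checkpoint(10, 20)   # plenty of room left
near_end = shutdown_checkpoint(50, 20)      # only 14 bytes left in the segment

print(mid_segment, safety_check_passes(*mid_segment))
print(near_end, safety_check_passes(*near_end))
```

In this model the check trips only in the second case, where padding
moved the record, even though no other backend wrote anything.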

The quick and dirty solution would be to dike out the safety check at
4268ff. I don't much care for that, but am too tired right now to work
out a better answer. I'm not real sure whether it's better to adjust
the computation of checkPoint.redo or to smarten the safety check
... but one or the other needs to allow for file-end padding, or maybe
we could hack some update of the state in WasteXLInsertBuffer(). (But
at some point you have to say "this is more trouble than it's worth",
so maybe we'll end up taking out the safety check.)

In any case this isn't a fundamental bug, just an insufficiently
smart safety check. But thanks for finding it! As is, the code has
a nonzero probability of failure in the field :-( and I don't know
how we'd have tracked it down without a reproducible test case.

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 09:04:29
Message-ID:	1090314268.28049.2308.camel@stromboli

> The problem seems to be that the computation of checkPoint.redo at
> xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
> allowing for the possibility that XLogInsert will decide it doesn't
> want to split the checkpoint record across XLOG files, and will then
> insert a WASTED_SPACE record to avoid that (see comment and following
> code at lines 758-795). This wouldn't really matter except that there
> is a safety crosscheck at line 4268 that tries to detect unexpected
> insertions of other records during a shutdown checkpoint.
>
> I think the code in CreateCheckPoint was correct when it was written,
> because we only recently changed XLogInsert to not split records
> across files. But it's got a boundary-case bug now, which your test
> scenario is able to exercise by making the recovery run try to write
> a shutdown checkpoint exactly at the end of a WAL file segment.
>

Thanks for locating that, I was suspicious of that piece of code, but it
would have taken me longer than this to locate it exactly.

It was clear (to me) that it had to be of this nature, since I've done a
fair amount of recovery testing and not hit anything like that.

> The quick and dirty solution would be to dike out the safety check at
> 4268ff. I don't much care for that, but am too tired right now to work
> out a better answer. I'm not real sure whether it's better to adjust
> the computation of checkPoint.redo or to smarten the safety check
> ... but one or the other needs to allow for file-end padding, or maybe
> we could hack some update of the state in WasteXLInsertBuffer(). (But
> at some point you have to say "this is more trouble than it's worth",
> so maybe we'll end up taking out the safety check.)
>

I'll take a look

> In any case this isn't a fundamental bug, just an insufficiently
> smart safety check. But thanks for finding it! As is, the code has
> a nonzero probability of failure in the field :-( and I don't know
> how we'd have tracked it down without a reproducible test case.

All code has a non-zero probability of failure in the field, it's just
that they don't usually tell you that. The main thing here is that we
write everything we need to write to the logs in the first place.

If that is true, then the code can always be adjusted or the logs dumped
and re-spliced to recover data.

Definitely: Thanks Mark! Reproducibility is key.

Best regards, Simon Riggs

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 09:45:39
Message-ID:	40FCE9C3.3080502@coretech.co.nz

Great that it's not fundamental - and hopefully with this discovery, the
probability you mentioned is being squashed towards zero a bit more :-)

Don't let this early bug detract from what is really a superb piece of work!

regards

Mark

Tom Lane wrote:

>In any case this isn't a fundamental bug, just an insufficiently
>smart safety check. But thanks for finding it! As is, the code has
>a nonzero probability of failure in the field :-( and I don't know
>how we'd have tracked it down without a reproducible test case.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 11:57:16
Message-ID:	1090324635.28049.2554.camel@stromboli

On Tue, 2004-07-20 at 05:14, Tom Lane wrote:
> Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> > I have been doing some re-testing with CVS HEAD from about 1 hour ago
> > using the simplified example posted previously.
>
> > It is quite interesting:
>
> The problem seems to be that the computation of checkPoint.redo at
> xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
> allowing for the possibility that XLogInsert will decide it doesn't
> want to split the checkpoint record across XLOG files, and will then
> insert a WASTED_SPACE record to avoid that (see comment and following
> code at lines 758-795). This wouldn't really matter except that there
> is a safety crosscheck at line 4268 that tries to detect unexpected
> insertions of other records during a shutdown checkpoint.
>
> I think the code in CreateCheckPoint was correct when it was written,
> because we only recently changed XLogInsert to not split records
> across files. But it's got a boundary-case bug now, which your test
> scenario is able to exercise by making the recovery run try to write
> a shutdown checkpoint exactly at the end of a WAL file segment.
>
> The quick and dirty solution would be to dike out the safety check at
> 4268ff.

Well, taking out the safety check isn't the answer.

The check produces the last error message "concurrent transaction...",
but it isn't the cause of the mismatch in the first place.

If you take out that check, we still fail because the wasted space at
the end is causing a "record with zero length" error.

> I'm not real sure whether it's better to adjust
> the computation of checkPoint.redo or to smarten the safety check
> ... but one or the other needs to allow for file-end padding, or maybe
> we could hack some update of the state in WasteXLInsertBuffer(). (But
> at some point you have to say "this is more trouble than it's worth",
> so maybe we'll end up taking out the safety check.)

...I'm looking at other options now.

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 12:51:54
Message-ID:	7787.1090327914@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> The quick and dirty solution would be to dike out the safety check at
>> 4268ff.

> If you take out that check, we still fail because the wasted space at
> the end is causing a "record with zero length" error.

Ugh. I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized. That was added here:

2004-02-11 17:55 tgl

* src/: backend/access/transam/xact.c,
backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
bin/pg_controldata/pg_controldata.c,
bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
include/access/xlog.h, include/access/xlogutils.h,
include/pg_config_manual.h, include/catalog/pg_control.h,
include/storage/smgr.h: Commit the reasonably uncontroversial parts
of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
segment file so that it can be reliably identified. Avoid
splitting WAL records across segment files (this is not strictly
necessary, but makes it simpler to incorporate the header records).
Make WAL entries for file creation, deletion, and truncation (as
foreseen but never implemented by Vadim). Also, add support for
making XLOG_SEG_SIZE configurable at compile time, similarly to
BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent
smgr API changes. initdb is forced due to changes in pg_control
contents.

There are other ways to do this, for example we could treat the WAL page
headers as variable-size, and stick the file labeling info into the
first page's header instead of making it be a separate record. The
separate-record way makes it easier to incorporate future additions to
the file labeling info, but I don't really think it's critical to allow
for that.
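
A small Python sketch of what a variable-size page header could look
like. The field layout, the magic value, the flag bit, and the two label
fields are all hypothetical, chosen only to show the shape of the idea: a
flag in the common header tells the reader whether the longer,
first-page-only form follows.

```python
import struct

SHORT_HDR = struct.Struct("<HHI")   # magic, flags, page number (hypothetical layout)
LONG_EXTRA = struct.Struct("<QI")   # system id, segment size (hypothetical label)
FLAG_LONG = 0x0001                  # set only on the first page of a segment file
MAGIC = 0xD05C                      # made-up magic number


def pack_header(page_no, label=None):
    """Build a page header; pass label=(system_id, seg_size) for the first page."""
    flags = FLAG_LONG if label is not None else 0
    hdr = SHORT_HDR.pack(MAGIC, flags, page_no)
    if label is not None:
        hdr += LONG_EXTRA.pack(*label)
    return hdr


def header_size(page):
    """Work out how many bytes to skip before the first record on this page."""
    _magic, flags, _page_no = SHORT_HDR.unpack_from(page)
    extra = LONG_EXTRA.size if flags & FLAG_LONG else 0
    return SHORT_HDR.size + extra


first_page = pack_header(0, label=(1234567890, 16 * 1024 * 1024))
later_page = pack_header(1)
first_size = header_size(first_page)
later_size = header_size(later_page)
```

With the label carried in the header, no separate record is needed at
the start of the file, so there is nothing extra for the record reader to
special-case.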

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 13:11:27
Message-ID:	1090329086.28049.2708.camel@stromboli

On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> >> The quick and dirty solution would be to dike out the safety check at
> >> 4268ff.
>
> > If you take out that check, we still fail because the wasted space at
> > the end is causing a "record with zero length" error.
>
> Ugh. I'm beginning to think we ought to revert the patch that added the
> don't-split-across-files logic to XLogInsert; that seems to have broken
> more assumptions than I realized. That was added here:
>
> 2004-02-11 17:55 tgl
>
> * src/: backend/access/transam/xact.c,
> backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
> backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
> bin/pg_controldata/pg_controldata.c,
> bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
> include/access/xlog.h, include/access/xlogutils.h,
> include/pg_config_manual.h, include/catalog/pg_control.h,
> include/storage/smgr.h: Commit the reasonably uncontroversial parts
> of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
> segment file so that it can be reliably identified. Avoid
> splitting WAL records across segment files (this is not strictly
> necessary, but makes it simpler to incorporate the header records).
> Make WAL entries for file creation, deletion, and truncation (as
> foreseen but never implemented by Vadim). Also, add support for
> making XLOG_SEG_SIZE configurable at compile time, similarly to
> BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent
> smgr API changes. initdb is forced due to changes in pg_control
> contents.
>
> There are other ways to do this, for example we could treat the WAL page
> headers as variable-size, and stick the file labeling info into the
> first page's header instead of making it be a separate record. The
> separate-record way makes it easier to incorporate future additions to
> the file labeling info, but I don't really think it's critical to allow
> for that.
>

I think I've fixed it now...but wait 20

The problem was that a zero length XLOG_WASTED_SPACE record just fell
out of ReadRecord when it shouldn't have. By giving it a helping hand it
makes it through with pointers correctly set, and everything else was
already thought of in the earlier patch, so xlog_redo etc happens.

I'll update again in a few minutes....no point us both looking at this.

Best regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 14:00:35
Message-ID:	13342.1090332035@sss.pgh.pa.us

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
>> Ugh. I'm beginning to think we ought to revert the patch that added the
>> don't-split-across-files logic to XLogInsert; that seems to have broken
>> more assumptions than I realized.

> The problem was that a zero length XLOG_WASTED_SPACE record just fell
> out of ReadRecord when it shouldn't have. By giving it a helping hand it
> makes it through with pointers correctly set, and everything else was
> already thought of in the earlier patch, so xlog_redo etc happens.

Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
adding two more warts to the code to support it is sticking in my craw.
I'm thinking it would be cleaner to treat the extra labeling information
as an extension of the WAL page header.

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 14:19:40
Message-ID:	1090333179.28049.2905.camel@stromboli

On Tue, 2004-07-20 at 14:11, Simon Riggs wrote:
> On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > >> The quick and dirty solution would be to dike out the safety check at
> > >> 4268ff.
> >
> > > If you take out that check, we still fail because the wasted space at
> > > the end is causing a "record with zero length" error.
> >
> > Ugh. I'm beginning to think we ought to revert the patch that added the
> > don't-split-across-files logic to XLogInsert; that seems to have broken
> > more assumptions than I realized. That was added here:
> >
> > 2004-02-11 17:55 tgl
> >
> > * src/: backend/access/transam/xact.c,
> > backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
> > backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
> > bin/pg_controldata/pg_controldata.c,
> > bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
> > include/access/xlog.h, include/access/xlogutils.h,
> > include/pg_config_manual.h, include/catalog/pg_control.h,
> > include/storage/smgr.h: Commit the reasonably uncontroversial parts
> > of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
> > segment file so that it can be reliably identified. Avoid
> > splitting WAL records across segment files (this is not strictly
> > necessary, but makes it simpler to incorporate the header records).
> > Make WAL entries for file creation, deletion, and truncation (as
> > foreseen but never implemented by Vadim). Also, add support for
> > making XLOG_SEG_SIZE configurable at compile time, similarly to
> > BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent
> > smgr API changes. initdb is forced due to changes in pg_control
> > contents.
> >
> > There are other ways to do this, for example we could treat the WAL page
> > headers as variable-size, and stick the file labeling info into the
> > first page's header instead of making it be a separate record. The
> > separate-record way makes it easier to incorporate future additions to
> > the file labeling info, but I don't really think it's critical to allow
> > for that.
> >
>
> I think I've fixed it now...but wait 20
>
> The problem was that a zero length XLOG_WASTED_SPACE record just fell
> out of ReadRecord when it shouldn't have. By giving it a helping hand it
> makes it through with pointers correctly set, and everything else was
> already thought of in the earlier patch, so xlog_redo etc happens.
>
> I'll update again in a few minutes....no point us both looking at this.
>

This was a very confusing test...Here's what I think happened:

Mark discovered a numerological coincidence that meant that the
XLOG_WASTED_SPACE record was zero length at the end of EACH file he was
writing to, as long as there was just that one writer. So no matter how
many records were inserted, each xlog file had a zero length
XLOG_WASTED_SPACE record at its end.

ReadRecord failed on seeing a zero length record, i.e. when it got to
the FIRST of the XLOG_WASTED_SPACE records. That's why the test fails no
matter how many records you give it, as long as it was more than enough
to write into a second xlog segment file.

By telling ReadRecord that XLOG_WASTED_SPACE records of zero length are
in fact *OK*, it continues happily. (That's just a partial fix, see
later)
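
Roughly, the behaviour change looks like this toy reader in Python. This
is a sketch and not the attached patch: the replay function, the tuple
representation of records, and the record names other than
XLOG_WASTED_SPACE are assumptions made for illustration.

```python
WASTED_SPACE = "XLOG_WASTED_SPACE"


def replay(records, allow_empty_padding):
    """Return the record types applied before the reader decides the log has ended."""
    applied = []
    for rtype, payload in records:
        if rtype == WASTED_SPACE:
            if len(payload) == 0 and not allow_empty_padding:
                break          # old behaviour: zero length means end of log
            continue           # padding carries no changes: skip it, keep reading
        if len(payload) == 0:
            break              # any other zero-length record still ends the log
        applied.append(rtype)
    return applied


# One writer, with a zero-length padding record at the end of the first file.
log = [
    ("COPY", b"rows-1"),
    (WASTED_SPACE, b""),
    ("COPY", b"rows-2"),
    ("COMMIT", b"xid"),
]

before_fix = replay(log, allow_empty_padding=False)
after_fix = replay(log, allow_empty_padding=True)
print(before_fix)
print(after_fix)
```

Before the change the reader gives up at the first file boundary; after
it, replay carries on through to the commit.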

The test works, but gives what looks like strange results: the test
blows away the data directory completely, so the then-current xlog dies
too. That contained the commit for the large COPY, so even though the
recovery now works, the table has zero rows in it. (When things die
you're still likely to lose *some* data).

Anyway, so then we put the "concurrent transaction" test back in and the
test passes because we have now set the pointers correctly.

After all that, I think the wasted space idea is still sensible. You
mustn't have a continuation record across files, otherwise we'll end up
with half a commit one day, which would break ACID.

I'm happy that we have the explicit test in XLogInsert for zero-length
records. Somebody will one day write a resource manager with zero length
records when they didn't mean to and we need to catch that at write
time, not at recovery time like Mark has done. The WasteXLInsertBuffer
was the only part of the code that *can* write a zero-length record, so
we will *not* see another recurrence of this situation --at recovery
time--.

Though further concerns along this theme are:
- what happens when the space at the end of a file is so small we can't
even write a zero-length XLOG_WASTED_SPACE record? Hopefully, you're
gonna say "damn your eyes...couldn't you see that, it's already there".
- if the space at the end of a file was just zeros, then the "concurrent
transaction test" would still fail....we probably need to enhance this
to treat a few zeros at end of file AS IF it was an XLOG_WASTED_SPACE
record and continue. (That scenario would happen if we were doing a
recovery that included a local un-archived xlog that was very close to
being full - probably more likely to occur in crash recovery than
archive recovery)

The included patch doesn't attempt to address those issues, yet.

Best regards, Simon Riggs

Attachment	Content-Type	Size
zerolength.patch	text/x-patch	1.4 KB

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-20 14:24:16
Message-ID:	1090333455.28049.2917.camel@stromboli

On Tue, 2004-07-20 at 15:00, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Tue, 2004-07-20 at 13:51, Tom Lane wrote:
> >> Ugh. I'm beginning to think we ought to revert the patch that added the
> >> don't-split-across-files logic to XLogInsert; that seems to have broken
> >> more assumptions than I realized.
>
> > The problem was that a zero length XLOG_WASTED_SPACE record just fell
> > out of ReadRecord when it shouldn't have. By giving it a helping hand it
> > makes it through with pointers correctly set, and everything else was
> > already thought of in the earlier patch, so xlog_redo etc happens.
>
> Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
> adding two more warts to the code to support it is sticking in my craw.
> I'm thinking it would be cleaner to treat the extra labeling information
> as an extension of the WAL page header.

Sounds like a better solution than scrabbling around at the end of file
with too many edge cases to test properly

...over to you then...

Best Regards, Simon Riggs

From:	markw(at)osdl(dot)org
To:	tgl(at)sss(dot)pgh(dot)pa(dot)us, simon(at)2ndquadrant(dot)com
Cc:	pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-20 15:36:27
Message-ID:	200407201536.i6KFa7122392@mail.osdl.org
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On 18 Jul, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> Latest version, pitr_v5_2.patch...
>
> Reviewed and committed with some adjustments.

I pulled from CVS and got the following message when I tried starting
the database with the archive_mode parameter:

FATAL: unrecognized configuration parameter "archive_mode"

Have I missed something since it has been committed?

Mark

From:	Klaus Naumann <kn(at)mgnet(dot)de>
To:	markw(at)osdl(dot)org
Cc:	tgl(at)sss(dot)pgh(dot)pa(dot)us, simon(at)2ndquadrant(dot)com, pgman(at)candle(dot)pha(dot)pa(dot)us, kn(at)mgnet(dot)de, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-20 16:29:30
Message-ID:	Pine.LNX.4.58.0407201825340.930@spock.intra.net
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Tue, 20 Jul 2004 markw(at)osdl(dot)org wrote:

> FATAL: unrecognized configuration parameter "archive_mode"
>
> Have I missed something since it has been committed?

Yes, Tom has removed this option in favor of just setting
archive_command to a value which then enables the PITR code also.

But as I've seen this isn't discussed to the very end currently.

My 2ct: I'd prefer to have archive_mode in the config as it really makes
clear that this database is archiving. I fear users will not understand
that giving a program for archival will also enable the PITR function.

Greetings, Klaus

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn(at)mgnet(dot)de)

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Klaus Naumann <kn(at)mgnet(dot)de>
Cc:	markw(at)osdl(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-20 17:27:58
Message-ID:	1090344478.3377.8.camel@stromboli
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Tue, 2004-07-20 at 17:29, Klaus Naumann wrote:
> On Tue, 20 Jul 2004 markw(at)osdl(dot)org wrote:
>
> > FATAL: unrecognized configuration parameter "archive_mode"
> >
> > Have I missed something since it has been committed?
>
> Yes, Tom has removed this option in favor of just setting
> archive_command to a value which then enables the PITR code also.
>
> But as I've seen this isn't discussed to the very end currently.
>
> My 2ct: I'd prefer to have archive_mode in the config as it really makes
> clear that this database is archiving. I fear users will not understand
> that giving a program for archival will also enable the PITR function.
>

I do also think that option should go back in, just to be explicit.

A more important omission is the deletion of a message to indicate that
the server is acting in archive_mode....so there's no visual clue in the
log to warn an admin that it's been turned off now or incorrectly
specified (by somebody else, of course). (At least using the default log
mode).

Best Regards, Simon Riggs

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Klaus Naumann <kn(at)mgnet(dot)de>
Cc:	markw(at)osdl(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, simon(at)2ndquadrant(dot)com, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-20 22:09:29
Message-ID:	40FD9819.50401@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

I'd vote for it as a clarity factor too.

Klaus Naumann wrote:

>On Tue, 20 Jul 2004 markw(at)osdl(dot)org wrote:
>
>
>
>>FATAL: unrecognized configuration parameter "archive_mode"
>>
>>Have I missed something since it has been committed?
>>
>>
>
>Yes, Tom has removed this option in favor of just setting
>archive_command to a value which then enables the PITR code also.
>
>But as I've seen this isn't discussed to the very end currently.
>
>My 2ct: I'd prefer to have archive_mode in the config as it really makes
>clear that this database is archiving. I fear users will not understand
>that giving a program for archival will also enable the PITR function.
>
>Greetings, Klaus
>
>
>
>

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, simon(at)2ndquadrant(dot)com, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-21 01:41:37
Message-ID:	40FDC9D1.5060705@familyhealth.com.au
Lists:	pgsql-admin pgsql-hackers pgsql-patches

I'm in favour of how it is now, so long as the comment is clear. It's
the Unix Way :)

Chris

> I'd vote for it as a clarity factor too.
>
> Klaus Naumann wrote:
>
>> On Tue, 20 Jul 2004 markw(at)osdl(dot)org wrote:
>>
>>
>>
>>> FATAL: unrecognized configuration parameter "archive_mode"
>>>
>>> Have I missed something since it has been committed?
>>>
>>
>>
>> Yes, Tom has removed this option in favor of just setting
>> archive_command to a value which then enables the PITR code also.
>>
>> But as I've seen this isn't discussed to the very end currently.
>>
>> My 2ct: I'd prefer to have archive_mode in the config as it really makes
>> clear that this database is archiving. I fear users will not understand
>> that giving a program for archival will also enable the PITR function.
>>
>> Greetings, Klaus
>>
>>
>>
>>
>

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-21 02:27:49
Message-ID:	40FDD4A5.20006@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

FYI - I can confirm that the patch fixes the main issue.

Simon Riggs wrote:

>
>This was a very confusing test...Here's what I think happened:
>.....
>The included patch doesn't attempt to address those issues, yet.
>
>Best regards, Simon Riggs
>
>
>
>

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-21 02:39:18
Message-ID:	40FDD756.9030005@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

This is presumably a standard feature of any PITR design - if the
failure event destroys the current transaction log, then you can only
recover transactions that committed in the last *archived* log.

regards

Mark

Simon Riggs wrote:

>
>The test works, but gives what looks like strange results: the test
>blows away the data directory completely, so the then-current xlog dies
>too. That contained the commit for the large COPY, so even though the
>recovery now works, the table has zero rows in it. (When things die
>you're still likely to lose *some* data).
>
>
>
>
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-21 05:49:54
Message-ID:	4589.1090388994@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> A more important omission is the deletion of a message to indicate that
> the server is acting in archive_mode....so there's no visual clue in the
> log to warn an admin that it's been turned off now or incorrectly
> specified (by somebody else, of course). (At least using the default log
> mode).

Hmm, we are apparently not reading the same code. My copy shows

LOG: starting archive recovery
LOG: restore_command = "cp /home/postgres/testversion/archive/%f %p"
... blah blah ...
LOG: archive recovery complete

Which part of this is insufficiently clear?

regards, tom lane

From:	Klaus Naumann <kn(at)mgnet(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-21 07:49:01
Message-ID:	Pine.LNX.4.58.0407210946170.930@spock.intra.net
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Wed, 21 Jul 2004, Tom Lane wrote:

Hi Tom,

Simon doesn't mean the recovery part. Instead he means the "normal"
startup of the server. It has to be absolutely clear (in the logfile!) if
the server was started in archive mode or not. Otherwise you always have
to guess.
On server startup there should be a message like

LOG: Database started in archive mode

LOG: Archive mode is DISABLED

To get the user's attention.

Greetings, Klaus

> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > A more important omission is the deletion of a message to indicate that
> > the server is acting in archive_mode....so there's no visual clue in the
> > log to warn an admin that it's been turned off now or incorrectly
> > specified (by somebody else, of course). (At least using the default log
> > mode).
>
> Hmm, we are apparently not reading the same code. My copy shows
>
> LOG: starting archive recovery
> LOG: restore_command = "cp /home/postgres/testversion/archive/%f %p"
> ... blah blah ...
> LOG: archive recovery complete
>
> Which part of this is insufficiently clear?
>
> regards, tom lane
>
>

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn(at)mgnet(dot)de)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Klaus Naumann <kn(at)mgnet(dot)de>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-21 14:53:49
Message-ID:	8659.1090421629@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Klaus Naumann <kn(at)mgnet(dot)de> writes:
> Simon doesn't mean the recovery part. Instead he means the "normal"
> startup of the server. It has to be absolutely clear (in the logfile!) if
> the server was started in archive mode or not. Otherwise you always have
> to guess.

Why would you guess? "SHOW archive_command" will tell you, without
question, at any time. I don't see the point of placing such a message
in the postmaster log --- in normal circumstances the postmaster will
still be running long after its starting messages have been discarded
due to log rotation.

Also, the current implementation allows you to stop and start archiving
on-the-fly, so a start-time message would be an unreliable guide to what
the postmaster is actually doing at the moment.
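
Since archiving in this design is switched on purely by giving archive_command a value, the question "is this server archiving?" reduces to "is archive_command non-empty?". The sketch below shows that check against postgresql.conf-style text. It is a deliberately simplified parser written for this example (it assumes one `name = 'value'` setting per line and no trailing comments); the function names are hypothetical and it is not how the server reads its configuration.

```python
def archive_command_from_conf(conf_text: str) -> str:
    """Return the archive_command value found in conf_text, or '' if unset.

    Simplifying assumptions: one setting per line, written as
    name = 'value', with no trailing comment on the same line.
    """
    value = ""
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out settings
        name, sep, rest = line.partition("=")
        if sep and name.strip() == "archive_command":
            value = rest.strip().strip("'")  # a later setting overrides an earlier one
    return value


def archiving_enabled(conf_text: str) -> bool:
    """Archiving counts as enabled only when archive_command is non-empty."""
    return archive_command_from_conf(conf_text) != ""
```

On a running server, "SHOW archive_command" remains the authoritative answer, as noted above; a file-based check like this only reflects what the next reload or restart would pick up.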

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgman(at)candle(dot)pha(dot)pa(dot)us, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-21 20:00:49
Message-ID:	1090439667.2658.1260.camel@localhost.localdomain
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Wed, 2004-07-21 at 15:53, Tom Lane wrote:
> Klaus Naumann <kn(at)mgnet(dot)de> writes:
> > Simon doesn't mean the recovery part. Instead he means the "normal"
> > startup of the server. It has to be absolutely clear (in the logfile!) if
> > the server was started in archive mode or not. Otherwise you always have
> > to guess.
>
> Why would you guess? "SHOW archive_command" will tell you, without
> question, at any time. I don't see the point of placing such a message
> in the postmaster log --- in normal circumstances the postmaster will
> still be running long after its starting messages have been discarded
> due to log rotation.
>
> Also, the current implementation allows you to stop and start archiving
> on-the-fly, so a start-time message would be an unreliable guide to what
> the postmaster is actually doing at the moment.
>

Overall, this is a small point and I think we should leave Tom alone, to
focus on the bigger issues that we care about.

Tom has done an amazingly good job in the last few days of refactoring
some reasonably ugly code on my part, all without a murmur. I relent on
this to allow everything to be finished in time.

The PITR journey has just begun, so there will be further opportunity to
discuss and agree what constitutes real issues and then correct them.
This may not be on that list later.

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-21 22:43:55
Message-ID:	22835.1090449835@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Tue, 2004-07-20 at 15:00, Tom Lane wrote:
>> Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
>> adding two more warts to the code to support it is sticking in my craw.
>> I'm thinking it would be cleaner to treat the extra labeling information
>> as an extension of the WAL page header.

> Sounds like a better solution than scrabbling around at the end of file
> with too many edge cases to test properly

This is done in CVS tip. Mark, could you retest to verify it's fixed?

regards, tom lane

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-22 00:39:59
Message-ID:	40FF0CDF.4070509@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Looks good to me. Log file numbering scheme seems to have changed - is
that part of the fix too?

Tom Lane wrote:

>
>This is done in CVS tip. Mark, could you retest to verify it's fixed?
>
> regards, tom lane
>
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PITR COPY Failure (was Point in Time Recovery)
Date:	2004-07-22 00:43:07
Message-ID:	23762.1090456987@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> Looks good to me. Log file numbering scheme seems to have changed - is
> that part of the fix too?

That's for timelines ... it's not directly related but I thought I
should put in both changes at once to avoid forcing an extra initdb.
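
For readers unfamiliar with the new names: a segment file name such as 000000010000000000000002 is 24 hexadecimal digits, read here as three 8-digit fields (timeline ID, log file number, segment number). The Python sketch below splits a name on that assumption; the class and function names are invented for the example.

```python
import re
from typing import NamedTuple


class WalName(NamedTuple):
    timeline: int
    log: int
    segment: int


def parse_wal_name(name: str) -> WalName:
    """Split a 24-hex-digit WAL segment file name into its three fields."""
    if not re.fullmatch(r"[0-9A-Fa-f]{24}", name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    # Each field is 8 hex digits: timeline, log file number, segment number.
    return WalName(
        timeline=int(name[0:8], 16),
        log=int(name[8:16], 16),
        segment=int(name[16:24], 16),
    )


parsed = parse_wal_name("000000010000000000000002")
```

Under this reading, files that differ only in the leading 8 digits belong to different timelines, which is why the timeline prefix had to become part of the name.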

regards, tom lane

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 01:51:04
Message-ID:	40FF1D88.30404@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Here is one for the 'idiot proof' category:

1) initdb and set archive_command
2) shutdown
3) do a backup
4) startup and run some transactions
5) shutdown and remove PGDATA
6) restore backup
7) startup

Obviously this does not work as the backup is performed with the
database shutdown.

This got me wondering for 2 reasons:

1) Some alternative database servers *require* a procedure like this to
enable their version of PITR - so the potential foot-gun thing is there.

2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?

Simon Riggs wrote:

>
>I was hoping some fiendish plans would be presented to me...
>
>But please start with "this feels like typical usage" and we'll go from
>there...the important thing is to try the first one.
>
>I've not done power off tests, yet. They need to be done just to
>check...actually you don't need to do this to test PITR...
>
>We need exhaustive tests of...
>- power off
>- scp and cross network copies
>- all the permuted recovery options
>- archive_mode = off (i.e. current behaviour)
>- deliberately incorrectly set options (idiot-proof testing)
>
>
>
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 02:39:15
Message-ID:	2110.1090463955@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> Here is one for the 'idiot proof' category:
> 1) initdb and set archive_command
> 2) shutdown
> 3) do a backup
> 4) startup and run some transactions
> 5) shutdown and remove PGDATA
> 6) restore backup
> 7) startup

> Obviously this does not work as the backup is performed with the
> database shutdown.

Huh? It works fine.

The bit you may be missing is that if you blow away $PGDATA including
pg_xlog/, you won't be able to recover past whatever you have in your WAL
archive area. The archive is certainly not going to include the current
partially-filled WAL segment, and it might be missing a few earlier
segments if the archival process isn't speedy. So you need to keep
those recent segments in pg_xlog/ if you want to recover to current time
or near-current time.
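
To make the point concrete, here is a small Python sketch of how far replay can get given what is in the archive and what is still in pg_xlog/. Segments are modelled as plain consecutive integers and the function name is hypothetical; it is only a model of the reasoning above, not of the recovery code.

```python
from typing import Optional, Set


def last_recoverable_segment(first_needed: int,
                             archived: Set[int],
                             in_pg_xlog: Set[int]) -> Optional[int]:
    """Return the last segment number replay can reach, or None.

    Replay needs an unbroken run of segments starting at first_needed;
    each one may come from the archive or from pg_xlog/.
    """
    available = archived | in_pg_xlog
    last = None
    seg = first_needed
    while seg in available:
        last = seg
        seg += 1
    return last


# Archive holds segments 0-2; pg_xlog/ holds the unarchived 3 and the current 4.
with_pg_xlog = last_recoverable_segment(0, {0, 1, 2}, {3, 4})
# Same archive, but pg_xlog/ was removed along with $PGDATA.
without_pg_xlog = last_recoverable_segment(0, {0, 1, 2}, set())
```

In the second case recovery stops at the last archived segment, so everything written to the newer segments is lost.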

I'm becoming more and more convinced that we should bite the bullet and
move pg_xlog/ to someplace that is not under $PGDATA. It would just
make things a whole lot more reliable, both for backup and to deal with
scenarios like yours above. I tried to talk Bruce into this on the
phone the other day, but he wouldn't bite. I still think it's a good
idea though. It would
(1) eliminate the problem that a tar backup of $PGDATA would restore
stale copies of xlog segments, because the tar wouldn't include
pg_xlog in the first place.
(2) eliminate the problem that a naive "rm -rf $PGDATA" would blow away
xlog segments that you still need.

A possible compromise is that we should strongly suggest that pg_xlog
be pushed out to another place and symlinked if you are going to use
WAL archiving. That's already considered good practice for performance
if you have a separate disk spindle to put WAL on. It'll just have
to be good practice for WAL archiving too.
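
A minimal sketch of the "push out and symlink" arrangement, in Python. It assumes the server is stopped, that the new location does not exist yet, and that the platform supports symbolic links; the function name and paths are hypothetical.

```python
import os
import shutil


def relocate_pg_xlog(pgdata: str, new_location: str) -> None:
    """Move pgdata/pg_xlog to new_location and leave a symlink behind.

    Must only be run while the server is shut down.
    """
    old_path = os.path.join(pgdata, "pg_xlog")
    if os.path.islink(old_path):
        raise RuntimeError("pg_xlog is already a symlink")
    if os.path.exists(new_location):
        raise RuntimeError("new location already exists")
    # Move the directory and its segments, then point the old name at it.
    shutil.move(old_path, new_location)
    os.symlink(new_location, old_path)
```

With this layout the segments themselves live outside $PGDATA, so removing the contents of the data directory deletes only the link. Whether a tar backup follows the link or stores it as a link depends on the tar options used.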

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 03:06:22
Message-ID:	200407220306.i6M36MU11209@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

I think we should push the partially complete WAL file to the archive
location before shutdown. I talked to you or Jan about it and you (or
Jan) wouldn't bite either, but I think when someone shuts down, they
assume they have things fully archived and can recover fully with a
previous backup and the archive files.

When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK. Maybe we would
need to give it a special file extension so we only use it when we don't
have a full version.

---------------------------------------------------------------------------

Tom Lane wrote:
> Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> > Here is one for the 'idiot proof' category:
> > 1) initdb and set archive_command
> > 2) shutdown
> > 3) do a backup
> > 4) startup and run some transactions
> > 5) shutdown and remove PGDATA
> > 6) restore backup
> > 7) startup
>
> > Obviously this does not work as the backup is performed with the
> > database shutdown.
>
> Huh? It works fine.
>
> The bit you may be missing is that if you blow away $PGDATA including
> pg_xlog/, you won't be able to recover past whatever you have in your WAL
> archive area. The archive is certainly not going to include the current
> partially-filled WAL segment, and it might be missing a few earlier
> segments if the archival process isn't speedy. So you need to keep
> those recent segments in pg_xlog/ if you want to recover to current time
> or near-current time.
>
> I'm becoming more and more convinced that we should bite the bullet and
> move pg_xlog/ to someplace that is not under $PGDATA. It would just
> make things a whole lot more reliable, both for backup and to deal with
> scenarios like yours above. I tried to talk Bruce into this on the
> phone the other day, but he wouldn't bite. I still think it's a good
> idea though. It would
> (1) eliminate the problem that a tar backup of $PGDATA would restore
> stale copies of xlog segments, because the tar wouldn't include
> pg_xlog in the first place.
> (2) eliminate the problem that a naive "rm -rf $PGDATA" would blow away
> xlog segments that you still need.
>
> A possible compromise is that we should strongly suggest that pg_xlog
> be pushed out to another place and symlinked if you are going to use
> WAL archiving. That's already considered good practice for performance
> if you have a separate disk spindle to put WAL on. It'll just have
> to be good practice for WAL archiving too.
>
> regards, tom lane
>

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 03:12:21
Message-ID:	40FF3095.3030001@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Well that is interesting :_)

Here is what I am doing on the removal front (I am keeping pg_xlog *now*):

$ cd $PGDATA
$ pg_ctl stop
$ ls|grep -v pg_xlog|xargs rm -rf

The contents of the archive directory just before recovery starts:

$ ls -l $PGDATA/../7.5-archive
total 49212
-rw------- 1 postgres postgres 16777216 Jul 22 14:59
000000010000000000000000
-rw------- 1 postgres postgres 16777216 Jul 22 14:59
000000010000000000000001
-rw------- 1 postgres postgres 16777216 Jul 22 14:59
000000010000000000000002

But here is recovery startup log:

LOG: database system was shut down at 2004-07-22 14:58:57 NZST
LOG: starting archive recovery
LOG: restore_command = "cp /data1/pgdata/7.5-archive/%f %p"
cp: cannot stat `/data1/pgdata/7.5-archive/00000001.history': No such
file or directory
LOG: restored log file "000000010000000000000000" from archive
LOG: checkpoint record is at 0/A4D3E8
LOG: redo record is at 0/A4D3E8; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 496; next OID: 17229
LOG: archive recovery complete
LOG: database system is ready
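
For reference, the restore_command shown in the log is a template: %f stands for the name of the file being requested from the archive and %p for the path the server wants it copied to. The Python sketch below performs that substitution; the function name and the destination path used in the example are hypothetical.

```python
import re


def expand_restore_command(template: str, wal_file: str, dest_path: str) -> str:
    """Substitute %f (file name in the archive) and %p (destination path)."""
    substitutions = {"%f": wal_file, "%p": dest_path}
    # A single regex pass avoids re-scanning text that was already substituted.
    return re.sub(r"%[fp]", lambda m: substitutions[m.group(0)], template)


command = expand_restore_command(
    "cp /data1/pgdata/7.5-archive/%f %p",
    "000000010000000000000000",
    "pg_xlog/RECOVERYXLOG",  # hypothetical destination chosen for the example
)
```

The "cannot stat ... 00000001.history" line in the log above is the same template being run with a history file name as %f, for a file that was not present in the archive.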

regards

Mark

Tom Lane wrote:

>
>Huh? It works fine.
>
>The bit you may be missing is that if you blow away $PGDATA including
>pg_xlog/, you won't be able to recover past whatever you have in your WAL
>archive area. The archive is certainly not going to include the current
>partially-filled WAL segment, and it might be missing a few earlier
>segments if the archival process isn't speedy. So you need to keep
>those recent segments in pg_xlog/ if you want to recover to current time
>or near-current time.
>
>
>
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 03:29:08
Message-ID:	2534.1090466948@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> I think we should push the partially complete WAL file to the archive
> location before shutdown. ...
> When you are running and finally fill up the WAL file it would then
> overwrite the one in the archive but I think that is OK.

I don't think this can fly at all. Here are some off-the-top-of-the-head
objections:

1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.

2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.

3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.

I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.
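
One way to picture the simple contract argued for here is an archiver that stores each completed segment exactly once and never overwrites. The Python sketch below does this for a directory-based archive only (it says nothing about tape or CD); the function name and the ".tmp" suffix are assumptions made for the example.

```python
import os
import shutil


def archive_once(src: str, archive_dir: str) -> bool:
    """Copy src into archive_dir unless that file name is already archived.

    Returns True if the file was archived, False if it was already present.
    """
    dest = os.path.join(archive_dir, os.path.basename(src))
    if os.path.exists(dest):
        return False  # never overwrite an archived segment
    tmp = dest + ".tmp"
    shutil.copyfile(src, tmp)
    # Rename last, so a partially copied file never appears under the final name.
    os.rename(tmp, dest)
    return True
```

Copying to a temporary name first also speaks to the first objection: an interrupted copy leaves a stray ".tmp" file behind rather than a truncated segment that looks valid.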

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 03:36:05
Message-ID:	200407220336.i6M3a5l18707@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Agreed, it might not be possible, but your report does point out a
limitation in our implementation --- that a shutdown database contains
more information than a backup and the archive logs. That is not
intuitive.

In fact, if you shut down your database and want to reproduce it on
another machine, how do you do it? Seems you have to copy pg_xlog
directory over to the new machine.

In fact, moving pg_xlog to a new location doesn't make that clear
either. Seems documentation might be the only way to make this clear.

One idea would be to just push the partial WAL file to the archive on
server shutdown and not reuse it and start with a new WAL file on
startup. At least for a normal system shutdown this will give us an
archive that contains all the information that is in pg_xlog.

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > I think we should push the partially complete WAL file to the archive
> > location before shutdown. ...
> > When you are running and finally fill up the WAL file it would then
> > overwrite the one in the archive but I think that is OK.
>
> I don't think this can fly at all. Here are some off-the-top-of-the-head
> objections:
>
> 1. We don't have the luxury of spending indefinite amounts of time to
> do a database shutdown. Commonly we are under a twenty-second sentence
> of death from init. I don't want to spend the 20 seconds waiting to see
> if the archiver will manage to push 16MB onto a slow tape drive. Also,
> if the archiver does fail to push the data in time, it'll likely leave a
> broken (partial) xlog file in the archive, which would be really bad
> news if the user then relies on that.
>
> 2. What if the archiver process entirely fails to push the file? (Maybe
> there's not enough disk space, for instance.) In normal operation we'll
> just retry every so often. We definitely can't do that during shutdown.
>
> 3. You're blithely assuming that the archival process can easily provide
> overwrite semantics for multiple pushes of the same xlog filename. Stop
> thinking about "cp to some directory" and start thinking "dump to tape"
> or "burn onto CD" or something like that. We'll be raising the ante
> considerably if we require the archive_command to deal with this.
>
> I think the last one is really the most significant issue. We have to
> keep the archiver API as simple as possible.
>
> regards, tom lane
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 03:42:22
Message-ID:	2654.1090467742@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Agreed, it might not be possible, but your report does point out a
> limitation in our implementation --- that a shutdown database contains
> more information than a backup and the archive logs. That is not
> intuitive.

That's only because you are clinging to the broken assumption that
pg_xlog/ is part of the database, rather than part of the logs.
Separate that out as a distinct entity, and all gets better.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 03:54:48
Message-ID:	200407220354.i6M3smv21365@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Agreed, it might not be possible, but your report does point out a
> > limitation in our implementation --- that a shutdown database contains
> > more information than a backup and the archive logs. That is not
> > intuitive.
>
> That's only because you are clinging to the broken assumption that
> pg_xlog/ is part of the database, rather than part of the logs.
> Separate that out as a distinct entity, and all gets better.

Imagine this. I stop the server. I have a tar backup and a copy of
the archive. I should be able to take them to another machine and
recover the system to the point I stopped.

You are saying I need a copy of pg_xlog directory too, and I need to
remove pg_xlog after I untar the data directory and put the saved
pg_xlog into there before I recover.

Should we create a server-side function that forces all WAL files to the
archive, including partially written ones? Maybe that fixes the problem
with people deleting pg_xlog before they untar. You tell them to run
the function before recovery. If the system can't be started, then it is
possible the WAL files are no good too, not sure.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 07:43:18
Message-ID:	1090482198.2660.4.camel@localhost.localdomain
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Thu, 2004-07-22 at 04:29, Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > I think we should push the partially complete WAL file to the archive
> > location before shutdown. ...
> > When you are running and finally fill up the WAL file it would then
> > overwrite the one in the archive but I think that is OK.
>
> I don't think this can fly at all. Here are some off-the-top-of-the-head
> objections:
>
> 1. We don't have the luxury of spending indefinite amounts of time to
> do a database shutdown. Commonly we are under a twenty-second sentence
> of death from init. I don't want to spend the 20 seconds waiting to see
> if the archiver will manage to push 16MB onto a slow tape drive. Also,
> if the archiver does fail to push the data in time, it'll likely leave a
> broken (partial) xlog file in the archive, which would be really bad
> news if the user then relies on that.
>
> 2. What if the archiver process entirely fails to push the file? (Maybe
> there's not enough disk space, for instance.) In normal operation we'll
> just retry every so often. We definitely can't do that during shutdown.
>
> 3. You're blithely assuming that the archival process can easily provide
> overwrite semantics for multiple pushes of the same xlog filename. Stop
> thinking about "cp to some directory" and start thinking "dump to tape"
> or "burn onto CD" or something like that. We'll be raising the ante
> considerably if we require the archive_command to deal with this.
>
> I think the last one is really the most significant issue. We have to
> keep the archiver API as simple as possible.
>

Not read whole chain of conversation...but this idea came up before and
was rejected then. I agree with the 3 objections to that thought above.

There are already enough copies of full xlogs around to worry about.

If you need more granularity, reduce size of xlog files....

(Tom, SUID would be the correct timeline id in that situation? )

More later, Simon Riggs
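
Tom's third objection suggests an archiver contract in which each xlog segment is pushed exactly once, when it is full, and is never overwritten. A minimal Python sketch of such a push-once step follows. The helper name `archive_segment` and the temporary-file naming are assumptions made for illustration, not part of the server or of any real archive_command.

```python
import os
import shutil
import tempfile


def archive_segment(src_path: str, archive_dir: str) -> bool:
    """Copy one completed xlog segment into archive_dir (hypothetical helper).

    Returns True on success and False if a file of that name is already
    archived, so a segment is never pushed twice or overwritten.
    """
    dest = os.path.join(archive_dir, os.path.basename(src_path))
    if os.path.exists(dest):
        # Refuse to overwrite: media such as tape or CD could not do it.
        return False
    tmp = dest + ".tmp"
    shutil.copyfile(src_path, tmp)  # write under a temporary name first
    os.rename(tmp, dest)            # then publish under the final name
    return True
```

Writing under a temporary name first means a push that is interrupted does not leave a partial file under the real segment name, which is the failure Tom worries about in his first objection. The existence check and the rename are not a single atomic step, so this is only a sketch for a single archiver process.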

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 20:19:53
Message-ID:	12173.1090527593@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> 2) Is it possible to make the recovery kick in even though pg_control
> says the database state is shutdown?

Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.

regards, tom lane
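
The rule Tom describes can be sketched roughly as below. This is an illustrative Python sketch, not the server's actual C code: the function name `must_scan_wal` and the state strings standing in for the contents of pg_control are assumptions.

```python
import os
import tempfile


def must_scan_wal(data_dir: str, control_state: str) -> bool:
    """Decide whether startup should scan WAL (hypothetical helper).

    control_state stands in for the state recorded in pg_control;
    the string values used here are assumptions for illustration.
    """
    # Presence of recovery.conf forces a WAL scan, whatever pg_control says.
    if os.path.exists(os.path.join(data_dir, "recovery.conf")):
        return True
    # Otherwise a scan is needed only if the last shutdown was not clean.
    return control_state != "shutdown"
```

With this rule a cold backup taken after a clean shutdown is recovered the same way as a hot backup: dropping recovery.conf into the data directory is enough to trigger the scan.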

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 23:08:09
Message-ID:	410048D9.8020707@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Excellent - Just updated and it is all good!

This change makes the whole "how do I do my backup" business nice and
basic - which is the right way IMHO.

regards

Mark

Tom Lane wrote:

>Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
>
>
>>2) Is it possible to make the recovery kick in even though pg_control
>>says the database state is shutdown?
>>
>>
>
>Yeah, I think you are right: presence of recovery.conf should force a
>WAL scan even if pg_control claims it's shut down. Fix committed.
>
> regards, tom lane
>
>

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-22 23:48:34
Message-ID:	1090540113.3057.17.camel@localhost.localdomain
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Thu, 2004-07-22 at 21:19, Tom Lane wrote:
> Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
> > 2) Is it possible to make the recovery kick in even though pg_control
> > says the database state is shutdown?
>
> Yeah, I think you are right: presence of recovery.conf should force a
> WAL scan even if pg_control claims it's shut down. Fix committed.
>

This *should* be possible but I haven't tested it.

There is a code path on secondary checkpoints that indicates that crash
recovery can occur even when the database was shutdown, since the code
forces recovery whether it was or not. On that basis, this may work, but
is yet untested. I didn't mention this because it might interfere with
getting hot backup to work...

Best Regards, Simon Riggs

From:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-23 00:05:13
Message-ID:	41005639.9040107@coretech.co.nz
Lists:	pgsql-admin pgsql-hackers pgsql-patches

I have tested the "cold" backup - and retested my previous scenarios
using "hot" backup (just to be sure). They all work AFAICS!

cheers

Mark

Simon Riggs wrote:

>On Thu, 2004-07-22 at 21:19, Tom Lane wrote:
>
>
>>Mark Kirkwood <markir(at)coretech(dot)co(dot)nz> writes:
>>
>>
>>>2) Is it possible to make the recovery kick in even though pg_control
>>>says the database state is shutdown?
>>>
>>>
>>Yeah, I think you are right: presence of recovery.conf should force a
>>WAL scan even if pg_control claims it's shut down. Fix committed.
>>
>>
>>
>
>This *should* be possible but I haven't tested it.
>
>There is a code path on secondary checkpoints that indicates that crash
>recovery can occur even when the database was shutdown, since the code
>forces recovery whether it was or not. On that basis, this may work, but
>is yet untested. I didn't mention this because it might interfere with
>getting hot backup to work...
>
>Best Regards, Simon Riggs
>
>
>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-23 02:01:24
Message-ID:	14500.1090548084@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Thu, 2004-07-22 at 21:19, Tom Lane wrote:
>> Yeah, I think you are right: presence of recovery.conf should force a
>> WAL scan even if pg_control claims it's shut down. Fix committed.

> This *should* be possible but I haven't tested it.

I did.

It's really not risky. The fact that the code doesn't look beyond the
checkpoint record when things seem to be kosher is just a speed
optimization (and probably a rather pointless one...) We have got to be
able to detect the end of WAL in any case, so we'd just find there are
no more records and stop.

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-24 20:35:10
Message-ID:	1090701310.3057.118.camel@localhost.localdomain
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On Fri, 2004-07-23 at 01:05, Mark Kirkwood wrote:
> I have tested the "cold" backup - and retested my previous scenarios
> using "hot" backup (just to be sure). They all work AFAICS!

> cheers

Yes, I'll drink to that! Thanks for your help.

Best Regards, Simon Riggs

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-28 16:21:47
Message-ID:	200407281621.i6SGLlD28030@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

I do think we need a boolean for start/stop of archiving, rather than
setting it to '' to turn it off. Tom, I think the group agreed to this
on clarity grounds. I would like the server to throw an error if you
try to turn on archiving and the command is set to ''.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Wed, 2004-07-21 at 15:53, Tom Lane wrote:
> > Klaus Naumann <kn(at)mgnet(dot)de> writes:
> > > Simon doesn't mean the recovery part. Instead he means the "normal"
> > > startup of the server. It has to be absolutely clear (in the logfile!) if
> > > the server was started in archive mode or not. Otherwise you always have
> > > to guess.
> >
> > Why would you guess? "SHOW archive_command" will tell you, without
> > question, at any time. I don't see the point of placing such a message
> > in the postmaster log --- in normal circumstances the postmaster will
> > still be running long after its starting messages have been discarded
> > due to log rotation.
> >
> > Also, the current implementation allows you to stop and start archiving
> > on-the-fly, so a start-time message would be an unreliable guide to what
> > the postmaster is actually doing at the moment.
> >
>
> Overall, this is a small point and I think we should leave Tom alone, to
> focus on the bigger issues that we care about.
>
> Tom has done an amazingly good job in the last few days of refactoring
> some reasonably ugly code on my part, all without a murmur. I relent on
> this to allow everything to be finished in time.
>
> The PITR journey has just begun, so there will be further opportunity to
> discuss and agree what constitutes real issues and then correct them.
> This may not be on that list later.
>
> Best Regards, Simon Riggs
>
>

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-28 16:25:36
Message-ID:	200407281625.i6SGPak28629@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Here is another open PITR issue that I think will have to be addressed
in 7.6. If you do a critical transaction, but do nothing else for eight
hours, that critical transaction hasn't been archived yet. It is still
sitting in pg_xlog until the WAL file fills.

I think we will need to document this behavior and address it in some
way in 7.6. We can't assume that we can send multiple copies of pg_xlog
to the archive (partial and full ones) because we might be going to a
tape drive. However, this is a non-intuitive behavior of our archiver.
We might need to tell people to archive the most recent WAL file every
minute to some other location or something.
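
The stopgap in the last sentence could look something like the sketch below, run once a minute from a scheduler. It is a rough Python illustration only: the helper name `copy_newest_segment` is hypothetical, and treating "most recently modified file" as "the WAL file currently being written" is an assumption. A file copied while the server is still writing to it is a partial segment and may be inconsistent.

```python
import os
import shutil
import tempfile


def copy_newest_segment(xlog_dir: str, side_dir: str):
    """Copy the most recently modified file in xlog_dir to side_dir.

    Returns the destination path, or None if xlog_dir has no regular files.
    """
    candidates = [
        os.path.join(xlog_dir, name)
        for name in os.listdir(xlog_dir)
        if os.path.isfile(os.path.join(xlog_dir, name))
    ]
    if not candidates:
        return None
    newest = max(candidates, key=os.path.getmtime)
    dest = os.path.join(side_dir, os.path.basename(newest))
    shutil.copy2(newest, dest)  # replaces any earlier partial copy
    return dest
```

Because each run overwrites the previous partial copy, the side location has to be ordinary disk; this does not help when the only archive target is tape.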

---------------------------------------------------------------------------

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	Re: [ADMIN] Point in Time Recovery
Date:	2004-07-28 16:27:43
Message-ID:	200407281627.i6SGRhA28968@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

[ Sorry, sent to hackers now.]

I think we will need to document this behavior and address it in some
way in 7.6. We can't assume that we can send multiple copies of pg_xlog
to the archive (partial and full ones) because we might be going to a
tape drive. However, this is a non-intuitive behavior of our archiver.
We might need to tell people to copy the most recent WAL file every
minute to some other location or something.

---------------------------------------------------------------------------

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-28 16:58:52
Message-ID:	23608.1091033932@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> I do think we need a boolean for start/stop of archiving, rather than
> setting it to '' to turn it off. Tom, I think the group agreed to this
> on clarity grounds.

I didn't see any consensus there, nor do I see a point to it.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-29 06:10:18
Message-ID:	200407290610.i6T6AIS07458@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > I do think we need a boolean for start/stop of archiving, rather than
> > setting it to '' to turn it off. Tom, I think the group agreed to this
> > on clarity grounds.
>
> I didn't see any consensus there, nor do I see a point to it.

I saw a lot of people saying it was a good idea, and only you saying it
was a bad idea.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-29 17:07:51
Message-ID:	200407291707.i6TH7pK15909@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian wrote:
>
> I do think we need a boolean for start/stop of archiving, rather than
> setting it to '' to turn it off. Tom, I think the group agreed to this
> on clarity grounds. I would like the server to throw an error if you
> try to turn on archiving and the command is set to ''.

Let me illustrate. To turn off archiving you have to change:

#archive_command = ''
archive_command = 'cp %p /mnt/server/archivedir/%f'

to

archive_command = ''
#archive_command = 'cp %p /mnt/server/archivedir/%f'

and if you comment both or neither, you have problems.

With a boolean it would be:

archive_mode = on
archive_command = 'cp %p /mnt/server/archivedir/%f'

archive_mode = off
archive_command = 'cp %p /mnt/server/archivedir/%f'

Now, if you say people will rarely turn archiving on/off, then one
parameter seems to make more sense.
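
The two-parameter variant, together with the error asked for in the quoted paragraph, amounts to a consistency check along these lines. This is a Python sketch of the proposal only; `archive_mode` is the boolean suggested above, and the function name `check_archive_settings` is made up for illustration.

```python
def check_archive_settings(archive_mode: bool, archive_command: str) -> bool:
    """Return True if archiving is active; raise on the inconsistent case.

    Models the proposal: a boolean switch plus a command string that is
    left in place when archiving is switched off.
    """
    if archive_mode and not archive_command.strip():
        # Archiving requested but nothing to run: refuse rather than guess.
        raise ValueError("archive_mode is on but archive_command is empty")
    return archive_mode
```

With a single parameter there is nothing to cross-check, which is the simplicity argument for keeping only archive_command.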

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-29 17:16:05
Message-ID:	15535.1091121365@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Now, if you say people will rarely turn archiving on/off, then one
> parameter seems to make more sense.

I really can't envision a situation where people would do that. If you
need PITR at all then you need it 24x7.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-07-29 17:33:35
Message-ID:	200407291733.i6THXZP20332@candle.pha.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Now, if you say people will rarely turn archiving on/off, then one
> > parameter seems to make more sense.
>
> I really can't envision a situation where people would do that. If you
> need PITR at all then you need it 24x7.

OK, then we are OK. If we find that isn't true, we can reevaluate.

From:	"Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
To:	"Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Klaus Naumann" <kn(at)mgnet(dot)de>, <markw(at)osdl(dot)org>, <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-08-14 00:50:22
Message-ID:	NOEFLCFHBPDAFHEIPGBOCEHDCCAA.simon@2ndquadrant.com
Lists:	pgsql-admin pgsql-hackers pgsql-patches

> Tom Lane wrote:
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > Now, if you say people will rarely turn archiving on/off, then one
> > > parameter seems to make more sense.
> >
> > I really can't envision a situation where people would do that. If you
> > need PITR at all then you need it 24x7.
>
I agree. The second parameter is only there to clarify the intent.

8.0 does introduce two good reasons to turn it on/off, however:
- index build speedups
- COPY speedups

I would opt to make enabling/disabling archive_command require a postmaster
restart. That way there would be no capability to take advantage of the
incentive to turn it on/off.

For TODO:

It would be my intention (in 8.1) to make those available via switches e.g.
NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
advantage of the no logging optimization without turning off PITR system
wide. (Just as this is possible in Oracle and Teradata).

I would also aim to make the first Insert Select into an empty table not
logged (optionally). This is an important optimization for Oracle, Teradata
and DB2 (which uses NOT LOGGED INITIALLY).

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
Cc:	"Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Klaus Naumann" <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-08-14 02:29:09
Message-ID:	446.1092450549@sss.pgh.pa.us
Lists:	pgsql-admin pgsql-hackers pgsql-patches

"Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com> writes:
> I would opt to make enabling/disabling archive_command require a postmaster
> restart. That way there would be no capability to take advantage of the
> incentive to turn it on/off.

We're generally not in the habit of making GUC parameters more rigid
than the implementation absolutely requires.

> It would be my intention (in 8.1) to make those available via switches e.g.
> NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
> advantage of the no logging optimization without turning off PITR system
> wide. (Just as this is possible in Oracle and Teradata).

Isn't this in direct conflict with your opinion above? And I cannot say
that I think this one is a good idea. We do not have support for
selective catalog xlogging; if you do something like this then you
*will* have a broken database after recovery, because it will contain
those indexes but with invalid contents.

> I would also aim to make the first Insert Select into an empty table not
> logged (optionally). This is an important optimization for Oracle, teradata
> and DB2 (which uses NOT LOGGED INITIALLY).

This is even worse: not only do you have a broken database, but you have
no way to recover. (At least with an unlogged index you could fix it by
REINDEX.) If you don't care about longevity of the table, then make it
a temp table.

The fact that Oracle does it does not automatically make it a good idea.

regards, tom lane

From:	"Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Klaus Naumann" <kn(at)mgnet(dot)de>, <markw(at)osdl(dot)org>, "Pgsql-Hackers(at)Postgresql(dot) Org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: NOT LOGGED options (was Point in Time Recovery )
Date:	2004-08-14 09:39:23
Message-ID:	NOEFLCFHBPDAFHEIPGBOAEHPCCAA.simon@2ndquadrant.com
Lists:	pgsql-admin pgsql-hackers pgsql-patches

> Tom Lane wrote
> "Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com> writes:
> > It would be my intention (in 8.1) to make those available via
> switches e.g.
> > NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
> > advantage of the no logging optimization without turning off PITR system
> > wide. (Just as this is possible in Oracle and Teradata).
>
> Isn't this in direct conflict with your opinion above? And I cannot say
> that I think this one is a good idea. We do not have support for
> selective catalog xlogging; if you do something like this then you
> *will* have a broken database after recovery, because it will contain
> those indexes but with invalid contents.

No, it's not in direct conflict. Turning OFF archive_mode would have a system
wide effect. The options described allow individual applications to make a
choice about whether certain very large operations are recoverable, or not.
I don't ever personally want to turn off system wide PITR, but there will be
times when I choose to avoid overhead on individual ops when the situation
dictates. This goes with your oft-mentioned dislike of systems that think
they know better than you do...

The first two optimizations have been included in 8.0 when archive_mode is
off. If there is a problem, then it will affect crash recovery of those
systems also. I suggest using exactly this optimisation, though under user
(application) control, rather than sysadmin control.

The challenges you mention have a solution. I wanted to add these to TODO,
not yet to discuss detailed implementation.

> > I would also aim to make the first Insert Select into an empty table not
> > logged (optionally). This is an important optimization for
> Oracle, teradata
> > and DB2 (which uses NOT LOGGED INITIALLY).
>
> This is even worse: not only do you have a broken database, but you have
> no way to recover. (At least with an unlogged index you could fix it by
> REINDEX.) If you don't care about longevity of the table, then make it
> a temp table.
>

It is frequently possible to use that route, though the option remains in
frequent use in other situations.

> The fact that Oracle does it does not automatically make it a good idea.
>

Amen to that. You will note that unless compatibility has been a
requirement, there have been times I have not followed the Oracle path, e.g.
PITR design.

I admit it must seem strange that I tried so hard to put PITR in place, only
to suggest removing it, optionally...

Overall, the options I describe here have been in production use in major
enterprise Data Warehouse systems for almost 15 years now. Oracle and DB2
copied the original Teradata implementation; slowly, because they too
didn't quickly or easily accept the wisdom. There is absolutely no doubt of
the true value of these optimisations - the TPC-H tests for all vendors make
use of those (hidden in the details of which load utility options are used,
or simply the default behaviour).

Logging only has value when the mean time to recover is low enough to make
recovery worthwhile. This can catch you in a bind because you have to decide
whether to reduce MTTR at the expense of 100% data recovery. For some big
systems, recovery is only an option if you exclude the biggest table(s). In
a Data Warehouse, where data is loaded in large volumes, it may only be
feasible to load it when you have this optimisation. In a recovery
situation, re-loading the largest fact tables from their original source
data files is more likely to be the best option, or in some cases, skipped
entirely in favour of loading new data.

I don't claim that everybody would want this, only that it is an extremely
beneficial optimisation for many very large databases - which is much of my
focus.

You've pointed out that I'm new "round here", which is certainly true - but
I have been many places... There are and will be many differences in
thinking that emerge from this; I regard all of this as synergy, not
argument.

Best Regards, Simon Riggs

From:	Manfred Spraul <manfred(at)colorfullife(dot)com>
To:	"Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Klaus Naumann <kn(at)mgnet(dot)de>, markw(at)osdl(dot)org, "Pgsql-Hackers(at)Postgresql(dot) Org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: NOT LOGGED options (was Point in Time Recovery )
Date:	2004-08-18 16:52:42
Message-ID:	4123895A.6060203@colorfullife.com
Lists:	pgsql-admin pgsql-hackers pgsql-patches

Simon(at)2ndquadrant(dot)com wrote:

>>Tom Lane wrote
>>
>>
>>>NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
>>>advantage of the no logging optimization without turning off PITR system
>>>wide. (Just as this is possible in Oracle and Teradata).
>>>
>>>
>>Isn't this in direct conflict with your opinion above? And I cannot say
>>that I think this one is a good idea. We do not have support for
>>selective catalog xlogging;
>>
Is it possible to skip the xlog fsync for NOT LOGGED transactions?

--
Manfred

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Manfred Spraul" <manfred(at)colorfullife(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Klaus Naumann" <kn(at)mgnet(dot)de>, <markw(at)osdl(dot)org>, "Pgsql-Hackers(at)Postgresql(dot) Org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: NOT LOGGED options (was Point in Time Recovery )
Date:	2004-08-18 21:28:12
Message-ID:	NOEFLCFHBPDAFHEIPGBOOENJCCAA.simon@2ndquadrant.com
Lists:	pgsql-admin pgsql-hackers pgsql-patches

> Manfred Spraul
> Simon(at)2ndquadrant(dot)com wrote:
>
> >>Tom Lane wrote
> >>
> >>
> >>>NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
> >>>advantage of the no logging optimization without turning off
> PITR system
> >>>wide. (Just as this is possible in Oracle and Teradata).
> >>>
> >>>
> >>Isn't this in direct conflict with your opinion above? And I cannot say
> >>that I think this one is a good idea. We do not have support for
> >>selective catalog xlogging;
> >>
> Is it possible to skip the xlog fsync for NOT LOGGED transactions?
>

Hmm...good thinking...however,

For very large operations, it's the volume of the xlog writes that's the
problem, not the fsync of the logs. The type of things I'm thinking about
are large CREATE INDEX and large COPY operations, for very large tasks,
i.e. more than 1 GB. These are most useful in data warehousing operations,
which is about 20% of the user base according to the survey stats from
www.postgresql.org.

The wal buffer only gets synced at end of transaction, or when the buffer is
full. On long operations there is still only one commit, so not fsyncing
there won't gain much. The buffer will fill up repeatedly and require
flushing - which you can't really skip because when you get to the commit
you need to be certain that everything is down to disk - there's not much
point fsyncing the commit if the previous wal records haven't been.

If there was a way to tell whether a block in the wal buffer had been
written by a NOT LOGGED transaction, then it might be possible to vary the
fsync behaviour accordingly. That's a good idea if that's what you meant,
though it would mean changing some critical, well tested code that every wal
record goes through. I'd rather simply not write wal at all for the certain
specific situations when the user requests it - there are already decision
points in the code for both the situations I've mentioned, since these have
been optimised in 8.0 for when archive_command has not been set. It would be
a simple matter to add in a check at that point.

Anyway...this is probably 8.1 stuff now.

Best Regards, Simon Riggs
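
The extra check Simon has in mind at those decision points can be sketched as follows. This is a Python illustration of the proposal as described in this message, not existing server code; the function name `should_write_wal` and both flags are hypothetical. It does not address Tom's objection that an object built without WAL will be invalid after recovery.

```python
def should_write_wal(archiving_enabled: bool, not_logged_requested: bool) -> bool:
    """Decide whether a bulk operation emits WAL (hypothetical helper)."""
    # Behaviour described for 8.0: these bulk operations skip WAL
    # only when archive_command has not been set.
    if not archiving_enabled:
        return False
    # Proposed extra check: also skip when the user asked for NOT LOGGED.
    return not not_logged_requested
```

The point of the sketch is only that the user-level option would reuse the same branch as the existing system-wide test, rather than touching the code path that every WAL record goes through.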

From:	JEDIDIAH <jedi(at)nomad(dot)mishnet>
To:	pgsql-admin(at)postgresql(dot)org
Subject:	Re: [HACKERS] Point in Time Recovery
Date:	2004-09-30 15:54:50
Message-ID:	slrnclobg6.8ae.jedi@nomad.mishnet
Lists:	pgsql-admin pgsql-hackers pgsql-patches

On 2004-07-28, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
>
> Here is another open PITR issue that I think will have to be addressed
> in 7.6. If you do a critical transaction, but do nothing else for eight
> hours, that critical transaction hasn't been archived yet. It is still
> sitting in pg_xlog until the WAL file fills.
>
> I think we will need to document this behavior and address it in some
> way in 7.6. We can't assume that we can send multiple copies of pg_xlog
> to the archive (partial and full ones) because we might be going to a

If a particular transaction is so important that it absolutely
positively needs to be archived offline for PITR, then why not just mark
it that way or allow for the application to trigger archival of this
critical REDO?

> tape drive. However, this is a non-intuitive behavior of our archiver.
> We might need to tell people to archive the most recent WAL file every
> minute to some other location or something.

[deletia]

--
Negligence will never equal intent, no matter how you
attempt to distort reality to do so. This is what separates |||
the real butchers from average Joes (or Fritzes) caught up in / | \
events not in their control.