Re: PITR - Some data is not recovered.

Lists: pgsql-hackers
From: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PITR - Some data is not recovered.
Date: 2004-08-09 08:52:10
Message-ID: 200408091752.10131.okada.satoshi@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I'm testing PITR using pgbench and postgresql ver.8.0bata.
I think that result of recovery is wrong. My test procedure
is as follows:
--------------------
I edited postgresql.conf for PITR and started the postmaster.
And I executed "pgbench -t 20000".
% pgbench -t 20000

I did backup procedure before end of pgbench.
% psql -c "SELECT pg_start_backup('label1')"
% cd $PGDATA
% tar cf /tmp/back.tar ./*
% psql -c "SELECT pg_stop_backup()"

restore and recovery
% pg_ctl stop
% cp -R $PGDATA/pg_xlog /tmp/.
% rm -rf $PGDATA/*
% cd $PGDATA
% tar xf /tmp/data.tar

create "recovery.conf" file

% rm -rf $PGDATA/pg_xlog
% cp -R /tmp/pg_xlog $PGDATA/.
% pg_ctl start

check data after recovery
% psql -c "SELECT count(*) from history"
count
-------
19999
(1 row)
------------------------
Number of records should be 20000, but result is 19999.
I found lack of data that was inserted near backup time, as a
result of comparing original "history" table and "history" table which
was recovered.
Is my backup procedure wrong?

Thanks,
Satoshi OKADA


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PITR - Some data is not recovered.
Date: 2004-08-09 16:59:10
Message-ID: 16403.1092070750@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp> writes:
> I'm testing PITR using pgbench and postgresql ver.8.0bata.

Is this actually the official beta1 version, or is it a snapshot from
last week sometime? In the first commit that had pg_start_backup(),
there wasn't any logic to ensure that the recovery replay would really
start from before the backup began. If a checkpoint occurred after you
started the tar run but before tar got around to copying pg_control,
the recovery wouldn't work properly. I can't be sure but your report
seems consistent with such a problem.

[ digs in CVS logs ] The necessary additional code was committed in
revision 1.155 of src/backend/access/transam/xlog.c. If you have 1.154
please update and try again. If you do have 1.155 or later then we
need to look closer.

regards, tom lane


From: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PITR - Some data is not recovered.
Date: 2004-08-10 01:50:43
Message-ID: 411829F3.9040508@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp> writes:
>
>>I'm testing PITR using pgbench and postgresql ver.8.0bata.
>>
>
>Is this actually the official beta1 version, or is it a snapshot from
>last week sometime?
>
I got it from
ftp.postgresql.org/pub/source/v8.0.0beta/postgresql-8.0.0beta1.tar.gz, and
xlog.c revision number is 1.157.

Thanks,
Satoshi OKADA


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PITR - Some data is not recovered.
Date: 2004-08-10 02:04:53
Message-ID: 23126.1092103493@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp> writes:
> Tom Lane wrote:
>> Is this actually the official beta1 version, or is it a snapshot from
>> last week sometime?
>>
> I got it from
> ftp.postgresql.org/pub/source/v8.0.0beta/postgresql-8.0.0beta1.tar.gz, and
> xlog.c revision number is 1.157.

Okay, so theoretically it should work ... I'm trying it now to see
if I can reproduce the problem locally.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PITR - Some data is not recovered.
Date: 2004-08-10 03:19:25
Message-ID: 23698.1092107965@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Okay, so theoretically it should work ... I'm trying it now to see
> if I can reproduce the problem locally.

It took several tries, but eventually I did reproduce it. It seems
the triggering condition is for the REDO pointer to be before the
checkpoint record itself. Not sure why, yet, but it looks like the
record(s) in between are not restored for some reason...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PITR - Some data is not recovered.
Date: 2004-08-10 21:31:07
Message-ID: 1690.1092173467@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp> writes:
> I'm testing PITR using pgbench and postgresql ver.8.0bata.
> I think that result of recovery is wrong.

Many thanks for this bug report. I have developed a patch (attached)
that seems to fix it for me. Please test and see if you can still
cause the problem.

BTW, I found that it was much easier to cause the bug to happen if you
force a long commit delay. To do this you need to have
fsync = on
commit_siblings = 1
commit_delay = 100000
and while the pgbench run is going, start a new psql and do
begin;
create table foo(f1 int);
(or any other command that modifies the database). If you are watching
things with "top" you should notice a drastic decrease in the CPU
consumption of the backend connected to pgbench as soon as you do this.
Then proceed with taking the on-line backup. You can end the extra psql
session after the backup is done, to allow pgbench to go back to
normal speed.

Doing it this way I got a failure just about every time (3 failures in 3
tries) whereas without, I had seen only one failure in four tries.

regards, tom lane

Index: src/backend/access/transam/xact.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/access/transam/xact.c,v
retrieving revision 1.177
diff -c -r1.177 xact.c
*** src/backend/access/transam/xact.c 3 Aug 2004 15:57:26 -0000 1.177
--- src/backend/access/transam/xact.c 10 Aug 2004 20:55:48 -0000
***************
*** 574,586 ****
START_CRIT_SECTION();

/*
! * We only need to log the commit in XLOG if the transaction made
! * any transaction-controlled XLOG entries or will delete files.
* (If it made no transaction-controlled XLOG entries, its XID
* appears nowhere in permanent storage, so no one else will ever care
! * if it committed.)
*/
madeTCentries = (MyLastRecPtr.xrecoff != 0);
if (madeTCentries || nrels > 0)
{
XLogRecData rdata[3];
--- 574,601 ----
START_CRIT_SECTION();

/*
! * If our transaction made any transaction-controlled XLOG entries,
! * we need to lock out checkpoint start between writing our XLOG
! * record and updating pg_clog. Otherwise it is possible for the
! * checkpoint to set REDO after the XLOG record but fail to flush the
! * pg_clog update to disk, leading to loss of the transaction commit
! * if we crash a little later. Slightly klugy fix for problem
! * discovered 2004-08-10.
! *
* (If it made no transaction-controlled XLOG entries, its XID
* appears nowhere in permanent storage, so no one else will ever care
! * if it committed; so it doesn't matter if we lose the commit flag.)
! *
! * Note we only need a shared lock.
*/
madeTCentries = (MyLastRecPtr.xrecoff != 0);
+ if (madeTCentries)
+ LWLockAcquire(CheckpointStartLock, LW_SHARED);
+
+ /*
+ * We only need to log the commit in XLOG if the transaction made
+ * any transaction-controlled XLOG entries or will delete files.
+ */
if (madeTCentries || nrels > 0)
{
XLogRecData rdata[3];
***************
*** 668,673 ****
--- 683,692 ----
TransactionIdCommitTree(nchildren, children);
}

+ /* Unlock checkpoint lock if we acquired it */
+ if (madeTCentries)
+ LWLockRelease(CheckpointStartLock);
+
END_CRIT_SECTION();
}

***************
*** 850,855 ****
--- 869,876 ----
*
* We do not flush XLOG to disk unless deleting files, since the
* default assumption after a crash would be that we aborted, anyway.
+ * For the same reason, we don't need to worry about interlocking
+ * against checkpoint start.
*/
if (MyLastRecPtr.xrecoff != 0 || nrels > 0)
{
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql-server/src/backend/access/transam/xlog.c,v
retrieving revision 1.158
diff -c -r1.158 xlog.c
*** src/backend/access/transam/xlog.c 9 Aug 2004 16:26:01 -0000 1.158
--- src/backend/access/transam/xlog.c 10 Aug 2004 20:55:49 -0000
***************
*** 4699,4704 ****
--- 4699,4713 ----
checkPoint.ThisTimeLineID = ThisTimeLineID;
checkPoint.time = time(NULL);

+ /*
+ * We must hold CheckpointStartLock while determining the checkpoint
+ * REDO pointer. This ensures that any concurrent transaction commits
+ * will be either not yet logged, or logged and recorded in pg_clog.
+ * See notes in RecordTransactionCommit().
+ */
+ LWLockAcquire(CheckpointStartLock, LW_EXCLUSIVE);
+
+ /* And we need WALInsertLock too */
LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);

/*
***************
*** 4731,4736 ****
--- 4740,4746 ----
ControlFile->checkPointCopy.redo.xrecoff)
{
LWLockRelease(WALInsertLock);
+ LWLockRelease(CheckpointStartLock);
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
return;
***************
*** 4789,4794 ****
--- 4799,4807 ----
* GetSnapshotData needs to get XidGenLock while holding SInvalLock,
* so there's a risk of deadlock. Need to find a better solution. See
* pgsql-hackers discussion of 17-Dec-01.
+ *
+ * XXX actually, the whole UNDO code is dead code and unlikely to ever
+ * be revived, so the lack of a good solution here is not troubling.
*/
#ifdef NOT_USED
checkPoint.undo = GetUndoRecPtr();
***************
*** 4798,4808 ****
#endif

/*
! * Now we can release insert lock, allowing other xacts to proceed
! * even while we are flushing disk buffers.
*/
LWLockRelease(WALInsertLock);

/*
* Get the other info we need for the checkpoint record.
*/
--- 4811,4823 ----
#endif

/*
! * Now we can release insert lock and checkpoint start lock, allowing
! * other xacts to proceed even while we are flushing disk buffers.
*/
LWLockRelease(WALInsertLock);

+ LWLockRelease(CheckpointStartLock);
+
/*
* Get the other info we need for the checkpoint record.
*/
Index: src/include/storage/lwlock.h
===================================================================
RCS file: /cvsroot/pgsql-server/src/include/storage/lwlock.h,v
retrieving revision 1.12
diff -c -r1.12 lwlock.h
*** src/include/storage/lwlock.h 11 Jun 2004 16:43:24 -0000 1.12
--- src/include/storage/lwlock.h 10 Aug 2004 20:55:49 -0000
***************
*** 36,41 ****
--- 36,42 ----
WALWriteLock,
ControlFileLock,
CheckpointLock,
+ CheckpointStartLock,
RelCacheInitLock,
BgWriterCommLock,


From: OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PITR - Some data is not recovered.
Date: 2004-08-11 03:00:59
Message-ID: 41198BEB.5010807@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>OKADA Satoshi <okada(dot)satoshi(at)lab(dot)ntt(dot)co(dot)jp> writes:
>
>
>>I'm testing PITR using pgbench and postgresql ver.8.0bata.
>>I think that result of recovery is wrong.
>>
>>
>
>Many thanks for this bug report. I have developed a patch (attached)
>that seems to fix it for me. Please test and see if you can still
>cause the problem.
>
I got correct results! Thanks for developing patch.

I will test PITR about various patterns. It will report, if there are
some failures.

Thanks,
Satoshi OKADA