Quick Links

Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

Lists:	pgsql-hackers

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	[PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync
Date:	2007-04-19 20:37:33
Message-ID:	4627D30D.8020803@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I believe I have discovered the following problem in pgsql 8.2 and HEAD,
concerning warm-standbys using WAL log shipping.

The problem is that after a crash, the master might complete incomplete
actions via rm_cleanup() - but since it won't wal-log those changes,
the slave won't know about this. This will at least prevent the creation
of any further restart points on the slave (because safe_restartpoint)
will never return true again - it it might even cause data corruption,
if subsequent wal records are interpreted wrongly by the slave because
it sees other data than the master did when it generated them.

Attached is a patch that lets RecoveryRestartPoint call all
rm_cleanup() methods and create a restart point whenever it encounters
a shutdown checkpoint in the wal (because those are generated after
recovery). This ought not cause a performance degradation, because
shutdown checkpoints will occur very infrequently.

The patch is per discussion with Simon Riggs.

I've not yet had a chance to test this patch, I only made sure
that it compiles. I'm sending this out now because I hope this
might make it into 8.2.4.

greetings, Florian Pflug

Attachment	Content-Type	Size
incompleteactions-bugfix.patch	text/x-patch	4.9 KB

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Date:	2007-04-23 10:14:05
Message-ID:	1177323246.3650.186.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 2007-04-19 at 22:37 +0200, Florian G. Pflug wrote:

> I believe I have discovered the following problem in pgsql 8.2 and HEAD,
> concerning warm-standbys using WAL log shipping.
>
> The problem is that after a crash, the master might complete incomplete
> actions via rm_cleanup() - but since it won't wal-log those changes,
> the slave won't know about this. This will at least prevent the creation
> of any further restart points on the slave (because safe_restartpoint)
> will never return true again - it it might even cause data corruption,
> if subsequent wal records are interpreted wrongly by the slave because
> it sees other data than the master did when it generated them.

I agree the problem exists. It is somewhat rare because the idea is that
if the Primary crashes we would failover to the Standby, which would
mean that both Primary and Standby have executed rm_cleanup(), if
needed.

So in the case where the Primary fails and we choose *not* to failover,
there is a potential difficulty on the Standby.

> Attached is a patch that lets RecoveryRestartPoint call all
> rm_cleanup() methods and create a restart point whenever it encounters
> a shutdown checkpoint in the wal (because those are generated after
> recovery). This ought not cause a performance degradation, because
> shutdown checkpoints will occur very infrequently.
>
> The patch is per discussion with Simon Riggs.
>
> I've not yet had a chance to test this patch, I only made sure
> that it compiles. I'm sending this out now because I hope this
> might make it into 8.2.4.

The rationale for this fix could be described somewhat differently:

When we shutdown, we know for certain that safe_restartpoint() is true.
However, we don't know whether it is true because we successfully did a
clean shutdown, or because we crashed, recovered and then issued a
shutdown checkpoint following recovery. In the latter case we *must*
execute rm_cleanup() on the standby because it has been executed on the
primary. Not doing so at this point *might* be safe, but is not certain
to be safe. We don't *need* to log a restartpoint at this time, but it
seems sensible to do so.

We need to check that rm_cleanup() routines don't assume that they will
only ever be called once or this will clearly fail. There is also no
need to call rm_cleanup() unless rm_safe_restartpoint() is false.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Date:	2007-04-23 11:19:42
Message-ID:	462C964E.90405@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Thu, 2007-04-19 at 22:37 +0200, Florian G. Pflug wrote:
>> The problem is that after a crash, the master might complete incomplete
>> actions via rm_cleanup() - but since it won't wal-log those changes,
>> the slave won't know about this. This will at least prevent the creation
>> of any further restart points on the slave (because safe_restartpoint)
>> will never return true again - it it might even cause data corruption,
>> if subsequent wal records are interpreted wrongly by the slave because
>> it sees other data than the master did when it generated them.
>
> I agree the problem exists. It is somewhat rare because the idea is that
> if the Primary crashes we would failover to the Standby, which would
> mean that both Primary and Standby have executed rm_cleanup(), if
> needed.
>
> So in the case where the Primary fails and we choose *not* to failover,
> there is a potential difficulty on the Standby.

It occured to me today that this might plague 8.1 too. As you explain below,
the problem is not really connected to restartpoints - failing to create
them is merely a sympton of this. On 8.1, this might still lead to rm_cleanup()
being called much "later" (if you consider the wal position to be the "time")
on the slave than on the master. I dunno if this really causes trouble - I
don't yet understand the btree code well enough to judge this.

> The rationale for this fix could be described somewhat differently:
>
> When we shutdown, we know for certain that safe_restartpoint() is true.
> However, we don't know whether it is true because we successfully did a
> clean shutdown, or because we crashed, recovered and then issued a
> shutdown checkpoint following recovery. In the latter case we *must*
> execute rm_cleanup() on the standby because it has been executed on the
> primary. Not doing so at this point *might* be safe, but is not certain
> to be safe. We don't *need* to log a restartpoint at this time, but it
> seems sensible to do so.

While creating the patch, I've been thinking if it might be worthwile
to note that we just did recovery in the ShutdownCheckpoint
(or create a new checkpoint type RecoveryCheckpoint). This wouldl allow
for more error checking, because then the slave could check that
safe_restartpoint() is true for all ShutdownCheckpoints that were not
after recovering.

> We need to check that rm_cleanup() routines don't assume that they will
> only ever be called once or this will clearly fail. There is also no
> need to call rm_cleanup() unless rm_safe_restartpoint() is false.

But a non-idempotent rm_cleanup() routine will cause trouble anyway,
if postgres crashes after having called rm_cleanup() but before creating
the ShutdownCheckpoint.

greetings, Florian Pflug

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync
Date:	2007-04-27 02:35:40
Message-ID:	200704270235.l3R2ZeW02321@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Florian G. Pflug wrote:
> Hi
>
> I believe I have discovered the following problem in pgsql 8.2 and HEAD,
> concerning warm-standbys using WAL log shipping.
>
> The problem is that after a crash, the master might complete incomplete
> actions via rm_cleanup() - but since it won't wal-log those changes,
> the slave won't know about this. This will at least prevent the creation
> of any further restart points on the slave (because safe_restartpoint)
> will never return true again - it it might even cause data corruption,
> if subsequent wal records are interpreted wrongly by the slave because
> it sees other data than the master did when it generated them.
>
> Attached is a patch that lets RecoveryRestartPoint call all
> rm_cleanup() methods and create a restart point whenever it encounters
> a shutdown checkpoint in the wal (because those are generated after
> recovery). This ought not cause a performance degradation, because
> shutdown checkpoints will occur very infrequently.
>
> The patch is per discussion with Simon Riggs.
>
> I've not yet had a chance to test this patch, I only made sure
> that it compiles. I'm sending this out now because I hope this
> might make it into 8.2.4.
>
> greetings, Florian Pflug

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Date:	2007-07-05 18:13:25
Message-ID:	19664.1183659205@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

[ back to dealing with this patch, finally ]

"Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
> While creating the patch, I've been thinking if it might be worthwile
> to note that we just did recovery in the ShutdownCheckpoint
> (or create a new checkpoint type RecoveryCheckpoint). This wouldl allow
> for more error checking, because then the slave could check that
> safe_restartpoint() is true for all ShutdownCheckpoints that were not
> after recovering.

I concur that this is a good idea --- we should have a third checkpoint
record type that shows that a crash recovery occurred. However, we can
probably only do that for 8.3 and beyond. If we try to do it in
existing release branches then there's likelihood of trouble due to WAL
incompatibility between master and standby. While we do advise people
to update their standbys first, I don't think it's worth risking such
problems just to add some more error checking.

Conclusion: we should apply Florian's patch as-is in 8.2, do something
morally equivalent in 8.1 and before, and invent a
CrashRecoveryCheckpoint record type in HEAD.

regards, tom lane

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Date:	2007-07-05 18:43:31
Message-ID:	468D3BD3.2090901@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> [ back to dealing with this patch, finally ]
>
> "Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
>> While creating the patch, I've been thinking if it might be worthwile
>> to note that we just did recovery in the ShutdownCheckpoint
>> (or create a new checkpoint type RecoveryCheckpoint). This wouldl allow
>> for more error checking, because then the slave could check that
>> safe_restartpoint() is true for all ShutdownCheckpoints that were not
>> after recovering.
>
> I concur that this is a good idea --- we should have a third checkpoint
> record type that shows that a crash recovery occurred. However, we can
> probably only do that for 8.3 and beyond. If we try to do it in
> existing release branches then there's likelihood of trouble due to WAL
> incompatibility between master and standby. While we do advise people
> to update their standbys first, I don't think it's worth risking such
> problems just to add some more error checking.
>
> Conclusion: we should apply Florian's patch as-is in 8.2, do something
> morally equivalent in 8.1 and before, and invent a
> CrashRecoveryCheckpoint record type in HEAD.

Sounds good.

Do you want me to code up such patches for 8.1 and 8.3 in the next days,
or is someone else already working on it?

greetings, Florian Pflug

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Date:	2007-07-05 18:49:13
Message-ID:	20161.1183661353@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
> Tom Lane wrote:
>> Conclusion: we should apply Florian's patch as-is in 8.2, do something
>> morally equivalent in 8.1 and before, and invent a
>> CrashRecoveryCheckpoint record type in HEAD.

> Sounds good.

Actually, now that I look closer, this patch seems completely wrong.
It's predicated on an assumption that rm_cleanup won't write WAL entries
describing what it did ... but, at least in the btree case, it does.
(I think gist/gin might not, but that's a bug in those AMs not in xlog.)
I'm therefore wondering what test case led you to think there was
something wrong.

regards, tom lane

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync
Date:	2007-07-05 19:41:33
Message-ID:	468D496D.4020409@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> "Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
>> Tom Lane wrote:
>>> Conclusion: we should apply Florian's patch as-is in 8.2, do something
>>> morally equivalent in 8.1 and before, and invent a
>>> CrashRecoveryCheckpoint record type in HEAD.
>
>> Sounds good.
>
> Actually, now that I look closer, this patch seems completely wrong.
> It's predicated on an assumption that rm_cleanup won't write WAL entries
> describing what it did ... but, at least in the btree case, it does.
> (I think gist/gin might not, but that's a bug in those AMs not in xlog.)
> I'm therefore wondering what test case led you to think there was
> something wrong.

It wasn't a testcase - I was trying to understand the xlog code while working
on my concurrent walreplay patch, and wondered what happens if the master
crashes and then recovery while the slave keeps running.

I've re-read my original email to Simon, and it seems that I believed
that rm_cleanup methods won't bee able to write to the xlog because they are
called during recovery.

But StartupXLOG *does* make the wal append able *before* the rm_cleanup methods
are called.

So I now think (at least for btree) that everything is fine, and that I was
just being stupid.

Sorry for the noise, guys
greetings, Florian Pflug