Re: Recovery inconsistencies, standby much larger than primary

From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: Recovery inconsistencies, standby much larger than primary
Date: 2014-02-12 12:52:20
Message-ID: CAM-w4HO_iqo97VPCs1ZigQk7MGePFjWUZf=wpsTPy-4izhiMJA@mail.gmail.com
Lists: pgsql-hackers

So I think I've come up with a scenario that could cause this. I don't
think it's exactly what happened here but maybe something analogous
happened with our base backup restore.

On the primary you extend a table a bunch, including adding new
segments, but crash before committing (or checkpointing). Then some of
the blocks but not all may be written to disk. Assume they're all
written except for the last block of the first file. So what you have
is a .999G file followed by, say, 9 1G files. (Or maybe the hot backup
process could just catch the files in this state if a table is rapidly
growing and it doesn't take care to avoid picking up new files that
appear after it starts?)

smgrnblocks() stops at the first < 1GB segment and ignores the rest.
This code in xlog uses it to calculate how many blocks to add but it
only calls it once and then doesn't recheck where it's at as it
extends the relation. As soon as it adds that one missing block the
remaining files become visible. P_NEW always recalculates the position
based on smgrnblocks each time (which sounds pretty inefficient but
anyways....) so it will add the requested blocks to the new end.
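
A toy model of the behaviour described above, as a rough sketch only. The
names toy_nblocks and toy_extend_p_new are hypothetical stand-ins for
smgrnblocks() and a P_NEW extension, and the segment size is shrunk to 4
blocks so the effect is easy to see (a default build uses 131072 blocks of
8kB per 1GB segment).

```python
# Toy model: a relation is a list of per-segment block counts.
# Hypothetical helpers, not PostgreSQL APIs.
RELSEG_SIZE = 4  # shrunk for illustration; 131072 in a default build


def toy_nblocks(segments):
    """Count blocks, stopping at the first segment that is not full."""
    total = 0
    for seg_blocks in segments:
        total += seg_blocks
        if seg_blocks < RELSEG_SIZE:
            break  # segment files after a short one are never looked at
    return total


def toy_extend_p_new(segments):
    """Add one block at the currently visible end; return its block number."""
    blkno = toy_nblocks(segments)      # recomputed on every call
    segno = blkno // RELSEG_SIZE
    if segno == len(segments):
        segments.append(0)             # start a new segment file
    segments[segno] += 1
    return blkno


segments = [3, 4, 4]                   # first segment is one block short
before = toy_nblocks(segments)         # only the short first segment is visible
first = toy_extend_p_new(segments)     # fills the hole in segment 0
after = toy_nblocks(segments)          # the later segments are visible now
second = toy_extend_p_new(segments)    # lands past the real end of the table
```

In this model a single extension moves the visible size from 3 blocks to
12, and the next P_NEW request goes to block 12 rather than block 4.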

Now this isn't enough to explain things since surely the extension
records would be in the xlog in physical order. But this could have
all happened after an earlier vacuum truncated the relation and we
could be replaying records that predate that.

So in short, if you have a 10G table and want to overwrite the last
block but the first segment is one block short then xlog will add 9G
to the end and write the block there. That sounds like what we've
seen.
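
The arithmetic behind that 9G figure can be checked with a few lines. This
is only a sketch: it assumes default build settings (8kB blocks, 1GB
segments) and assumes the replay loop extends once for every block between
the size it saw up front and the target block. The variable names are
made up for illustration.

```python
BLCKSZ = 8192                        # default block size in bytes
RELSEG_SIZE = 131072                 # blocks per 1GB segment (default build)

logical_blocks = 10 * RELSEG_SIZE    # the table spans ten segment files
visible_blocks = RELSEG_SIZE - 1     # first segment is one block short
target_blkno = logical_blocks - 1    # the record touches the last block

# The number of extensions is fixed by the single up-front size check.
extensions = target_blkno - visible_blocks + 1

# One extension fills the hole in the first segment; every later one
# lands past the real end of the table.
appended_past_end = extensions - 1
appended_gb = appended_past_end * BLCKSZ / 2**30
```

Under those assumptions appended_gb comes out as exactly 9.0.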

I think the easy fix is to change the code in xlogutils to be more
defensive and stop as soon as it finds BufferGetBlockNumber(buffer) ==
blkno (which is what it has in the assert already).
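
A sketch of the two loop shapes on a toy relation, not the actual
xlogutils code. ToyRelation, replay_fixed_count and replay_defensive are
hypothetical names, and segments are shrunk to 4 blocks. One assumption
to flag: in this toy scenario the block numbers handed back jump straight
past the target, so a strict equality test would never fire; the
defensive loop below therefore stops once the returned block number
reaches or passes the target.

```python
class ToyRelation:
    """Toy segmented relation; hypothetical, not a PostgreSQL structure."""

    def __init__(self, segments, seg_size):
        self.segments = list(segments)
        self.seg_size = seg_size

    def nblocks(self):
        # Stops at the first short segment, ignoring later files.
        total = 0
        for n in self.segments:
            total += n
            if n < self.seg_size:
                break
        return total

    def extend(self):
        # Models P_NEW: add a block at the currently visible end.
        blkno = self.nblocks()
        segno = blkno // self.seg_size
        if segno == len(self.segments):
            self.segments.append(0)
        self.segments[segno] += 1
        return blkno

    def on_disk_blocks(self):
        return sum(self.segments)


def replay_fixed_count(rel, target):
    """Check the size once, then extend a fixed number of times."""
    lastblock = rel.nblocks()
    got = None
    while target >= lastblock:
        got = rel.extend()
        lastblock += 1
    return got


def replay_defensive(rel, target):
    """Same loop, but stop once an extension reaches or passes the target."""
    lastblock = rel.nblocks()
    got = None
    while target >= lastblock:
        got = rel.extend()
        lastblock += 1
        if got >= target:
            break
    return got


buggy = ToyRelation([3, 4, 4], seg_size=4)
careful = ToyRelation([3, 4, 4], seg_size=4)
buggy_last = replay_fixed_count(buggy, target=11)
careful_last = replay_defensive(careful, target=11)
```

Here the fixed-count loop grows the toy relation from 11 blocks to 20,
while the defensive loop stops at 13. In the defensive case the caller
would still have to go back and read the target block itself, since the
last extension overshot it.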
--
greg
