Re: crash-safe visibility map, take five

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: crash-safe visibility map, take five
Date: 2011-06-17 03:17:52
Message-ID: 20110617031752.GA11947@tornado.leadboat.com
Lists: pgsql-hackers

Robert,

I took a look at this patch. To summarize its purpose as I understand it, the
patch does two independent but synergistic things:

1. Move the INSERT/UPDATE/DELETE clearing of visibility map bits from after
the main operation to before it. This has no effect on crash recovery. It
closes a race condition described in the last paragraphs of visibilitymap.c's
header comment.

2. In the words of a comment added by the patch:
* The critical integrity requirement here is that we must never end up with
* a situation where the visibility map bit is set, and the page-level
* PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent
* page modification would fail to clear the visibility map bit.
It does this by WAL-logging the operation of setting a vm bit. This also has
the benefit of getting vm bits set correctly on standbys.
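
For readers following along, the new record type is minimal; a sketch of its
shape, inferred from the redo code quoted below (the patch's actual
declaration may differ in detail):

    /* WAL record payload for XLOG_HEAP2_VISIBLE */
    typedef struct xl_heap_visible
    {
        RelFileNode node;       /* relation containing the heap block */
        BlockNumber block;      /* heap block whose vm bit was set */
    } xl_heap_visible;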

On Mon, May 23, 2011 at 03:54:44PM -0400, Robert Haas wrote:
> On Wed, May 11, 2011 at 8:46 AM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
> > Why do the set and clear functions need pass-by-reference (Buffer *)
> > argument ? I don't see them modifying the argument at all. This patch adds
> > the clear function, but the existing set function also suffers from that.
>
> Yep, I just copied the style of the existing function, in the interest
> of changing no more than necessary. But maybe it'd be better to
> change them both to take just the Buffer, instead of a pointer.
> Anyone else want to weigh in?

+1 for Buffer over Buffer * when the value isn't mutated for the caller.

I suggest revisiting the suggestion in
http://archives.postgresql.org/message-id/27743.1291135210@sss.pgh.pa.us and
including a latestRemovedXid in xl_heap_visible. The range of risky xids is
the same for setting a visibility map bit as for deleting a dead tuple, and
the same operation (VACUUM) produces both conflicts. The system that has
fueled my reports related to standby conflict would hit a snapshot conflict
more or less instantly without vacuum_defer_cleanup_age. (With 9.1, we'll use
hot_standby_feedback and never look back.) Adding visibility map bits as a
source of conflict costs nothing to a system like that. You might not like it
on an INSERT-only system whose VACUUMs only update visibility. Such systems
might like a GUC on the standby that disables both index-only scans and
conflict on xl_heap_visible.latestRemovedXid. For normal read/write systems,
there will be no advantage.
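
To make that concrete, a rough, untested sketch of what I have in mind,
assuming the redo side can use the same conflict-resolution entry point that
heap_xlog_clean() already uses:

    typedef struct xl_heap_visible
    {
        RelFileNode   node;
        BlockNumber   block;
        TransactionId latestRemovedXid;  /* for Hot Standby conflict checks */
    } xl_heap_visible;

    /* in heap_xlog_visible(), before touching any buffers */
    if (InHotStandby)
        ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
                                            xlrec->node);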

The patch follows a design hashed out in Nov-Dec 2010. On its own, it adds
overhead to VACUUM for thin practical benefit. However, there's broad
community acceptance that this is a step on a deterministic path toward
implementing index-only scans, which is where the worthwhile benefit lies.

lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only
changed the recptr argument for one of them. This has the effect that we only
emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such
as those not modified since the transaction creating the table. I fixed this
before testing further.

Next, for no especially good reason, I set a breakpoint on the closing brace
of visibilitymap_set, ran a VACUUM that called it, and instigated a PANIC from
gdb. Recovery failed like this:

3614 2011-06-16 20:05:18.118 EDT LOG: 00000: redo starts at 0/172ABA0
3614 2011-06-16 20:05:18.118 EDT LOCATION: StartupXLOG, xlog.c:6494
3614 2011-06-16 20:05:18.119 EDT FATAL: XX000: lock 148 is not held
3614 2011-06-16 20:05:18.119 EDT CONTEXT: xlog redo visible: rel 1663/16384/24588; blk 0
3614 2011-06-16 20:05:18.119 EDT LOCATION: LWLockRelease, lwlock.c:587
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1695)
3613 2011-06-16 20:05:18.119 EDT LOG: 00000: startup process (PID 3614) was terminated by signal 6: Aborted
3613 2011-06-16 20:05:18.119 EDT LOCATION: LogChildExit, postmaster.c:2881
3613 2011-06-16 20:05:18.119 EDT LOG: 00000: aborting startup due to startup process failure
3613 2011-06-16 20:05:18.119 EDT LOCATION: reaper, postmaster.c:2376

This happens due to heap_xlog_visible() calling UnlockReleaseBuffer() despite
having taken no buffer content lock. I added
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
before the "if" to get past this.

I proceeded to do some immediate-shutdown tests to see if we get the bits set
as expected. I set up a database like this:
CREATE EXTENSION pageinspect;
CREATE TABLE t (c int);
INSERT INTO t VALUES (2);
CHECKPOINT;
I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
with "VACUUM t". I checked bits with this query:
SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
       to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
set). 0,5 is fine after a crash. 1,1 means we've broken our contract: the VM
bit is set while PD_ALL_VISIBLE is not set.
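(For reference: byte 24 is the first byte past the 24-byte vm page header, so
it covers heap blocks 0-7, and the heap flag values are per bufpage.h,
0x01 = PD_HAS_FREE_LINES and 0x04 = PD_ALL_VISIBLE.)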

The first test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
postmaster. The system came back up with 1,1 bits. I poked around enough to
see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
deeper toward a root cause. If you have trouble reproducing this, let me know
so I can assemble a complete, self-contained test case.

To crudely assess overhead, I created a table and set its hint bits like this:
create table t as select * from generate_series(1,30000000);
select count(*) from t;
autovacuum was off. I built with -O2 and assert checking. Then, I copied the
data directory and ran these commands multiple times with different builds:
SELECT pg_current_xlog_insert_location();
SET debug_assertions = off;
VACUUM t;
SELECT pg_current_xlog_insert_location();

Results:
-Build-    -xlog before-  -xlog after-  -VACUUM time (ms)-
patched    0/171E278      0/1D3FE68     50329.218
patched    0/171E278      0/1D3FE68     48871.656
patched    0/171E278      0/1D3FE68     46545.024
unpatched  0/171E278      0/17202D0     49614.714
unpatched  0/171E278      0/17202D0     45966.985
unpatched  0/171E278      0/17202D0     44133.895
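
(Arithmetic check: 0x1D3FE68 - 0x171E278 = 0x621BF0, about 6.1 MiB of WAL per
patched run, versus 0x17202D0 - 0x171E278 = 0x2058, about 8 KiB unpatched.)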

So, ~6 MiB of WAL to log the cleanness of ~1 GiB of table size. Seems small
enough to not worry about optimizing further at this time. I would like to
recheck it after we work out the functional issues, though.

No documentation updates present or needed. The patch adds no tests, but we
don't have any recovery tests in `make check'. That's bad, but it's not this
patch's job to fix it. The patch is a unified diff, but I suppose it won't matter
since you will be committing it.

The code looks pretty clean; a few minor comments and questions:

> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 346d6b9..cf7d165 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c

> @@ -2298,6 +2333,10 @@ l1:
>
> LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>
> + /* We're done with the visibility map buffer */

I would delete this comment. We were done earlier, but we needed to finish the
critical section.

> + if (vmbuffer != InvalidBuffer)
> + ReleaseBuffer(vmbuffer);
> +
> /*
> * If the tuple has toasted out-of-line attributes, we need to delete
> * those items too. We have to do this before releasing the buffer

> @@ -4331,6 +4435,78 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
> UnlockReleaseBuffer(buffer);
> }
>
> +/*
> + * Replay XLOG_HEAP2_VISIBLE record.
> + *
> + * The critical integrity requirement here is that we must never end up with
> + * a situation where the visibility map bit is set, and the page-level
> + * PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent
> + * page modification would fail to clear the visibility map bit.
> + */
> +static void
> +heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
> +{
> + xl_heap_visible *xlrec = (xl_heap_visible *) XLogRecGetData(record);
> + Buffer buffer;
> + Page page;
> +
> + /*
> + * Read the heap page, if it still exists. If the heap file has been
> + * dropped or truncated later in recovery, this might fail. In that case,
> + * there's no point in doing anything further, since the visibility map
> + * will have to be cleared out at the same time.
> + */
> + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block,
> + RBM_NORMAL);
> + if (!BufferIsValid(buffer))
> + return;
> + page = (Page) BufferGetPage(buffer);
> +
> + /*
> + * We don't bump the LSN of the heap page when setting the visibility
> + * map bit, because that would generate an unworkable volume of
> + * full-page writes. This exposes us to torn page hazards, but since
> + * we're not inspecting the existing page contents in any way, we
> + * don't care.
> + *
> + * However, all operations that clear the visibility map bit *do* bump
> + * the LSN, and those operations will only be replayed if the XLOG LSN
> + * precedes the page LSN. Thus, if the page LSN has advanced past our
> + * XLOG record's LSN, we mustn't mark the page all-visible, because
> + * the subsequent update won't be replayed to clear the flag.
> + */

Concerning the optimization in heap_xlog_delete() et al. of not changing the
page when its LSN is newer -- am I correct that it only ever triggers when
full_page_writes = off? Assuming yes ...

> + if (XLByteLE(lsn, PageGetLSN(page)))

(As mentioned above, this test failed improperly in one of my tests.)

> + {
> + PageSetAllVisible(page);
> + MarkBufferDirty(buffer);
> + }
> +
> + /* Done with heap page. */
> + UnlockReleaseBuffer(buffer);

(As mentioned above, no lock is held at this point.)

> +
> + /*
> + * Even we skipped the heap page update due to the LSN interlock, it's
> + * still safe to update the visibility map. Any WAL record that clears
> + * the visibility map bit does so before checking the page LSN, so any
> + * bits that need to be cleared will still be cleared.
> + */
> + if (record->xl_info & XLR_BKP_BLOCK_1)
> + RestoreBkpBlocks(lsn, record, false);
> + else
> + {
> + Relation reln;
> + Buffer vmbuffer = InvalidBuffer;
> +
> + reln = CreateFakeRelcacheEntry(xlrec->node);
> + visibilitymap_pin(reln, xlrec->block, &vmbuffer);
> + /* Don't set the bit if replay has already passed this point. */
> + if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
> + visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);

... wouldn't it be better to do this unconditionally? Some later record will
unset it if needed, because all replay functions that clear the bit do so
without consulting the vm page LSN. On the other hand, the worst consequence
of the way you've done it is VACUUM visiting the page one more time to set the
bit.

> + ReleaseBuffer(vmbuffer);
> + FreeFakeRelcacheEntry(reln);
> + }
> +}
> +
> static void
> heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
> {

> --- a/src/backend/access/heap/hio.c
> +++ b/src/backend/access/heap/hio.c

> @@ -237,23 +239,37 @@ RelationGetBufferForTuple(Relation relation, Size len,
> * Read and exclusive-lock the target block, as well as the other
> * block if one was given, taking suitable care with lock ordering and
> * the possibility they are the same block.
> + *
> + * If the page-level all-visible flag is set, caller will need to clear
> + * both that and the corresponding visibility map bit. However, by the
> + * time we return, we'll have x-locked the buffer, and we don't want to
> + * do any I/O while in that state. So we check the bit here before
> + * taking the lock, and pin the page if it appears necessary.
> + * Checking without the lock creates a risk of getting the wrong
> + * answer, so we'll have to recheck after acquiring the lock.
> */

I think it's worth noting, perhaps based on your explanation in the
second-to-last paragraph of
http://archives.postgresql.org/message-id/BANLkTi=b7jVmq6fA_EXLCYgzuyV1u9at4A@mail.gmail.com
that the answer may be incorrect again after the recheck. We don't care:
redundant clearings of the visibility bit are no problem.

> @@ -261,11 +277,36 @@ RelationGetBufferForTuple(Relation relation, Size len,
> {
> /* lock target buffer first */
> buffer = ReadBuffer(relation, targetBlock);
> + if (PageIsAllVisible(BufferGetPage(buffer)))
> + visibilitymap_pin(relation, targetBlock, vmbuffer);
> LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
> }
>
> /*
> + * If the page is all visible but we don't have the right visibility
> + * map page pinned, then give up our locks, go get the pin, and
> + * re-lock. This is pretty painful, but hopefully shouldn't happen
> + * often. Note that there's a small possibility that we didn't pin
> + * the page above but still have the correct page pinned anyway, either
> + * because we've already made a previous pass through this loop, or
> + * because caller passed us the right page anyway.
> + */
> + if (PageIsAllVisible(BufferGetPage(buffer))
> + && !visibilitymap_pin_ok(targetBlock, *vmbuffer))
> + {
> + LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> + if (otherBlock != targetBlock)
> + LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
> + visibilitymap_pin(relation, targetBlock, vmbuffer);

Suppose you pin one VM page for a certain candidate heap page, then loop
around to another candidate that has a different VM page. I don't see
anywhere that you ReleaseBuffer() the first vm page; am I missing it? I
didn't try to construct a test case, but a relation with a block below 512 MiB
that has too little free space and a sufficiently large block above 512 MiB should
tickle that case. The function header comment should document that vmbuffer,
if not InvalidBuffer, must be pinned and will be unpinned if modified (or
whatever other semantics you choose).

> + if (otherBuffer != InvalidBuffer && otherBlock < targetBlock)
> + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
> + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> + if (otherBuffer != InvalidBuffer && otherBlock > targetBlock)
> + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
> + }
> +
> + /*
> * Now we can check to see if there's enough free space here. If so,
> * we're done.
> */
> diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
> index 58bab7d..3a33463 100644
> --- a/src/backend/access/heap/visibilitymap.c
> +++ b/src/backend/access/heap/visibilitymap.c
> @@ -11,10 +11,11 @@
> * src/backend/access/heap/visibilitymap.c
> *
> * INTERFACE ROUTINES
> - * visibilitymap_clear - clear a bit in the visibility map
> - * visibilitymap_pin - pin a map page for setting a bit
> - * visibilitymap_set - set a bit in a previously pinned page
> - * visibilitymap_test - test if a bit is set
> + * visibilitymap_clear - clear a bit in the visibility map
> + * visibilitymap_pin - pin a map page for setting a bit
> + * visibilitymap_pin_ok - check whether correct map page is already pinned
> + * visibilitymap_set - set a bit in a previously pinned page
> + * visibilitymap_test - test if a bit is set

Several lines above are inconsistent on internal tabs/spaces.

> *
> * NOTES
> *
> @@ -76,19 +77,11 @@
> * index would still be visible to all other backends, and deletions wouldn't
> * be visible to other backends yet. (But HOT breaks that argument, no?)

The two paragraphs ending above are ready to be removed.

> *
> - * There's another hole in the way the PD_ALL_VISIBLE flag is set. When
> - * vacuum observes that all tuples are visible to all, it sets the flag on
> - * the heap page, and also sets the bit in the visibility map. If we then
> - * crash, and only the visibility map page was flushed to disk, we'll have
> - * a bit set in the visibility map, but the corresponding flag on the heap
> - * page is not set. If the heap page is then updated, the updater won't
> - * know to clear the bit in the visibility map. (Isn't that prevented by
> - * the LSN interlock?)
> - *
> *-------------------------------------------------------------------------
> */
> #include "postgres.h"
>
> +#include "access/heapam.h"
> #include "access/visibilitymap.h"
> #include "storage/bufmgr.h"
> #include "storage/bufpage.h"
> @@ -127,38 +120,37 @@ static void vm_extend(Relation rel, BlockNumber nvmblocks);
> /*
> * visibilitymap_clear - clear a bit in visibility map
> *
> - * Clear a bit in the visibility map, marking that not all tuples are
> - * visible to all transactions anymore.
> + * You must pass a buffer containing the correct map page to this function.
> + * Call visibilitymap_pin first to pin the right one. This function doesn't do
> + * any I/O.
> */
> void
> -visibilitymap_clear(Relation rel, BlockNumber heapBlk)
> +visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer *buf)
> {
> BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
> int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
> int mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
> uint8 mask = 1 << mapBit;
> - Buffer mapBuffer;
> char *map;
>
> #ifdef TRACE_VISIBILITYMAP
> elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
> #endif
>
> - mapBuffer = vm_readbuf(rel, mapBlock, false);
> - if (!BufferIsValid(mapBuffer))
> - return; /* nothing to do */
> + if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
> + elog(ERROR, "wrong buffer passed to visibilitymap_clear");
>

This could just as well be an Assert.
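For example:

    Assert(BufferIsValid(*buf) && BufferGetBlockNumber(*buf) == mapBlock);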

> - LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
> - map = PageGetContents(BufferGetPage(mapBuffer));
> + LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
> + map = PageGetContents(BufferGetPage(*buf));
>
> if (map[mapByte] & mask)
> {
> map[mapByte] &= ~mask;
>
> - MarkBufferDirty(mapBuffer);
> + MarkBufferDirty(*buf);
> }
>
> - UnlockReleaseBuffer(mapBuffer);
> + LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
> }
>
> /*
> @@ -194,15 +186,32 @@ visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf)
> }
>
> /*
> + * visibilitymap_pin_ok - do we already have the correct page pinned?
> + *
> + * On entry, buf should be InvalidBuffer or a valid buffer returned by
> + * an earlier call to visibilitymap_pin or visibilitymap_test on the same
> + * relation. The return value indicates whether the buffer covers the
> + * given heapBlk.
> + */
> +bool
> +visibilitymap_pin_ok(BlockNumber heapBlk, Buffer buf)
> +{
> + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
> +
> + return BufferIsValid(buf) && BufferGetBlockNumber(buf) == mapBlock;
> +}

This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem to
fit the task too well. This function doesn't check the pin or that the buffer
applies to the correct relation. Consider naming it something like
`visibilitymap_buffer_covers_block'.

> +
> +/*
> * visibilitymap_set - set a bit on a previously pinned page
> *
> - * recptr is the LSN of the heap page. The LSN of the visibility map page is
> - * advanced to that, to make sure that the visibility map doesn't get flushed
> - * to disk before the update to the heap page that made all tuples visible.
> + * recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
> + * or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
> + * one provided; in normal running, we generate a new XLOG record and set the
> + * page LSN to that value.

I like the sentiment of sharing more code between the regular and redo cases,
but I don't care for this approach. This doesn't look like an invocation that
skips WAL for the operation, but it is:
visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);
In fact, it's almost certainly the wrong way to call this function, because
PageGetLSN(page) may or may not be valid. On the other hand, with just three
call sites, and those not likely to grow much, we can just get it right.
Still, consider adding:
Assert(InRecovery || XLogRecPtrIsInvalid(recptr));

> *
> - * This is an opportunistic function. It does nothing, unless *buf
> - * contains the bit for heapBlk. Call visibilitymap_pin first to pin
> - * the right map page. This function doesn't do any I/O.
> + * You must pass a buffer containing the correct map page to this function.
> + * Call visibilitymap_pin first to pin the right one. This function doesn't do
> + * any I/O.
> */
> void
> visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
> @@ -220,7 +229,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
>
> /* Check that we have the right page pinned */
> if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
> - return;
> + elog(ERROR, "wrong buffer passed to visibilitymap_set");

Likewise, this could just as well be an Assert.

>
> page = BufferGetPage(*buf);
> map = PageGetContents(page);
> @@ -230,7 +239,10 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
> {
> map[mapByte] |= (1 << mapBit);
>
> - if (XLByteLT(PageGetLSN(page), recptr))
> + if (XLogRecPtrIsInvalid(recptr) && RelationNeedsWAL(rel))
> + recptr = log_heap_visible(rel->rd_node, heapBlk, *buf);
> +
> + if (!XLogRecPtrIsInvalid(recptr))
> PageSetLSN(page, recptr);
> PageSetTLI(page, ThisTimeLineID);

Might this be clearer as follows?

if (RelationNeedsWAL(rel))
{
    if (XLogRecPtrIsInvalid(recptr))
        recptr = log_heap_visible(rel->rd_node, heapBlk, *buf);
    PageSetLSN(page, recptr);
    PageSetTLI(page, ThisTimeLineID);
}

> --- a/src/backend/commands/vacuumlazy.c
> +++ b/src/backend/commands/vacuumlazy.c
> @@ -479,7 +479,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
> visibilitymap_pin(onerel, blkno, &vmbuffer);
> LockBuffer(buf, BUFFER_LOCK_SHARE);
> if (PageIsAllVisible(page))
> - visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer);
> + visibilitymap_set(onerel, blkno, InvalidXLogRecPtr,
> + &vmbuffer);

As mentioned above, there's another call to visibilitymap_set() later in this
function that needs the same change.

Thanks,
nm
