Re: crash-safe visibility map, take five

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: crash-safe visibility map, take five
Date: 2011-06-17 17:21:17
Message-ID: BANLkTimiT4q+2Y8Mj1JghQtubtQtrkKGaQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> I took a look at this patch.

No kidding! Thanks for the very detailed review.

> +1 for Buffer over Buffer * when the value isn't mutated for the caller.

I changed this.

> I suggest revisiting the suggestion in
> http://archives.postgresql.org/message-id/27743.1291135210@sss.pgh.pa.us and
> including a latestRemovedXid in xl_heap_visible.  The range of risky xids is
> the same for setting a visibility map bit as for deleting a dead tuple, and
> the same operation (VACUUM) produces both conflicts.

See Heikki's follow-up email. The standby has to ignore all-visible
bits anyway, because the fact that a transaction is all-visible on the
master does not imply that it is all-visible on the standby. So I
don't think there's a problem here.

> lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only
> changed the recptr argument for one of them.  This has the effect that we only
> emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such
> as those not modified since the transaction creating the table.  I fixed this
> before testing further.

Good catch, thanks. I also added the Assert() that you recommended
further down.

> This happens due to heap_xlog_redo() calling UnlockReleaseBuffer() despite
> having taken no buffer content lock.  I added
>        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> before the "if" to get past this.

Fixed, thanks.

> I proceeded to do some immediate-shutdown tests to see if we get the bits set
> as expected.  I set up a database like this:
>        CREATE EXTENSION pageinspect;
>        CREATE TABLE t (c int);
>        INSERT INTO t VALUES (2);
>        CHECKPOINT;
> I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
> with "VACUUM t".  I checked bits with this query:
>        SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
>               to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
> The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
> set).  0,5 is fine after a crash.  1,1 means we've broken our contract: the VM
> bit is set while PD_ALL_VISIBLE is not set.
>
> First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
> postmaster.  The system came back up with 1/1 bits.  I poked around enough to
> see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
> deeper toward a root cause.  If you have trouble reproducing this, let me know
> so I can assemble a complete, self-contained test case.

Thank you for putting these test cases together. As you can probably
tell, I was having difficulty figuring out exactly how to test this.

I think that the problem here is that the sense of that test is
exactly backwards from what it should be. IIUC, the word "precedes"
in the previous comment should in fact say "follows".

> I would delete this comment.  We were done earlier, but we needed to finish the
> critical section.

Done.

> Concerning the optimization in xlog_heap_delete() et al. of not changing the
> page when its LSN is newer -- am I correct that it only ever triggers when
> full_page_writes = off?  Assuming yes ...

I believe that's right.

>> +     /*
>> +      * Even if we skipped the heap page update due to the LSN interlock, it's
>> +      * still safe to update the visibility map.  Any WAL record that clears
>> +      * the visibility map bit does so before checking the page LSN, so any
>> +      * bits that need to be cleared will still be cleared.
>> +      */
>> +     if (record->xl_info & XLR_BKP_BLOCK_1)
>> +             RestoreBkpBlocks(lsn, record, false);
>> +     else
>> +     {
>> +             Relation        reln;
>> +             Buffer          vmbuffer = InvalidBuffer;
>> +
>> +             reln = CreateFakeRelcacheEntry(xlrec->node);
>> +             visibilitymap_pin(reln, xlrec->block, &vmbuffer);
>> +             /* Don't set the bit if replay has already passed this point. */
>> +             if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
>> +                     visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);
>
> ... wouldn't it be better to do this unconditionally?  Some later record will
> unset it if needed, because all replay functions that clear the bit do so
> without consulting the vm page LSN.  On the other hand, the worst consequence
> of the way you've done it is VACUUM visiting the page one more time to set the
> bit.

Hmm, now that I look at it, I think this test is backwards too. I
think you might be right that it wouldn't hurt anything to go ahead
and set it, but I'm inclined to leave it in for now.

> I think it's worth noting, perhaps based on your explanation in the
> second-to-last paragraph of
> http://archives.postgresql.org/message-id/BANLkTi=b7jVmq6fA_EXLCYgzuyV1u9at4A@mail.gmail.com
> that the answer may be incorrect again after the recheck.  We don't care:
> redundant clearings of the visibility bit are no problem.

I added a comment. See what you think.

> Suppose you pin one VM page for a certain candidate heap page, then loop
> around to another candidate that has a different VM page.  I don't see
> anywhere that you ReleaseBuffer() the first vm page; am I missing it?  I
> didn't try to construct a test case, but a relation with a too-small free
> space block below 512 MiB and a sufficiently-large block above 512 MiB should
> tickle that case.  The function header comment should document that vmbuffer,
> if not InvalidBuffer, must be pinned and will be unpinned if modified (or
> whatever other semantics you choose).

visibilitymap_pin() releases the old pin if it decides to switch pages.

> Several lines above are inconsistent on internal tabs/spaces.

Fixed... I hope.

> The two paragraphs ending above are ready to be removed.

Done.

> This could just as well be an Assert.

Maybe, but I'm going to leave it the way it is for now. It's not the
only place in the code where we do stuff like that. The cost of
testing it in a non-assert-enabled build is quite small.

> This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem to
> fit the task too well.  This function doesn't check the pin or that the buffer
> applies to the correct relation.  Consider naming it something like
> `visibilitymap_buffer_covers_block'.

I think that this may be related to the fact that visibilitymap_pin()
doesn't do what you might expect. But that name is pre-existing, so I
kept it, and I don't want the new function's name to sound totally
unrelated to that one.

> Might this be clearer as follows?

Yep, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
visibility-map-v5.patch application/octet-stream 30.9 KB
