Re: crash-safe visibility map, take five

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: crash-safe visibility map, take five
Date: 2011-06-18 13:37:03
Message-ID: 20110618133703.GA1100@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 17, 2011 at 01:21:17PM -0400, Robert Haas wrote:
> On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > I suggest revisiting the suggestion in
> > http://archives.postgresql.org/message-id/27743.1291135210@sss.pgh.pa.us and
> > including a latestRemovedXid in xl_heap_visible. ?The range of risky xids is
> > the same for setting a visibility map bit as for deleting a dead tuple, and
> > the same operation (VACUUM) produces both conflicts.
>
> See Heikki's follow-up email. The standby has to ignore all-visible
> bits anyway, because the fact that a transaction is all-visible on the
> master does not imply that it is all-visible on the standby. So I
> don't think there's a problem here.

Indeed. If we conflicted on the xmin of the most-recent all-visible tuple
when setting the bit, all-visible on the standby would become a superset of
all-visible on the master, and we could cease to ignore the bits. This is an
excellent trade-off for masters that do UPDATE and DELETE, because they're
already conflicting (or using avoidance measures) on similar xids at VACUUM
time. (Some INSERT-only masters will disfavor the trade-off, but we could
placate them with a GUC. On the other hand, hot_standby_feedback works and
costs an INSERT-only master so little.)

> > I proceeded to do some immediate-shutdown tests to see if we get the bits set
> > as expected. ?I set up a database like this:
> > ? ? ? ?CREATE EXTENSION pageinspect;
> > ? ? ? ?CREATE TABLE t (c int);
> > ? ? ? ?INSERT INTO t VALUES (2);
> > ? ? ? ?CHECKPOINT;
> > I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
> > with "VACUUM t". ?I checked bits with this query:
> > ? ? ? ?SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
> > ? ? ? ? ? ? ? to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
> > The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
> > set). ?0,5 is fine after a crash. ?1,1 means we've broken our contract: the VM
> > bit is set while PD_ALL_VISIBLE is not set.
> >
> > First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
> > postmaster. ?The system came back up with 1/1 bits. ?I poked around enough to
> > see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
> > deeper toward a root cause. ?If you have trouble reproducing this, let me know
> > so I can assemble a complete, self-contained test case.
>
> Thank you for putting these test cases together. As you can probably
> tell, I was having difficulty figuring out exactly how to test this.

No problem. I ran these test cases with the new patch:

-Description- -Expected bits- -Result-
set bit, VACUUM commit, crash 1,5 1,5
set bit, crash 0,1 0,1
set bit, flush heap page, crash 0,5 0,5
set bit, flush vm page, crash 1,5 1,5

xlog flush request locations look correct, too. Overall, looking good. Do
you know of other notable cases to check? The last two were somewhat tricky
to inject; if anyone wishes to reproduce them, I'm happy to share my recipe.

One more thing came to mind: don't we need a critical section inside the "if"
block in visibilitymap_set()? I can't see anything in there that could
elog(ERROR, ...), but my general impression is that we use one anyway.

I ran one repetition of my benchmark from last report, and the amount of WAL
has not changed. Given that, I think the previous runs remain valid.

> I think that the problem here is that the sense of that test is
> exactly backwards from what it should be. IIUC, the word "precedes"
> in the previous comment should in fact say "follows".

Ah; quite so.

> >> + ? ? ? ? ? ? /* Don't set the bit if replay has already passed this point. */
> >> + ? ? ? ? ? ? if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer))))
> >> + ? ? ? ? ? ? ? ? ? ? visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);
> >
> > ... wouldn't it be better to do this unconditionally? ?Some later record will
> > unset it if needed, because all replay functions that clear the bit do so
> > without consulting the vm page LSN. ?On the other hand, the worst consequence
> > of the way you've done it is VACUUM visiting the page one more time to set the
> > bit.
>
> Hmm, now that I look at it, I think this test is backwards too. I
> think you might be right that it wouldn't hurt anything to go ahead
> and set it, but I'm inclined to leave it in for now.

Okay. Consider expanding the comment to note that this is out of conservatism
rather than known necessity.

> > I think it's worth noting, perhaps based on your explanation in the
> > second-to-last paragraph of
> > http://archives.postgresql.org/message-id/BANLkTi=b7jVmq6fA_EXLCYgzuyV1u9at4A@mail.gmail.com
> > that the answer may be incorrect again after the recheck. ?We don't care:
> > redundant clearings of the visibility bit are no problem.
>
> I added a comment. See what you think.

That should help.

> > Several lines above are inconsistent on internal tabs/spaces.
>
> Fixed... I hope.

Probably not worth mentioning, but there remains one space instead of one tab
after "visibilitymap_clear" in this line:
* visibilitymap_clear - clear a bit in the visibility map

FWIW, I just do C-s TAB in emacs and then scan for places where the boundaries
of the highlighted regions fail to line up vertically.

> > This could just as well be an Assert.
>
> Maybe, but I'm going to leave it the way it is for now. It's not the
> only place in the code where we do stuff like that. The cost of
> testing it in a non-assert-enabled build is quite small.

Okay.

> > This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem to
> > fit the task too well. ?This function doesn't check the pin or that the buffer
> > applies to the correct relation. ?Consider naming it something like
> > `visibilitymap_buffer_covers_block'.
>
> I think that this may be related to the fact that visibilitymap_pin()
> doesn't do what you might expect. But that name is pre-existing, so I
> kept it, and I don't really want this to be something that sounds
> totally unrelated.

Ah, so visibilitymap_pin_ok() answers the question "Will visibilitymap_pin()
be a long-winded no-op given these arguments?"

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2011-06-18 13:39:32 Re: Re: [COMMITTERS] pgsql: Don't use "cp -i" in the example WAL archive_command.
Previous Message Bruce Momjian 2011-06-18 13:34:58 Re: pg_upgrade using appname to lock out other users