Re: corrupt pages detected by enabling checksums

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-05-01 15:33:11
Message-ID: CA+TgmoZWeFxu81MUPvbDqgDRzcggUSL71XGOkWgrN93jEFmW2g@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I was worried because SyncOneBuffer checks whether it needs writing
>> without taking a content lock, so the exclusive lock doesn't help. That
>> makes sense, because you don't want a checkpoint to have to get a
>> content lock on every buffer in the buffer pool. But it also means we
>> need to follow the rules laid out in transam/README and dirty the pages
>> before writing WAL.
>
> On further review, I think you're right about this. We'd have a
> problem if the following sequence of events were to occur:
>
> 1. vacuum writes the WAL record, but does not yet mark the buffer
> dirty, and then goes into the tank
> 2. checkpointer decides where the checkpoint will begin
> 3. buffer pool is scanned as part of the checkpoint process, observing
> target buffer not to be dirty
> 4. vacuum finally wakes up and marks the buffer dirty
> 5. crash, after WAL is flushed but before buffer is written
>
> However, on looking at this a little more, shouldn't we be doing
> log_heap_clean() *before* we do visibilitymap_set()?

Hit send too soon.

To finish that thought: right now the order of operations is...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the visibility map change.
4. Log the cleanup operations.

It seems to me that it would be better to do (4) as close to (1) as
possible. Isn't it bad if the operations are replayed in an order
that differs from the order in which they were performed - or if, say,
the first WAL record were replayed but the second were not, for
whatever reason?
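
Concretely, the ordering argued for here would look something like the following. This is only a sketch of the pattern src/backend/access/transam/README describes, not the code from the patch; heapbuf, rdata, and recptr are placeholder variables, and construction of the WAL record data is elided:

    START_CRIT_SECTION();

    /* (1) perform the cleanup operations on the heap page in shared buffers */

    /*
     * Per transam/README, the buffer must be marked dirty *before* the WAL
     * record is inserted: SyncOneBuffer() tests the dirty flag without a
     * content lock, so dirtying it afterwards allows the checkpoint race
     * described above.
     */
    MarkBufferDirty(heapbuf);

    /* (4), moved up: log the cleanup while the content lock is still held */
    recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEAN, rdata);
    PageSetLSN(BufferGetPage(heapbuf), recptr);

    END_CRIT_SECTION();

    /* (2) and (3): only afterwards set and WAL-log the visibility map bit */

That keeps the cleanup record adjacent to the cleanup itself, so replay cannot apply the visibility map change without also having applied the cleanup it depends on.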

>>> The effect of the change to lazy_scan_heap is to force the buffer to
>>> be written even if we're only updating the visibility map page.
>>> That's a bad idea and should be reverted.
>>
>> The only time the VM and the data page are out of sync during vacuum is
>> after a crash, right? If that's the case, I didn't think it was a big
>> deal to dirty one extra page (should be extremely rare). Am I missing
>> something?
>>
>> The reason I removed that special case was just code
>> complexity/readability. I tried preserving the previous behavior, and
>> it's not so bad, but it seemed unnecessarily ugly for the benefit of a
>> rare case.
>
> Well, I don't find it that ugly, but then again

...I've done a fair amount of hacking on this code. The scenario that
I find problematic here is that you could easily lose a couple of
visibility map pages in a crash. Each one you lose, with this patch,
potentially involves half a gigabyte of unnecessary disk writes, if
the relation is being vacuumed at the time. For the few extra lines
of code it takes, I think we should protect against that scenario.
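
For scale, the half-a-gigabyte figure follows directly from the visibility map layout: one bit per heap page, so a single 8 kB map page covers on the order of 64K heap pages. A quick back-of-the-envelope check, assuming the default 8 kB block size and the usual 24-byte page header (constants written out here rather than taken from the headers):

    #include <stdio.h>

    int main(void)
    {
        const long blcksz = 8192;                  /* default BLCKSZ */
        const long vm_usable_bytes = blcksz - 24;  /* VM page minus page header */
        const long heap_pages_per_vm_page = vm_usable_bytes * 8;  /* one bit each */
        const long long heap_bytes = (long long) heap_pages_per_vm_page * blcksz;

        /* roughly 65,000 heap pages, i.e. about 510 MB of heap per VM page */
        printf("one VM page covers %ld heap pages (~%lld MB)\n",
               heap_pages_per_vm_page, heap_bytes / (1024 * 1024));
        return 0;
    }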

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
