Re: corrupt pages detected by enabling checksums

From: Jim Nasby <jim(at)nasby(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-05-12 20:40:03
Message-ID: 518FFE23.8070102@nasby.net
Lists: pgsql-hackers

On 5/9/13 5:18 PM, Jeff Davis wrote:
> On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:
>> What about moving some critical data from the beginning of the WAL
>> record to the end? That would make it easier to detect that we don't
>> have a complete record. It wouldn't necessarily replace the CRC
>> though, so maybe that's not good enough.
>>
>> Actually, what if we actually *duplicated* some of the same WAL header
>> info at the end of the record? Given a reasonable amount of data that
>> would damn-near ensure that a torn record was detected, because the
>> odds of having the exact same sequence of random bytes would be so
>> low. Potentially even just duplicating the LSN would suffice.
>
> I think both of these ideas have some false positives and false
> negatives.
>
> If the corruption happens at the record boundary, and wipes out the
> special information at the end of the record, then you might think it
> was not fully flushed, and we're in the same position as today.
>
> If the WAL record is large, and somehow the beginning and the end get
> written to disk but not the middle, then it will look like corruption;
> but really the WAL was just not completely flushed. This seems pretty
> unlikely, but not impossible.
>
> That being said, I like the idea of introducing some extra checks if a
> perfect solution is not possible.

Yeah, I don't think a perfect solution is possible, short of tying directly into the filesystem (i.e., on a journaling FS, having some way to essentially treat the FS journal as WAL).
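
For what it's worth, here's roughly what the "duplicate the LSN at the end of the record" idea from above might look like as a check. This is only a sketch; the struct layout, field names, and footer below are made up for illustration and bear no relation to the real XLogRecord:

    /*
     * Hypothetical torn-record check: the footer at the end of the record
     * repeats the LSN from the header, so an incomplete write is very
     * unlikely to leave matching bytes in place.
     */
    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    typedef uint64_t XLogRecPtr;

    typedef struct
    {
        XLogRecPtr  lsn;        /* where the record starts */
        uint32_t    tot_len;    /* total length: header + payload + footer */
        /* ... rest of the (made-up) header; payload follows ... */
    } FakeRecordHeader;

    typedef struct
    {
        XLogRecPtr  lsn;        /* copy of the header's LSN */
    } FakeRecordFooter;

    static bool
    record_looks_complete(const char *buf, size_t buflen)
    {
        FakeRecordHeader hdr;
        FakeRecordFooter ftr;

        if (buflen < sizeof(hdr))
            return false;
        memcpy(&hdr, buf, sizeof(hdr));

        /* reject obviously bogus lengths before reading the footer */
        if (hdr.tot_len < sizeof(hdr) + sizeof(ftr) || hdr.tot_len > buflen)
            return false;
        memcpy(&ftr, buf + hdr.tot_len - sizeof(ftr), sizeof(ftr));

        return ftr.lsn == hdr.lsn;
    }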

One additional step we might be able to take would be to scan forward, looking for a record that tells us an fsync must have occurred (heck, maybe we should add an fsync WAL record...). If we find a corrupt WAL record followed by evidence of an fsync, we know we've lost data. That closes some of the holes. Actually, it might handle all of them...
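
As a sketch of what that forward scan could decide (the reader interface and the fsync record type here are hypothetical; nothing like them exists in the real xlog reader, and finding the next record boundary after corruption is hand-waved):

    /*
     * Hypothetical scan: after hitting a corrupt record, keep reading.
     * If any later record implies an fsync happened, the corruption cannot
     * be explained as an incomplete flush; real data has been lost, so we
     * should refuse to start up rather than silently truncate the WAL.
     */
    #include <stdbool.h>

    typedef enum { REC_OK, REC_CORRUPT, REC_FSYNC, REC_END } FakeRecStatus;

    /* Hypothetical: returns the next record's status, or REC_END at end of WAL. */
    extern FakeRecStatus read_next_record(void *reader);

    static bool
    corruption_means_data_loss(void *reader)
    {
        FakeRecStatus st;

        while ((st = read_next_record(reader)) != REC_END)
        {
            if (st == REC_FSYNC)
                return true;    /* fsync'd WAL exists beyond the damage */
        }
        return false;           /* looks like an ordinary torn tail */
    }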

>> On the separate write idea, if that could be controlled by a GUC I
>> think it'd be worth doing. Anyone that needs to worry about this
>> corner case probably has hardware that would support that.
>
> It sounds pretty easy to do that naively. I'm just worried that the
> performance will be so bad for so many users that it's not a very
> reasonable choice.
>
> Today, it would probably make more sense to just use sync rep. If the
> master's WAL is corrupt, and it starts up too early, then that should be
> obvious when you try to reconnect streaming replication. I haven't tried
> it, but I'm assuming that it gives a useful error message.

I wonder if there are data warehouse environments that are too large to keep a streaming-replication copy but could still afford the double-write overhead.

BTW, isn't performance what killed the double-buffer idea?
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net
