Re: Block-level CRC checks

From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block-level CRC checks
Date: 2008-11-17 08:52:48
Message-ID: 7CE61C21-DA8B-4C7F-AC77-1E3B76E3BB0D@enterprisedb.com
Lists: pgsql-hackers

[sorry for top-posting - damn phone]

I thought of saying that too but it doesn't really solve the problem.
Think of what happens if someone sets a hint bit on a dirty page.

greg

On 17 Nov 2008, at 08:26 AM, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Martijn van Oosterhout wrote:
>> On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote:
>>> In fact, if the patch were to break torn-page handling, it would be
>>> 100% likely to be a net *decrease* in system reliability. It would add
>>> detection of a situation that is not supposed to happen (ie, storage
>>> system fails to return the same data it stored) at the cost of breaking
>>> one's database when the storage system acts as it's expected and
>>> documented to in a routine power-loss situation.
>> Ok, I see it's a problem because the hint changes are not WAL logged,
>> so torn pages are expected to work in normal operation. But simply
>> skipping the hint bits during checksumming is a terrible solution,
>> since then any errors in those bits will go undetected. To not be able
>> to say in the documentation that you'll detect 100% of single-bit
>> errors is pretty darn terrible, since that's kind of the goal of the
>> exercise.
>
> Agreed, trying to explain that in the documentation would look like
> making excuses.
>
> The requirement that all hint bit changes are WAL-logged seems like
> a pretty big change. I don't like doing that, just for CRCing.
>
> There has been discussion before about not writing out pages to disk
> that only have hint-bit updates on them. That means that the next
> time the page is read, the reader needs to do the clog lookups and
> set the hint bits again. It's a tradeoff, making the first SELECT
> after modifying a page cheaper, I/O-wise, at the cost of making all
> subsequent SELECTs that need to read the page from disk or kernel
> cache more expensive, CPU-wise.
>
> I'm not sure if I like that idea or not, but it would also solve the
> CRC problem with torn pages. FWIW, it would also solve the problem
> suggested with IBM DTLA disks and others that might zero-out a
> sector in case of an interrupted write. I'm not totally convinced
> that's a problem, as there's apparently other software that makes the
> same assumption as we do, and we haven't heard of any torn-page
> corruption in real life, but still.
>
> If we made the behavior configurable, that would be pretty hard to
> explain in the docs. We'd have three options with dependencies
>
> - CRC on/off
> - write pages with only hint bit changes on/off
> - full_page_writes on/off
>
> If you disable full_page_writes, you're vulnerable to torn pages. If you
> enable it, you're not. Except if you also turn CRC on. Except if you
> also turn "write pages with only hint bit changes" off.
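
[Illustrative sketch, not part of the original message. It encodes the
dependency between the three proposed options exactly as the paragraph
above states it, to show why it would be hard to document. The function
and parameter names are hypothetical, and the top-posted reply disputes
whether the last exception really holds.]

```python
from itertools import product


def torn_page_safe(crc: bool, write_hint_only_pages: bool,
                   full_page_writes: bool) -> bool:
    """Return True if, per the rule quoted above, torn pages are harmless.

    All three parameter names are hypothetical stand-ins for the options
    listed in the message, not real PostgreSQL settings (apart from
    full_page_writes).
    """
    if not full_page_writes:
        # "If you disable full_page_writes, you're vulnerable to torn pages."
        return False
    if not crc:
        # "If you enable it, you're not."
        return True
    # "Except if you also turn CRC on. Except if you also turn
    # 'write pages with only hint bit changes' off."
    return not write_hint_only_pages


if __name__ == "__main__":
    # Print the full truth table for the three options.
    for crc, hint, fpw in product([False, True], repeat=3):
        print(f"crc={crc!s:5} write_hint_only={hint!s:5} "
              f"full_page_writes={fpw!s:5} -> safe={torn_page_safe(crc, hint, fpw)}")
```

Of the eight combinations, three come out safe under this rule, and which
three is not obvious from the option names alone.
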
>
>> Unfortunately, there's not a lot of easy solutions here. You could do
>> two checksums, one with and one without hint bits. The overall checksum
>> tells you if there's a problem. If it doesn't match, the second checksum
>> will tell you if it's the hint bits or not (torn page problem). If it's
>> the hint bits you can reset them all and continue. The checksums need
>> not be of equal strength.
>
> Hmm, that would work I guess.
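
[Illustrative sketch, not part of the original message. It models the
two-checksum idea quoted above on a toy page: one CRC over the whole page
and one over the page with hint bits masked out. The page layout, the
hint_mask representation, and all function names are assumptions made for
the example; CRC-32 from zlib is used for both checksums only for
simplicity.]

```python
import zlib


def mask_hints(page: bytes, hint_mask: bytes) -> bytes:
    """Return a copy of page with every bit flagged in hint_mask cleared."""
    assert len(page) == len(hint_mask)
    return bytes(b & ~m & 0xFF for b, m in zip(page, hint_mask))


def seal(page: bytes, hint_mask: bytes) -> tuple[int, int]:
    """Compute (checksum over everything, checksum ignoring hint bits)."""
    return zlib.crc32(page), zlib.crc32(mask_hints(page, hint_mask))


def classify(page: bytes, hint_mask: bytes, full_crc: int, masked_crc: int) -> str:
    """Classify a page read back from storage against its stored checksums."""
    if zlib.crc32(page) == full_crc:
        return "ok"
    if zlib.crc32(mask_hints(page, hint_mask)) == masked_crc:
        # Only hint bits differ: they can be reset and recomputed later.
        return "hint_bits_only"
    return "corrupt"


if __name__ == "__main__":
    # Toy 16-byte page; the low nibble of bytes 3 and 11 holds "hint bits".
    page = bytes(range(16))
    hint_mask = bytes(0x0F if i in (3, 11) else 0x00 for i in range(16))
    full_crc, masked_crc = seal(page, hint_mask)

    hint_flip = bytearray(page)
    hint_flip[3] ^= 0x01          # stale hint bit, as after a torn write
    data_flip = bytearray(page)
    data_flip[5] ^= 0x01          # single-bit error in real data

    print(classify(page, hint_mask, full_crc, masked_crc))
    print(classify(bytes(hint_flip), hint_mask, full_crc, masked_crc))
    print(classify(bytes(data_flip), hint_mask, full_crc, masked_crc))
```

In the "hint_bits_only" case the reader would clear the masked bits with
mask_hints() and carry on. The sketch does not cover a damaged hint bit on
an otherwise intact page, which it cannot tell apart from a torn write.
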
>
>> The extreme case is an ECC where you explicitly can set it so you can
>> alter N bits before you need to recalculate the checksum.
>> Computationally though, that sucks.
>
> Yep. Also, in case of a torn page, you're very likely going to have
> several hint bits from the old image and several from the new image.
> An error-correcting code would need to be unfeasibly long to cope
> with that.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
