Re: Enabling Checksums

From: Jim Nasby <jim(at)nasby(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enabling Checksums
Date: 2013-03-23 04:19:51
Message-ID: 514D2D67.2070404@nasby.net
Lists: pgsql-hackers

I realize Simon relented on this, but FWIW...

On 3/16/13 4:02 PM, Simon Riggs wrote:
> Most other data we store doesn't consist of
> large runs of 0x00 or 0xFF as data. Most data is more complex than
> that, so any runs of 0s or 1s written to the block will be detected.
...

It's not that uncommon for folks to have tables that have a bunch of int[2,4,8]s all in a row, and I'd bet it's not uncommon for a lot of those fields to be zero.
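As a rough illustration of that point, here is a minimal sketch that packs a hypothetical row of one int2, one int4 and three int8 columns and measures the longest run of 0x00 bytes. The column layout, the `longest_zero_run` helper and the use of Python's `struct` module are assumptions for illustration only; real heap tuples also carry a header and alignment padding, which this ignores.

```python
import struct

# Hypothetical row: one int2, one int4 and three int8 columns.
# "<hiqqq" packs little-endian with no padding; this is NOT the real
# PostgreSQL on-disk tuple format (no tuple header, no alignment).
ROW_FORMAT = "<hiqqq"

def longest_zero_run(data: bytes) -> int:
    """Return the length of the longest run of consecutive 0x00 bytes."""
    best = current = 0
    for byte in data:
        current = current + 1 if byte == 0 else 0
        best = max(best, current)
    return best

# Every column zero: the whole 30-byte payload is one run of 0x00.
zero_run = longest_zero_run(struct.pack(ROW_FORMAT, 0, 0, 0, 0, 0))

# Two small non-zero values still leave long zero runs, because the
# high-order bytes of small integers are zero as well.
mixed_run = longest_zero_run(struct.pack(ROW_FORMAT, 1, 0, 0, 7, 0))

print(zero_run, mixed_run)
```

Even with a couple of small non-zero values in the row, most of the bytes are still 0x00, so runs of zeros are ordinary data for this kind of table.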

> Checksums are for detecting problems. What kind of problems? Sporadic
> changes of bits? Or repeated errors. If we were trying to trap
> isolated bit changes then CRC-32 would be appropriate. But I'm
> assuming that whatever causes the problem is going to recur,

That's the opposite of my experience. When we've had corruption events, we will normally have one to several blocks with problems show up essentially all at once. Of course we can't prove that all the corruption happened at exactly the same time, but I believe it's a strong possibility. If it wasn't exactly the same time, it was certainly over a span of minutes to hours... *but* we've never seen new corruption occur after we start an investigation (we frequently wait several hours for the next time we can take an outage without incurring a huge loss in revenue). That we would run for a number of hours with no additional corruption leads me to believe that whatever caused the corruption was essentially a "one-time" [1] event.

[1] One-time except for the fact that there were several periods where we would have corruption occur at 6 or 12 month intervals.
