Re: Enabling Checksums

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Enabling Checksums
Date: 2012-12-18 20:49:03
Message-ID: 1355863743.24766.196.camel@sussancws0025
Lists: pgsql-hackers

On Tue, 2012-12-18 at 08:17 +0000, Simon Riggs wrote:
> I think we should discuss whether we accept my premise? Checksums will
> actually detect more errors than we see now, and people will want to
> do something about that. Returning to backup is one way of handling
> it, but on a busy production system with pressure on, there is
> incentive to implement a workaround, not a fix. It's not an easy call
> to say "we've got 3 corrupt blocks, so I'm going to take the whole
> system offline while I restore from backup".

Up until now, my assumption has generally been that, upon finding the
corruption, the primary course of action is to take that server down
(hopefully you have a good replica) and either do some kind of restore or
sync a new replica.

It sounds like you are exploring other possibilities.

> > I suppose we could have a new ReadBufferMaybe function that would only
> > be used by a sequential scan; and then just skip over the page if it's
> > corrupt, depending on a GUC. That would at least allow sequential scans
> > to (partially) work, which might be good enough for some data recovery
> > situations. If a catalog index is corrupted, that could just be rebuilt.
> > Haven't thought about the details, though.
>
> Not sure if you're being facetious here or not.

No. It was an incomplete thought (as I said), but sincere.

> Mild reworking of the
> logic for heap page access could cope with a NULL buffer response and
> subsequent looping, which would allow us to run pg_dump against a
> damaged table to allow data to be saved, keeping file intact for
> further analysis.

Right.
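
To make the idea concrete, here is a minimal toy sketch of a sequential scan
that tolerates a NULL response from the page reader. Nothing in it is
PostgreSQL code: ToyPage, toy_read_buffer_maybe, toy_seqscan and the
toy_skip_corrupt_pages flag (standing in for the proposed GUC) are
hypothetical names, and the Fletcher-style checksum is only a placeholder for
whatever algorithm the patch ends up using.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Hypothetical page layout: a stored checksum followed by the payload. */
typedef struct ToyPage
{
    uint16_t checksum;
    uint8_t  data[TOY_PAGE_SIZE];
} ToyPage;

/* Stand-in for the GUC discussed above (assumed name). */
static bool toy_skip_corrupt_pages = true;

/* Placeholder Fletcher-16 style checksum, not PostgreSQL's algorithm. */
static uint16_t
toy_checksum(const uint8_t *data, size_t len)
{
    uint16_t a = 0;
    uint16_t b = 0;

    for (size_t i = 0; i < len; i++)
    {
        a = (uint16_t) ((a + data[i]) % 255);
        b = (uint16_t) ((b + a) % 255);
    }
    return (uint16_t) ((b << 8) | a);
}

/* Fill a page with one byte value and stamp a matching checksum. */
static void
toy_page_init(ToyPage *page, uint8_t fill)
{
    memset(page->data, fill, TOY_PAGE_SIZE);
    page->checksum = toy_checksum(page->data, TOY_PAGE_SIZE);
}

/* Return the page, or NULL if its checksum does not verify. */
static ToyPage *
toy_read_buffer_maybe(ToyPage *table, size_t blkno)
{
    ToyPage *page = &table[blkno];

    if (toy_checksum(page->data, TOY_PAGE_SIZE) != page->checksum)
        return NULL;
    return page;
}

/*
 * Sequential scan over npages pages. Returns the number of pages read and
 * stores the number of skipped pages in *nskipped. If a corrupt page is
 * found while skipping is disabled, the scan stops and returns -1.
 */
static long
toy_seqscan(ToyPage *table, size_t npages, size_t *nskipped)
{
    long nread = 0;

    assert(table != NULL && nskipped != NULL);
    *nskipped = 0;

    for (size_t blkno = 0; blkno < npages; blkno++)
    {
        ToyPage *page = toy_read_buffer_maybe(table, blkno);

        if (page == NULL)
        {
            if (!toy_skip_corrupt_pages)
                return -1;      /* behave like a hard error */
            (*nskipped)++;      /* skip the damaged block and keep going */
            continue;
        }
        nread++;                /* a real scan would return tuples here */
    }
    return nread;
}
```

The file itself is never modified, so the damaged block stays available for
later analysis; only the scan's view of it changes.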

> I'm suggesting we work a little harder than "your block is corrupt"
> and give some thought to what the user will do next. Indexes are a
> good case, because we can/should report the block error, mark the
> index as invalid and then hint that it should be rebuilt.

Agreed; this applies to any derived data.
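
As a rough illustration of the report / mark-invalid / hint sequence for an
index, a sketch under stated assumptions follows. ToyIndex,
toy_report_corrupt_index_block and the message wording are all hypothetical;
only the REINDEX INDEX command named in the hint is existing SQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical in-memory descriptor for an index. */
typedef struct ToyIndex
{
    const char *name;
    bool        is_valid;   /* when false, the index would not be used */
} ToyIndex;

/*
 * Called when a block of the index fails verification: mark the index
 * invalid and format a report plus a rebuild hint into msg.
 * Returns the snprintf result (length the full message needs).
 */
static int
toy_report_corrupt_index_block(ToyIndex *index, unsigned blkno,
                               char *msg, size_t msglen)
{
    assert(index != NULL && msg != NULL);

    index->is_valid = false;
    return snprintf(msg, msglen,
                    "WARNING: checksum failure in block %u of index \"%s\"; "
                    "index marked invalid\n"
                    "HINT: Rebuild it with REINDEX INDEX %s;",
                    blkno, index->name, index->name);
}
```

The same shape would fit other derived data: the error is reported once, the
object is taken out of use, and the user is told how to regenerate it.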

I don't think it will be very practical to keep a server running in this
state forever, but it might give enough time to reach a suitable
maintenance window.

Regards,
Jeff Davis
