Quick Links

Re: Enabling Checksums

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Enabling Checksums
Date:	2012-12-18 20:49:03
Message-ID:	1355863743.24766.196.camel@sussancws0025
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, 2012-12-18 at 08:17 +0000, Simon Riggs wrote:
> I think we should discuss whether we accept my premise? Checksums will
> actually detect more errors than we see now, and people will want to
> do something about that. Returning to backup is one way of handling
> it, but on a busy production system with pressure on, there is
> incentive to implement a workaround, not a fix. It's not an easy call
> to say "we've got 3 corrupt blocks, so I'm going to take the whole
> system offline while I restore from backup".

Up until now, my assumption has generally been that, upon finding the
corruption, the primary course of action is taking that server down
(hopefully you have a good replica), and do some kind of restore or sync
a new replica.

It sounds like you are exploring other possibilities.

> > I suppose we could have a new ReadBufferMaybe function that would only
> > be used by a sequential scan; and then just skip over the page if it's
> > corrupt, depending on a GUC. That would at least allow sequential scans
> > to (partially) work, which might be good enough for some data recovery
> > situations. If a catalog index is corrupted, that could just be rebuilt.
> > Haven't thought about the details, though.
>
> Not sure if you're being facetious here or not.

No. It was an incomplete thought (as I said), but sincere.

> Mild reworking of the
> logic for heap page access could cope with a NULL buffer response and
> subsequent looping, which would allow us to run pg_dump against a
> damaged table to allow data to be saved, keeping file intact for
> further analysis.

Right.

> I'm suggesting we work a little harder than "your block is corrupt"
> and give some thought to what the user will do next. Indexes are a
> good case, because we can/should report the block error, mark the
> index as invalid and then hint that it should be rebuilt.

Agreed; this applies to any derived data.

I don't think it will be very practical to keep a server running in this
state forever, but it might give enough time to reach a suitable
maintenance window.

Regards,
Jeff Davis

In response to

Re: Enabling Checksums at 2012-12-18 08:17:29 from Simon Riggs

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jeff Davis	2012-12-18 20:52:25	Re: Enabling Checksums
Previous Message	Kohei KaiGai	2012-12-18 20:39:24	Re: Review of Row Level Security