Re: Enabling Checksums

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enabling Checksums
Date: 2013-03-19 02:13:59
Message-ID: 5147C9E7.4090608@2ndQuadrant.com
Lists: pgsql-hackers

On 3/18/13 5:36 PM, Daniel Farina wrote:
> Clarification, because I think this assessment as delivered feeds some
> unnecessary FUD about EBS:
>
> EBS is quite reliable. Presuming that all noticed corruptions are
> strictly EBS's problem (that's quite a stretch), I'd say the defect
> rate falls somewhere in the range of volume-centuries.

I wasn't trying to flog EBS as any more or less reliable than other
types of storage. What I was trying to emphasize, similarly to your
"quite a stretch" comment, was the uncertainty involved when such
deployments fail. Failures happen for many reasons beyond EBS
itself. But because people are now so far removed from the physical
hardware that fails, it's harder to place the blame correctly when
things go wrong.

A quick example will demonstrate what I mean. Let's say my server at
home dies. There are some terrible log messages, it crashes, and when it
comes back up it's broken. Troubleshooting and possibly replacement
parts follow. I will normally expect an eventual resolution that
includes data like "the drive showed X SMART errors" or "I swapped the
memory with a similar system and the problem followed the RAM". I'll
learn something about what failed that I might use as feedback to adjust
my practices. But an EC2+EBS failure doesn't let you get to the root
cause effectively most of the time, and that makes people nervous.

I can already see "how do checksums alone help narrow the blame?" as the
next question. I'll post something tomorrow summarizing how I use them
for that; I'm just out of juice tonight.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com