Re: Page Checksums

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <greg(at)2ndQuadrant(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-19 19:16:12
Message-ID: 4EEF391C0200002500043DF8@gw.wicourts.gov
Lists: pgsql-hackers

Greg Smith <greg(at)2ndQuadrant(dot)com> wrote:

> 2) Rework hint bits to make the torn page problem go away.
> Checksums go elsewhere? More WAL logging to eliminate the bad
> situations? Eliminate some types of hint bit writes? It seems
> every alternative has trade-offs that will require serious
> performance testing to really validate.

I'm wondering whether we're not making a mountain out of a mole-hill
here. In real life, on one single crash, how many torn pages with
hint-bit-only updates do we expect on average? What's the maximum
possible? In the event of a crash recovery, can we force all tables
to be seen as needing autovacuum? Would there be a way to limit
this to some subset which *might* have torn pages somehow?

It seems to me that on a typical production system you would
probably have zero or one such page per OS crash, with zero being
far more likely than one. If we can get that one fixed (if it
exists) before enough time has elapsed for everyone to forget the OS
crash, the idea that we would be scaring the users and negatively
affecting the perception of reliability seems far-fetched. The fact
that they can *have* page checksums in PostgreSQL should do a lot to
*enhance* the PostgreSQL reputation for reliability in some circles,
especially those getting pounded with FUD from competing products.
If a site has so many OS or hardware failures that they lose track
-- well, they really should be alarmed.

Of course, the fact that you may hit such a torn page in a situation
where all data is good means that it shouldn't be more than a
warning.

This seems as though it eliminates most of the work people have been
suggesting as necessary, and makes the submitted patch fairly close
to what we want.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-19 19:29:24
Message-ID: CA+TgmoaZOTQ=xNhPJyRASTX66EYvO5zHonAqbiMCt_ZYzNt6Lg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 19, 2011 at 2:16 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> It seems to me that on a typical production system you would
> probably have zero or one such page per OS crash, with zero being
> far more likely than one.  If we can get that one fixed (if it
> exists) before enough time has elapsed for everyone to forget the OS
> crash, the idea that we would be scaring the users and negatively
> affecting the perception of reliability seems far-fetched.

The problem is that you can't "fix" them. If you come to a page with
a bad CRC, you only have two choices: take it seriously, or don't. If
you take it seriously, then you're complaining about something that
may be completely benign. If you don't take it seriously, then you're
ignoring something that may be a sign of data corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <greg(at)2ndquadrant(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-19 19:44:43
Message-ID: 4EEF3FCB0200002500043E0F@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Dec 19, 2011 at 2:16 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> It seems to me that on a typical production system you would
>> probably have zero or one such page per OS crash, with zero being
>> far more likely than one. If we can get that one fixed (if it
>> exists) before enough time has elapsed for everyone to forget the
>> OS crash, the idea that we would be scaring the users and
>> negatively affecting the perception of reliability seems
>> far-fetched.
>
> The problem is that you can't "fix" them. If you come to a page
> with a bad CRC, you only have two choices: take it seriously, or
> don't. If you take it seriously, then you're complaining about
> something that may be completely benign. If you don't take it
> seriously, then you're ignoring something that may be a sign of
> data corruption.

I was thinking that we would warn when such was found, set hint bits
as needed, and rewrite with the new CRC. In the unlikely event that
it was a torn hint-bit-only page update, it would be a warning about
something which is a benign side-effect of the OS or hardware crash.
The argument was that it could happen months later, and people
might not remember the crash. My response to that is: don't let it
wait that long. By forcing a vacuum of all possibly-affected tables
(or all tables if there's no way to rule any of them out), you
keep it within recent memory.
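
To make that concrete, the read-path behavior I'm picturing is roughly
this, shown as a stand-alone toy (the page layout, the function names,
and the use of zlib's crc32() are placeholders of mine, not anything
from the submitted patch):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>                  /* crc32(); build with -lz */

  #define PAGE_SIZE 8192

  /* Toy page layout: the stored CRC sits in the first 4 bytes. */
  typedef struct
  {
      uint32_t      stored_crc;
      unsigned char data[PAGE_SIZE - sizeof(uint32_t)];
  } ToyPage;

  static uint32_t
  page_crc(const ToyPage *p)
  {
      /* CRC covers everything except the stored CRC field itself. */
      return (uint32_t) crc32(0L, p->data, sizeof(p->data));
  }

  /* Stand-in for "set whatever hint bits can legally be set". */
  static void
  set_hint_bits(ToyPage *p)
  {
      (void) p;                      /* real code would walk the tuples */
  }

  /*
   * On read: if the checksum matches, we're done.  If not, warn, freshen
   * the hint bits, and recompute the CRC so the page can be rewritten
   * in a state that matches its contents.
   */
  static void
  check_page_on_read(ToyPage *p, unsigned blkno)
  {
      if (page_crc(p) == p->stored_crc)
          return;

      fprintf(stderr,
              "WARNING: checksum mismatch on block %u; may be a benign "
              "torn hint-bit-only write from an earlier crash\n", blkno);

      set_hint_bits(p);
      p->stored_crc = page_crc(p);   /* caller marks the buffer dirty */
  }

  int
  main(void)
  {
      ToyPage p;

      memset(&p, 0, sizeof(p));
      p.stored_crc = page_crc(&p);   /* start out consistent  */
      p.data[100] ^= 0x01;           /* simulate a flipped bit */
      check_page_on_read(&p, 42);    /* warns, then repairs    */
      return 0;
  }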

Of course, it would also make sense to document that such an error
after an OS or hardware crash might be benign or may indicate data
corruption or data loss, and give advice on what to do. There is
obviously no way for PostgreSQL to automagically "fix" real
corruption flagged by a CRC failure, under any circumstances.
There's also *always* a possibility that a CRC error is a false
positive -- if only the bytes in the CRC were damaged. We're
talking quantitative changes here, not qualitative.

I'm arguing that the extreme measures suggested to achieve the
slight quantitative improvements are likely to cause more problems
than they solve. A better use of resources to improve the false
positive numbers would be to be more aggressive about setting hint
bits -- perhaps when a page is written with any tuples with
transaction IDs before the global xmin, the hint bits should be set
and the CRC calculated before write, for example. (But that would
be a different patch.)

-Kevin


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-19 20:12:46
Message-ID: 4EEF9ABE.4060305@2ndQuadrant.com
Lists: pgsql-hackers

On 12/19/2011 02:44 PM, Kevin Grittner wrote:
> I was thinking that we would warn when such was found, set hint bits
> as needed, and rewrite with the new CRC. In the unlikely event that
> it was a torn hint-bit-only page update, it would be a warning about
> something which is a benign side-effect of the OS or hardware crash.
> The argument was that it could happen months later, and people
> might not remember the crash. My response to that is: don't let it
> wait that long. By forcing a vacuum of all possibly-affected tables
> (or all tables if there's no way to rule any of them out), you
> keep it within recent memory.

Cleanup that requires a VACUUM of potentially unbounded size to sort out
doesn't sound like a great path to wander down. Ultimately any CRC
implementation is going to want a "scrubbing" feature like those found
in RAID arrays, one that wanders through all database pages looking for
literal bitrot. And pushing priority requests for things to check to the
top of its queue may end up being a useful feature there. But if you
need all that infrastructure just to get the feature launched, that's a
bit hard to stomach.
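
To be concrete about what I mean by a scrubber, its core loop might look
something like this (purely a sketch with invented names, not code
anyone has written):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define TOTAL_BLOCKS 1024          /* pretend size of the cluster */
  #define PRIO_MAX       16

  /* Tiny stack of "please check this block next" requests. */
  static uint32_t prio_req[PRIO_MAX];
  static int      prio_count;

  static void
  prio_push(uint32_t blkno)
  {
      if (prio_count < PRIO_MAX)
          prio_req[prio_count++] = blkno;
  }

  static bool
  prio_pop(uint32_t *blkno)
  {
      if (prio_count == 0)
          return false;
      *blkno = prio_req[--prio_count];
      return true;
  }

  /* Stand-in for "read the block and verify its checksum". */
  static bool
  block_checksum_ok(uint32_t blkno)
  {
      return blkno != 123;           /* pretend block 123 has bitrot */
  }

  int
  main(void)
  {
      uint32_t cursor = 0;           /* background sequential sweep  */

      prio_push(123);                /* a block somebody complained about */

      for (int tick = 0; tick < 10; tick++)
      {
          uint32_t blkno;

          if (!prio_pop(&blkno))     /* priority requests jump the queue */
              blkno = cursor++ % TOTAL_BLOCKS;
          if (!block_checksum_ok(blkno))
              printf("scrub: checksum failure at block %u\n",
                     (unsigned) blkno);
      }
      return 0;
  }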

Also, as someone who follows Murphy's Law as my chosen religion, I would
expect this situation could be exactly how flaky hardware would first
manifest itself: server crash and a bad CRC on the last thing written
out. And in that case, the last thing you want to do is assume things
are fine, then kick off a VACUUM that might overwrite more good data
with bad. These bizarre, "that should never happen" cases are the
ones I'd most like to see more protection against, rather than excusing
them and going on anyway.

> There's also *always* a possibility that a CRC error is a false
> positive -- if only the bytes in the CRC were damaged. We're
> talking quantitative changes here, not qualitative.

The main way I expect to validate this sort of thing is with an as yet
unwritten function to grab information about a data block from a standby
server for this purpose, something like this:

Master: Computed CRC A, Stored CRC B; error raised because A!=B
Standby: Computed CRC C, Stored CRC D

If C==D && A==C, the corruption is probably overwritten bits of the CRC B.
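
Spelled out as code, the decision that comparison supports would go
roughly like this; the branch shown above is the interesting one, and
the other outcomes are only my guesses at the obvious extensions:

  #include <stdint.h>
  #include <stdio.h>

  /*
   * A: CRC computed over the master's page image
   * B: CRC stored on the master's page (A != B raised the error)
   * C: CRC computed over the standby's copy of the page
   * D: CRC stored on the standby's copy
   */
  static const char *
  classify(uint32_t A, uint32_t B, uint32_t C, uint32_t D)
  {
      if (A == B)
          return "no error on the master; nothing to do";
      if (C == D && A == C)
          return "standby consistent and matches the master's data: "
                 "probably just the stored CRC B that got clobbered";
      if (C == D && A != C)
          return "standby consistent but the data differs: "
                 "the master's page contents are suspect";
      return "standby copy inconsistent too: needs manual investigation";
  }

  int
  main(void)
  {
      /* the case described above: A != B, C == D, A == C */
      printf("%s\n", classify(0x1234, 0x9999, 0x1234, 0x1234));
      return 0;
  }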

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <greg(at)2ndQuadrant(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-19 23:14:16
Message-ID: 4EEF70E80200002500043E37@gw.wicourts.gov
Lists: pgsql-hackers

Greg Smith <greg(at)2ndQuadrant(dot)com> wrote:

> But if you need all that infrastructure just to get the feature
> launched, that's a bit hard to stomach.

Triggering a vacuum or some hypothetical "scrubbing" feature?

> Also, as someone who follows Murphy's Law as my chosen religion,

If you don't think I pay attention to Murphy's Law, I should recap
our backup procedures -- which involve three separate forms of
backup, each to multiple servers in different buildings, real-time,
plus idle-time comparison of the databases of origin to all replicas
with reporting of any discrepancies. And off-line "snapshot"
backups on disk at a records center controlled by a different
department. That's in addition to RAID redundancy and hardware
health and performance monitoring. Some people think I border on
the paranoid on this issue.

> I would expect this situation could be exactly how flaky hardware
> would first manifest itself: server crash and a bad CRC on the
> last thing written out. And in that case, the last thing you want
> to do is assume things are fine, then kick off a VACUUM that might
> overwrite more good data with bad.

Are you arguing that autovacuum should be disabled after crash
recovery? I guess if you are arguing that a database VACUUM might
destroy recoverable data when hardware starts to fail, I can't
argue. And certainly there are way too many people who don't ensure
that they have a good backup before firing up PostgreSQL after a
failure, so I can see not making autovacuum more aggressive than
usual, and perhaps even disabling it until there is some sort of
confirmation (I have no idea how) that a backup has been made. That
said, a database VACUUM would be one of my first steps after
ensuring that I had a copy of the data directory tree, personally.
I guess I could even live with that as recommended procedure rather
than something triggered through autovacuum and not feel that the
rest of my posts on this are too far off track.

> The main way I expect to validate this sort of thing is with an as
> yet unwritten function to grab information about a data block from
> a standby server for this purpose, something like this:
>
> Master: Computed CRC A, Stored CRC B; error raised because A!=B
> Standby: Computed CRC C, Stored CRC D
>
> If C==D && A==C, the corruption is probably overwritten bits of
> the CRC B.

Are you arguing we need *that* infrastructure to get the feature
launched?

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-20 13:36:27
Message-ID: CA+Tgmoau8hcHgZe0jrvp0eM4+DzzLTagt692SiOdo0ehpbx2Bw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> I was thinking that we would warn when such was found, set hint bits
> as needed, and rewrite with the new CRC.  In the unlikely event that
> it was a torn hint-bit-only page update, it would be a warning about
> something which is a benign side-effect of the OS or hardware crash.

But that's terrible. Surely you don't want to tell people:

WARNING: Your database is corrupted, or maybe not. But don't worry,
I modified the data block so that you won't get this warning again.

OK, I guess I'm not sure that you don't want to tell people that. But
*I* don't!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-20 17:12:56
Message-ID: CAFNqd5WxiHtoF1DpdJ+nGzi=UXdsb0jXA90Jt2doX4aH5doArQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 20, 2011 at 8:36 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> I was thinking that we would warn when such was found, set hint bits
>> as needed, and rewrite with the new CRC.  In the unlikely event that
>> it was a torn hint-bit-only page update, it would be a warning about
>> something which is a benign side-effect of the OS or hardware crash.
>
> But that's terrible.  Surely you don't want to tell people:
>
> WARNING:  Your database is corrupted, or maybe not.  But don't worry,
> I modified the data block so that you won't get this warning again.
>
> OK, I guess I'm not sure that you don't want to tell people that.  But
> *I* don't!

This seems to be a frequent problem with this whole "doing CRCs on pages" thing.

It's not evident which problems will be "real" ones. And in such
cases, is the answer to turf the database and recover from backup,
because of a single busted page? For a big database, I'm not sure
that's less scary than the possibility of one page having a
corruption.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, greg <greg(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-20 17:32:13
Message-ID: 1324402248-sup-6024@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from Christopher Browne's message of mar dic 20 14:12:56 -0300 2011:

> It's not evident which problems will be "real" ones. And in such
> cases, is the answer to turf the database and recover from backup,
> because of a single busted page? For a big database, I'm not sure
> that's less scary than the possibility of one page having a
> corruption.

I don't think the problem is having one page of corruption. The problem
is *not knowing* that random pages are corrupted, and living in the fear
that they might be.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <greg(at)2ndquadrant(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-20 17:36:32
Message-ID: 4EF073400200002500043E80@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> I was thinking that we would warn when such was found, set hint
>> bits as needed, and rewrite with the new CRC. In the unlikely
>> event that it was a torn hint-bit-only page update, it would be a
>> warning about something which is a benign side-effect of the OS
>> or hardware crash.
>
> But that's terrible. Surely you don't want to tell people:
>
> WARNING: Your database is corrupted, or maybe not. But don't
> worry, I modified the data block so that you won't get this
> warning again.
>
> OK, I guess I'm not sure that you don't want to tell people that.
> But *I* don't!

Well, I would certainly change that to comply with standard message
style guidelines. ;-)

But the alternatives I've heard so far bother me more. It sounds
like the most-often suggested alternative is:

ERROR (or stronger?): page checksum failed in relation 999 page 9
DETAIL: This may not actually affect the validity of any tuples,
since it could be a flipped bit in the checksum itself or dead
space, but we're shutting you down just in case.
HINT: You won't be able to read anything on this page, even if it
appears to be well-formed, without stopping your database and using
some arcane tool you've never heard of before to examine and
hand-modify the page. Any query which accesses this table may fail
in the same way.

The warning-level message would be followed by something more severe
if the page or a needed tuple is mangled in a way that prevents it from
being used. I guess the biggest risk here is that there is real damage
to data which doesn't generate a stronger response, and the users
are ignoring warning messages. I'm not sure what to do about that,
but the above error doesn't seem like the right solution.

Assuming we do something about the "torn page on hint-bit-only
write" issue, by moving the hint bits somewhere else or logging
their writes, what would you suggest is the right thing to do when a
page is read with a checksum which doesn't match page contents?

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>
Cc: "greg" <greg(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-20 17:38:44
Message-ID: 4EF073C40200002500043E86@gw.wicourts.gov
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Christopher Browne's message of mar dic 20 14:12:56
> -0300 2011:
>
>> It's not evident which problems will be "real" ones. And in such
>> cases, is the answer to turf the database and recover from
>> backup, because of a single busted page? For a big database, I'm
>> not sure that's less scary than the possibility of one page
>> having a corruption.
>
> I don't think the problem is having one page of corruption. The
> problem is *not knowing* that random pages are corrupted, and
> living in the fear that they might be.

What would you want the server to do when a page with a mismatching
checksum is read?

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>, "greg" <greg(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Subject: Re: Page Checksums
Date: 2011-12-20 17:40:56
Message-ID: 201112201840.56440.andres@anarazel.de
Lists: pgsql-hackers

On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> > Excerpts from Christopher Browne's message of mar dic 20 14:12:56
> >
> > -0300 2011:
> >> It's not evident which problems will be "real" ones. And in such
> >> cases, is the answer to turf the database and recover from
> >> backup, because of a single busted page? For a big database, I'm
> >> not sure that's less scary than the possibility of one page
> >> having a corruption.
> >
> > I don't think the problem is having one page of corruption. The
> > problem is *not knowing* that random pages are corrupted, and
> > living in the fear that they might be.
>
> What would you want the server to do when a page with a mismatching
> checksum is read?
Follow the behaviour of zero_damaged_pages.

Andres


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Christopher Browne <cbbrowne(at)gmail(dot)com>, greg <greg(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-20 17:59:31
Message-ID: CAC_2qU8xZvc9D_aBuOp8RZQjeGLd=+Ha5yD2MNy9WDJVXR=bQg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 20, 2011 at 12:38 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>> I don't think the problem is having one page of corruption.  The
>> problem is *not knowing* that random pages are corrupted, and
>> living in the fear that they might be.
>
> What would you want the server to do when a page with a mismatching
> checksum is read?

But that's exactly the problem. I don't know what I want the server
to do, because I don't know if the page with the checksum mismatch is
one of the 10GB of pages in the page cache that were dirty and pose zero
risk (i.e. hint-bit-only changes made it dirty), a page that was
really messed up in the kernel panic that caused this
whole mess, or an even older page that really is suffering bitrot...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>, "greg" <greg(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Subject: Re: Page Checksums
Date: 2011-12-20 18:08:56
Message-ID: 9968.1324404536@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:
>> What would you want the server to do when a page with a mismatching
>> checksum is read?

> Follow the behaviour of zero_damaged_pages.

Surely not. Nobody runs with zero_damaged_pages turned on in
production; or at least, nobody with any semblance of a clue.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Christopher Browne" <cbbrowne(at)gmail(dot)com>, "greg" <greg(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Subject: Re: Page Checksums
Date: 2011-12-20 18:13:54
Message-ID: 201112201913.54472.andres@anarazel.de
Lists: pgsql-hackers

On Tuesday, December 20, 2011 07:08:56 PM Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:
> >> What would you want the server to do when a page with a mismatching
> >> checksum is read?
> >
> > Follow the behaviour of zero_damaged_pages.
>
> Surely not. Nobody runs with zero_damaged_pages turned on in
> production; or at least, nobody with any semblance of a clue.
That's my point. There is no automated solution for page errors. So it should
ERROR (not PANIC) out in normal operation and be "fixable" via
zero_damaged_pages.
I personally wouldn't even have a problem making zero_damaged_pages only
applicable in single backend mode.

Andres


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-21 07:01:16
Message-ID: 4EF1843C.6090101@2ndQuadrant.com
Lists: pgsql-hackers

On 12/19/2011 06:14 PM, Kevin Grittner wrote:
>> But if you need all that infrastructure just to get the feature
>> launched, that's a bit hard to stomach.
>>
>
> Triggering a vacuum or some hypothetical "scrubbing" feature?
>

What you were suggesting doesn't require triggering just a vacuum
though--it requires triggering some number of vacuums, for all impacted
relations. You said yourself that "all tables if there's no way to
rule any of them out" was a possibility. I'm just pointing out that
scheduling that level of work is a logistics headache, and it would be
reasonable for people to expect some help with that were it to become a
necessary thing falling out of the implementation.

> Some people think I border on the paranoid on this issue.

Those people are also out to get you, just like the hardware.

> Are you arguing that autovacuum should be disabled after crash
> recovery? I guess if you are arguing that a database VACUUM might
> destroy recoverable data when hardware starts to fail, I can't
> argue.

A CRC failure suggests to me a significantly higher likelihood of
hardware problems that will lead to further corruption than a normal
crash does, though.

>> The main way I expect to validate this sort of thing is with an as
>> yet unwritten function to grab information about a data block from
>> a standby server for this purpose, something like this:
>>
>> Master: Computed CRC A, Stored CRC B; error raised because A!=B
>> Standby: Computed CRC C, Stored CRC D
>>
>> If C==D && A==C, the corruption is probably overwritten bits of
>> the CRC B.
>>
>
> Are you arguing we need *that* infrastructure to get the feature
> launched?
>

No; just pointing out the things I'd eventually expect people to want,
because they help answer questions about what to do when CRC failures
occur. The most reasonable answer to "what should I do about suspected
corruption on a page?" in most of the production situations I worry
about is "see if it's recoverable from the standby". I see this as
being similar to how RAID-1 works: if you find garbage on one drive,
and you can get a clean copy of the block from the other one, use that
to recover the missing data. If you don't have that capability, you're
stuck with no clear path forward when a CRC failure happens, as you
noted downthread.

This obviously gets troublesome if you've recently written a page out,
so there's some concern about whether you are checking against the
correct version of the page or not, based on where the standby's replay
is at. I see that as being a case that's also possible to recover from
though, because then the page you're trying to validate on the master is
likely sitting in the recent WAL stream. This is already the sort of
thing companies doing database recovery work (of which we are one) deal
with, and I doubt any proposal will cover every possible situation. In
some cases there may be no better answer than "show all the known
versions and ask the user to sort it out". The method I suggested would
sometimes kick out an automatic fix.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-21 08:32:28
Message-ID: 4EF1999C.4080301@yahoo.it
Lists: pgsql-hackers

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such as
the one available in ZFS and btrfs?


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-21 15:19:10
Message-ID: 20111221151910.GG24234@tamriel.snowman.net
Lists: pgsql-hackers

* Leonardo Francalanci (m_lists(at)yahoo(dot)it) wrote:
> I can't help in this discussion, but I have a question:
> how different would this feature be from filesystem-level CRC, such
> as the one available in ZFS and btrfs?

Depends on how much you trust the filesystem. :)

Stephen


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <greg(at)2ndQuadrant(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-21 15:21:53
Message-ID: 4EF1A5310200002500043F37@gw.wicourts.gov
Lists: pgsql-hackers

Greg Smith <greg(at)2ndQuadrant(dot)com> wrote:
>> Some people think I border on the paranoid on this issue.
>
> Those people are also out to get you, just like the hardware.

Hah! I *knew* it!

>> Are you arguing that autovacuum should be disabled after crash
>> recovery? I guess if you are arguing that a database VACUUM
>> might destroy recoverable data when hardware starts to fail, I
>> can't argue.
>
> A CRC failure suggests to me a significantly higher possibility
> of hardware likely to lead to more corruption than a normal crash
> does though.

Yeah, the discussion has me coming around to the point of view
advocated by Andres: that it should be treated the same as corrupt
pages detected through other means. But that can only be done if
you eliminate false positives from hint-bit-only updates. Without
some way to handle that, I guess that means the idea is dead.

Also, I'm not sure that our shop would want to dedicate any space
per page for this, since we're comparing between databases to ensure
that values actually match, row by row, during idle time. A CRC or
checksum is a lot weaker than that. I can see where it would be
very valuable where more rigorous methods aren't in use; but it
would really be just extra overhead with little or no benefit for
most of our database clusters.

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Greg Smith" <greg(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Subject: Re: Page Checksums
Date: 2011-12-21 15:29:45
Message-ID: 201112211629.45491.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday, December 21, 2011 04:21:53 PM Kevin Grittner wrote:
> Greg Smith <greg(at)2ndQuadrant(dot)com> wrote:
> >> Some people think I border on the paranoid on this issue.
> >
> > Those people are also out to get you, just like the hardware.
>
> Hah! I *knew* it!
>
> >> Are you arguing that autovacuum should be disabled after crash
> >> recovery? I guess if you are arguing that a database VACUUM
> >> might destroy recoverable data when hardware starts to fail, I
> >> can't argue.
> >
> > A CRC failure suggests to me a significantly higher possibility
> > of hardware likely to lead to more corruption than a normal crash
> > does though.
>
> Yeah, the discussion has me coming around to the point of view
> advocated by Andres: that it should be treated the same as corrupt
> pages detected through other means. But that can only be done if
> you eliminate false positives from hint-bit-only updates. Without
> some way to handle that, I guess that means the idea is dead.
>
> Also, I'm not sure that our shop would want to dedicate any space
> per page for this, since we're comparing between databases to ensure
> that values actually match, row by row, during idle time. A CRC or
> checksum is a lot weaker than that. I can see where it would be
> very valuable where more rigorous methods aren't in use; but it
> would really be just extra overhead with little or no benefit for
> most of our database clusters.
Comparing between databases will not come close to catching failures in
all the data, because you surely will not use all indexes. With index-only
scans the likelihood of unnoticed heap corruption also increases.
E.g. I have seen disk-level corruption silently corrupt a unique index so
that it didn't cover all the data anymore, which led to rather big problems.
Not everyone can do regular dump+restore tests to protect against such
scenarios...

Andres


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-21 15:32:53
Message-ID: 4EF1FC25.9000706@yahoo.it
Lists: pgsql-hackers

On 21/12/2011 16.19, Stephen Frost wrote:
> * Leonardo Francalanci (m_lists(at)yahoo(dot)it) wrote:
>> I can't help in this discussion, but I have a question:
>> how different would this feature be from filesystem-level CRC, such
>> as the one available in ZFS and btrfs?
>
> Depends on how much you trust the filesystem. :)

Ehm I hope that was a joke...

I think what I meant was: isn't this going to be useless in a couple of
years (if, say, btrfs becomes available)? Or does it actually give something
that the FS will never be able to give?


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-21 15:36:05
Message-ID: 4EF1FCE5.80401@enterprisedb.com
Lists: pgsql-hackers

On 21.12.2011 17:21, Kevin Grittner wrote:
> Also, I'm not sure that our shop would want to dedicate any space
> per page for this, since we're comparing between databases to ensure
> that values actually match, row by row, during idle time.

4 bytes out of an 8k block is just under 0.05%. I don't think anyone is
going to notice the extra disk space consumed by this. There are all those
other issues like the hint bits that make this a non-starter, but disk
space overhead is not one of them.

IMHO we should just advise that you use a filesystem with CRCs if
you want that extra level of safety. It's the hardware's and operating
system's job to ensure that data doesn't get corrupted after we hand it
over to the OS with write()/fsync().

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-21 15:36:56
Message-ID: CA+TgmoZF54QDBvcN5pZ_3j5rBMVe9pT62==15QW8ucnz-_otCw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 20, 2011 at 12:12 PM, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
> This seems to be a frequent problem with this whole "doing CRCs on pages" thing.
>
> It's not evident which problems will be "real" ones.

That depends on the implementation. If we have a flaky, broken
implementation such as the one proposed, then, yes, it will be
unclear. But if we properly guard against a torn page invalidating
the CRC, then it won't be unclear at all: any CRC mismatch means
something bad happened.

Of course, that may be fairly expensive in terms of performance. But
the only way I can see to get around that problem is to rewrite our
heap AM or our MVCC implementation in some fashion that gets rid of
hint bits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-21 15:49:14
Message-ID: 20111221154914.GH24234@tamriel.snowman.net
Lists: pgsql-hackers

* Leonardo Francalanci (m_lists(at)yahoo(dot)it) wrote:
> >Depends on how much you trust the filesystem. :)
>
> Ehm I hope that was a joke...

It certainly wasn't..

> I think what I meant was: isn't this going to be useless in a couple
> of years (if, say, btrfs becomes available)? Or does it actually give
> something that the FS will never be able to give?

Yes, it will help you find/address bugs in the filesystem. These things
are not unheard of...

Thanks,

Stephen


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-21 16:27:30
Message-ID: 4EF208F2.6030907@yahoo.it
Lists: pgsql-hackers

>> I think what I meant was: isn't this going to be useless in a couple
>> of years (if, say, btrfs becomes available)? Or does it actually give
>> something that the FS will never be able to give?
>
> Yes, it will help you find/address bugs in the filesystem. These things
> are not unheard of...

It sounds to me like a huge job to fix some issues that are merely "not unheard of"...

My point is: if we are trying to protect against misbehaving drives/controllers
(something that is more common than one might think), that's already
done by ZFS on Solaris and FreeBSD, and will be done by btrfs for Linux.

I understand not trusting drives/controllers; but not trusting a
filesystem...

What am I missing? (I'm far from being an expert... I just don't
understand...)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Greg Smith <greg(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-21 16:34:08
Message-ID: 6408.1324485248@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> 4 bytes out of an 8k block is just under 0.05%. I don't think anyone is
> going to notice the extra disk space consumed by this. There are all those
> other issues like the hint bits that make this a non-starter, but disk
> space overhead is not one of them.

The bigger problem is that adding a CRC necessarily changes the page
format and therefore breaks pg_upgrade. As Greg and Simon already
pointed out upthread, there's essentially zero chance of this getting
applied before we have a solution that allows pg_upgrade to cope with
page format changes. A CRC feature is not compelling enough to justify
a non-upgradable release cycle.

regards, tom lane


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-21 19:35:50
Message-ID: 4EF23516.8050904@2ndQuadrant.com
Lists: pgsql-hackers

On 12/21/2011 10:49 AM, Stephen Frost wrote:
> * Leonardo Francalanci (m_lists(at)yahoo(dot)it) wrote:
>
>> I think what I meant was: isn't this going to be useless in a couple
>> of years (if, say, btrfs becomes available)? Or does it actually give
>> something that the FS will never be able to give?
>>
> Yes, it will help you find/address bugs in the filesystem. These things
> are not unheard of...
>

There was a spike in data recovery business here after people started
migrating to ext4. New filesystems are no fun to roll out; some bugs
will only get shaken out when brave early adopters deploy them.

And there's even more radical changes in btrfs, since it wasn't starting
with a fairly robust filesystem as a base. And putting my tin foil hat
on, I don't feel real happy about assuming *the* solution for this issue
in PostgreSQL is the possibility of a filesystem coming one day when
that work is being steered by engineers who work at Oracle.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums
Date: 2011-12-21 22:32:13
Message-ID: 20111221223212.GB18049@svana.org
Lists: pgsql-hackers

On Wed, Dec 21, 2011 at 09:32:28AM +0100, Leonardo Francalanci wrote:
> I can't help in this discussion, but I have a question:
> how different would this feature be from filesystem-level CRC, such
> as the one available in ZFS and btrfs?

Hmm, filesystems are not magical. If they implement this then they will
have the same issues with torn pages as Postgres would. Which I
imagine they solve by doing a transactional update: writing the new
page to a new location with its checksum, then updating a pointer. They
can't even put the checksum on the same page, like we could. How that
interacts with seqscans I have no idea.
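
By way of illustration, a toy version of that transactional update looks
like this (just the general idea; nothing to do with how ZFS or btrfs
actually lay things out):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>                  /* crc32(); build with -lz */

  #define BLOCK_SIZE 4096
  #define NSLOTS        8

  /* Physical slots; the checksum lives next to the pointer, not in the block. */
  static unsigned char slots[NSLOTS][BLOCK_SIZE];

  typedef struct
  {
      int      slot;                 /* which slot currently holds the block */
      uint32_t crc;                  /* checksum of that slot's contents     */
  } BlockPtr;

  static uint32_t
  checksum(const unsigned char *buf)
  {
      return (uint32_t) crc32(0L, buf, BLOCK_SIZE);
  }

  /* Copy-on-write update: write data + CRC elsewhere, then swap the pointer. */
  static void
  cow_update(BlockPtr *ptr, const unsigned char *newdata, int free_slot)
  {
      memcpy(slots[free_slot], newdata, BLOCK_SIZE);
      /* A crash before the next line leaves the old block and old CRC
         intact, so a torn, half-written block is never visible. */
      *ptr = (BlockPtr){ .slot = free_slot,
                         .crc  = checksum(slots[free_slot]) };
  }

  int
  main(void)
  {
      unsigned char data[BLOCK_SIZE];
      BlockPtr      ptr = { .slot = 0 };

      memset(data, 0xAA, sizeof(data));
      memcpy(slots[0], data, BLOCK_SIZE);
      ptr.crc = checksum(slots[0]);

      data[0] = 0xBB;                /* modify the logical block     */
      cow_update(&ptr, data, 1);     /* new version goes into slot 1 */

      printf("block now in slot %d, crc ok: %d\n",
             ptr.slot, checksum(slots[ptr.slot]) == ptr.crc);
      return 0;
  }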

Certainly I think we could look to them for implementation ideas, but I
don't imagine they've got something that can't be specialised for
better performance.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-21 23:13:17
Message-ID: CA+U5nMKG7E=ab4R44wR7tNQoqf0G-t1WR9=KoeNrnoTRiehO-A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 21, 2011 at 7:35 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

> And there's even more radical changes in btrfs, since it wasn't starting
> with a fairly robust filesystem as a base.  And putting my tin foil hat on,
> I don't feel real happy about assuming *the* solution for this issue in
> PostgreSQL is the possibility of a filesystem coming one day when that work
> is being steered by engineers who work at Oracle.

Agreed.

I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-22 08:40:50
Message-ID: 4EF2ED12.5080802@yahoo.it
Lists: pgsql-hackers

> Agreed.
>
> I do agree with Heikki that it really ought to be the OS problem, but
> then we thought that about dtrace and we're still waiting for that or
> similar to be usable on all platforms (+/- 4 years).

My point is that it looks like this is going to take 1-2 years in
PostgreSQL, so it is a huge job... but at the same time I
understand we can't just "hope other filesystems will catch up"!

I guess this feature will be tunable (off/on)?


From: Greg Stark <stark(at)mit(dot)edu>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-25 22:18:51
Message-ID: CAM-w4HPUJC1mr1XxRBcykHVh8nSW9dWdK3LmQ=NrpNmeTQSF9w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 19, 2011 at 7:16 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> It seems to me that on a typical production system you would
> probably have zero or one such page per OS crash

Incidentally I don't think this is right. There are really two kinds
of torn pages:

1) The kernel VM has many dirty 4k pages and decides to flush one 4k
page of a Postgres 8k buffer but not the other one. It doesn't sound
very logical for it to do this but it has the same kind of tradeoffs
to make that Postgres does and there could easily be cases where the
extra book-keeping required to avoid it isn't deemed worthwhile. The
two memory pages might not even land on the same part of the disk
anyway, so flushing one and not the other might be reasonable.

In this case there could be an unbounded number of such torn pages and
they can stay torn on disk for a long period of time so the torn pages
may not have been actively being written when the crash occurred. On
Linux these torn pages will always be on memory page boundaries -- ie
4k blocks on x86.

2) The i/o system was in the process of writing out blocks and the
system lost power or crashed as they were being written out. In this
case there will probably only be 0 or 1 torn pages -- perhaps as many
as the scsi queue depth if there's some weird i/o scheduling going on.
In this case the torn page could be on a hardware block boundary --
often 512 byte boundaries (or if the drives don't guarantee otherwise
it could corrupt a disk block).
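
Either kind trips a page checksum the same way, which is easy to
demonstrate (toy code, with zlib's crc32 standing in for whatever
checksum would actually get used):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>                  /* crc32(); build with -lz */

  #define PAGE_SIZE 8192
  #define OS_PAGE   4096             /* kernel VM granularity on x86 */

  int
  main(void)
  {
      unsigned char before[PAGE_SIZE];   /* page image prior to the update */
      unsigned char after[PAGE_SIZE];    /* image we intended to write     */
      unsigned char on_disk[PAGE_SIZE];  /* what actually hit the platter  */

      memset(before, 0x11, sizeof(before));
      memset(after,  0x22, sizeof(after));

      /* The checksum is computed over the whole new image before writing. */
      uint32_t stored_crc = (uint32_t) crc32(0L, after, PAGE_SIZE);

      /* Torn write: only the first 4k of the new image made it out; the
         second 4k still holds the old contents. */
      memcpy(on_disk,           after,            OS_PAGE);
      memcpy(on_disk + OS_PAGE, before + OS_PAGE, PAGE_SIZE - OS_PAGE);

      uint32_t disk_crc = (uint32_t) crc32(0L, on_disk, PAGE_SIZE);

      printf("stored %08x, recomputed %08x -> %s\n",
             (unsigned) stored_crc, (unsigned) disk_crc,
             stored_crc == disk_crc ? "match" : "checksum failure");
      return 0;
  }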

--
greg


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, greg(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums
Date: 2011-12-27 19:19:28
Message-ID: 1325013568.11655.10.camel@jdavis
Lists: pgsql-hackers

On Sun, 2011-12-25 at 22:18 +0000, Greg Stark wrote:
> 2) The i/o system was in the process of writing out blocks and the
> system lost power or crashed as they were being written out. In this
> case there will probably only be 0 or 1 torn pages -- perhaps as many
> as the scsi queue depth if there's some weird i/o scheduling going on.

That would also depend on how many disks you have and what configuration
they're in, right?

Regards,
Jeff Davis