Re: BUG #5055: Invalid page header error

Lists: pgsql-bugs
From: "john martin" <postgres_bee(at)live(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #5055: Invalid page header error
Date: 2009-09-14 23:17:49
Message-ID: 200909142317.n8ENHnsA004071@wwwmaster.postgresql.org


The following bug has been logged online:

Bug reference: 5055
Logged by: john martin
Email address: postgres_bee(at)live(dot)com
PostgreSQL version: 8.3.6
Operating system: Centos 5.2 32 bit
Description: Invalid page header error
Details:

All of a sudden we started seeing page header errors in certain queries. The
messages are in the form of "ERROR: invalid page header in block xxxx of
relation xxxx". The query fails. I found many previous messages in the
archives. Most, if not all, replies seemed to indicate hardware errors. I
have run all the disk/memory tests like fsync and memtest86 but nothing was
found. I have also rebooted it multiple times.

I found an unsatisfactory workaround that causes, ahem, data loss. We went
ahead with it anyway because, fortunately, the error happened in our dev
environment. IOW, we could tolerate the data loss. The workaround consists
of adding the following parameter to postgresql.conf and restarting
postgres.
"zero_damaged_pages=TRUE"

We no longer see the error messages with the above workaround. Needless to
say, the workaround cannot be used in production. But the database is
running on the SAME HARDWARE. Is it possible that it is a postgres bug?

To my surprise, I found the same issue reported 5 years back.
http://archives.postgresql.org/pgsql-hackers/2004-09/msg00869.php

I am urging the community to investigate the possibility that it may not be
hardware related, especially since it was first reported at least 5 years
back. Or maybe you have decided not to fix it because very few people
report it. I have a very good opinion of postgres quality.
While I am not 100% sure it is a bug (only circumstantial evidence), I do
think it improves the product quality to fix an annoying old bug.


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: john martin <postgres_bee(at)live(dot)com>
Cc: PostgreSQL bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-15 05:43:15
Message-ID: 1252993395.27254.16.camel@wallace.localnet
Lists: pgsql-bugs

On Mon, 2009-09-14 at 23:17 +0000, john martin wrote:
> All of a sudden we started seeing page header errors in certain queries.

Was there any particular event that marked the onset of these issues?

Anything in the system logs (dmesg / syslog etc) around that time?

[for SATA disks]: does smartctl from the smartmontools package indicate
anything interesting about the disk(s)? (Ignore the "health status",
it's a foul lie, and rely on the error log plus the vendor attributes:
reallocated sector count, pending sector, uncorrectable sector count,
etc).

Was Pg forcibly killed and restarted, or the machine hard-reset? (This
_shouldn't_ cause data corruption, but might give some starting point
for looking for a bug).

> I am urging the community to investigate the possibility that it may not be
> hardware related, especially since it was first reported at least 5 years
> back.

If anything, the fact that it was first reported 5 years back makes it
_more_ likely to be hardware related. Bad hardware eats/scrambles some
of your data; Pg goes "oh crap, that page is garbage". People aren't
constantly getting their data eaten, though, despite the age of the
initial reports.

It's not turning up lots. It's not turning up in cases where hardware
issues can be ruled out. There doesn't seem to be a strong pattern
associating issues to a particular CPU / disk controller / drive etc to
suggest it could be Pg triggering a hardware bug or a bug in Pg
triggered by a hardware quirk. It doesn't seem to be reproducible and
people generally don't seem to be able to trigger the issue repeatedly.
Either it's a *really* rare and quirky bug that's really hard to
trigger, or it's a variety of hardware / disk issues.

If it's a really rare and quirky hard to trigger bug, where do you even
start looking without *some* idea what happened to trigger the issue? Do
you have any idea what might've started it in your case?

*** DID YOU TAKE COPIES OF YOUR DATA FILES BEFORE "FIXING" THEM *** ?

--
Craig Ringer


From: John R Pierce <pierce(at)hogranch(dot)com>
To: PostgreSQL bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-15 05:58:35
Message-ID: 4AAF2D0B.1080208@hogranch.com
Lists: pgsql-bugs

Craig Ringer wrote:
> [for SATA disks]: does smartctl from the smartmontools package indicate
> anything interesting about the disk(s)? (Ignore the "health status",
> it's a foul lie, and rely on the error log plus the vendor attributes:
> reallocated sector count, pending sector, uncorrectable sector count,
> etc).
>

and, if you're doing RAID with desktop grade disks, it's quite possible
for the drive to spontaneously decide a sector error requires a data
relocation but not have the 'good' data to relocate, and not return an
error code in time for the RAID controller or host md-raid to do
anything about it. this results in a very sneaky sort of data
corruption which goes undetected until some time later.

this is the primary reason to use the premium "ES" grade SATA drives
rather than the cheaper desktop stuff in a raid, they return sector
errors in a timely fashion rather than retrying for many minutes in the
background.


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: John R Pierce <pierce(at)hogranch(dot)com>
Cc: PostgreSQL bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-15 06:32:06
Message-ID: 1252996326.27254.21.camel@wallace.localnet
Lists: pgsql-bugs

On Mon, 2009-09-14 at 22:58 -0700, John R Pierce wrote:

> and, if you're doing RAID with desktop grade disks, it's quite possible
> for the drive to spontaneously decide a sector error requires a data
> relocation but not have the 'good' data to relocate, and not return an
> error code in time for the RAID controller or host md-raid to do
> anything about it. this results in a very sneaky sort of data
> corruption which goes undetected until some time later.
>
>
> this is the primary reason to use the premium "ES" grade SATA drives
> rather than the cheaper desktop stuff in a raid, they return sector
> errors in a timely fashion rather than retrying for many minutes in the
> background.

Ugh, really?

What do the desktop drives return in the meantime, when they haven't
been able to read a sector properly? Make something up and hope it gets
written to soon? That seems too hacky even for desktop HDD firmware,
which is saying something.

I've generally seen fairly prompt failure responses from desktop-grade
drives (and I see a lot of them fail!). While there are usually many
layers of OS-driven retries above the drive that delay reporting of
errors, the RAID volume the drive is a member of will generally block
until a retry succeeds or the OS layers between the software RAID
implementation and the disk give up and pass on the disk's error report.
That said, I've mostly used Linux's `md' software RAID, which while
imperfect seems to be pretty sane in terms of data preservation.

--
Craig Ringer


From: postgres bee <postgres_bee(at)live(dot)com>
To: <craig(at)postnewspapers(dot)com(dot)au>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-15 21:11:47
Message-ID: BLU139-W213E0949C48D1EA6ABAEA285E30@phx.gbl
Lists: pgsql-bugs


> Was there any particular event that marked the onset of these issues?
> Anything in the system logs (dmesg / syslog etc) around that time?

Unfortunately, cannot recall anything abnormal.

> Was Pg forcibly killed and restarted, or the machine hard-reset? (This
> _shouldn't_ cause data corruption, but might give some starting point
> for looking for a bug).
It was a normal pg shutdown followed by a reboot.

> *** DID YOU TAKE COPIES OF YOUR DATA FILES BEFORE "FIXING" THEM *** ?
No. Should have taken a backup, in hindsight. At that time, the primary motivation was to get over the hump and have the query run successfully.

Correct me if I am wrong, but I thought one of the primary tasks of a relational database, if not the primary task, is to ensure that no data loss ever occurs. Which is why I was initially surprised that the issue did not get enough importance. But now it seems more like the community does not know what triggered the issue, i.e. does not know which component to fix.

But I do have one overriding question - since postgres is still running on the same hardware, wouldn't it rule out hardware as the primary suspect?



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: postgres bee <postgres_bee(at)live(dot)com>
Cc: craig(at)postnewspapers(dot)com(dot)au, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-15 23:21:06
Message-ID: 25234.1253056866@sss.pgh.pa.us
Lists: pgsql-bugs

postgres bee <postgres_bee(at)live(dot)com> writes:
> But I do have one overriding question - since postgres is still running on the same hardware, wouldn't it rule out hardware as the primary suspect?

Uh, no. One of the principal characteristics of hardware problems is
that they're intermittent. (When they're not, you have a totally dead
machine...)

There may in fact be a software bug here, but you've not only provided
no evidence for that, you've gone out of your way to destroy whatever
evidence might have existed. We are not going to sit around wringing
our hands over the remote possibility of a bug, when there is nothing
we can do to investigate.

If you see it happen again, make a full filesystem-level copy of the
data directory (after shutting down the postmaster and before trying
to recover), and then maybe there's a chance to find something out.

regards, tom lane
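
The advice above (shut down the postmaster, then take a filesystem-level
copy of the data directory before attempting any recovery) could be scripted
roughly as follows. This is only a sketch under stated assumptions: the
function name snapshot_data_directory and both paths are hypothetical, the
check for postmaster.pid is a convenience guard rather than proof that the
server is stopped, and tablespaces stored outside the data directory are not
covered.

```python
import shutil
from pathlib import Path


def snapshot_data_directory(data_dir, dest):
    """Copy a stopped cluster's data directory so damaged blocks survive
    for later inspection. Hypothetical helper, not part of PostgreSQL."""
    data_dir = Path(data_dir)
    dest = Path(dest)

    # postmaster.pid normally exists only while the server is running.
    # A stale file can survive a crash, so treat this as a guard only.
    if (data_dir / "postmaster.pid").exists():
        raise RuntimeError("postmaster.pid present: shut the server down first")

    # Never overwrite an earlier snapshot; it may be the only evidence.
    if dest.exists():
        raise FileExistsError(f"refusing to overwrite {dest}")

    # symlinks=True copies symbolic links as links instead of following them.
    shutil.copytree(data_dir, dest, symlinks=True)
    return dest
```

The snapshot should be taken before enabling zero_damaged_pages or any other
repair step, since those steps overwrite the blocks that would need to be
examined.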


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: postgres bee <postgres_bee(at)live(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-16 00:23:27
Message-ID: 4AB02FFF.1050408@postnewspapers.com.au
Lists: pgsql-bugs

postgres bee wrote:

> Correct me if I am wrong, but I thought one of the primary tasks of a relational database, if not the primary task, is to ensure that no data loss ever occurs. Which is why I was initially surprised that the issue did not get enough importance. But now it seems more like the community does not know what triggered the issue, i.e. does not know which component to fix.

... or if there is anything to fix.

PostgreSQL has to trust the hardware and the OS to do their jobs. If the
OS is, unbeknownst to PostgreSQL, flipping the high bit in any byte
written at exactly midnight on a Tuesday, there's nothing PostgreSQL can
do to prevent it.

If Pg checksummed blocks and read each block back after writing, it could
possibly detect *immediate* write problems - but then, OS caching would
probably hide the issue unless Pg bypassed the OS's caching and forced
direct disk reads. To say this would perform poorly is a spectacular
understatement.
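
As a rough illustration of the per-block checksumming idea, the sketch below
appends a CRC-32 to each block and checks it on read. It is not how
PostgreSQL lays out its pages: the layout (payload followed by a 4-byte
CRC), the names seal_block and verify_block, and the use of zlib.crc32 are
all assumptions made for the example. Only the 8192-byte size matches
PostgreSQL's default block size.

```python
import struct
import zlib

BLOCK_SIZE = 8192               # PostgreSQL's default block size
PAYLOAD_SIZE = BLOCK_SIZE - 4   # hypothetical layout: payload + 4-byte CRC


def seal_block(payload: bytes) -> bytes:
    """Return a full block: the payload followed by its CRC-32."""
    if len(payload) != PAYLOAD_SIZE:
        raise ValueError(f"payload must be {PAYLOAD_SIZE} bytes")
    return payload + struct.pack("<I", zlib.crc32(payload))


def verify_block(block: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the stored value."""
    if len(block) != BLOCK_SIZE:
        return False
    payload = block[:-4]
    (stored,) = struct.unpack("<I", block[-4:])
    return zlib.crc32(payload) == stored
```

A check like this can only report that a block changed after it was sealed.
It cannot say when the change happened or which layer caused it, and a
read-back served from the OS cache would pass even if the on-disk copy were
already wrong.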

Even with such a scheme, there's no guarantee that data isn't being
mangled after it hits disk. The RAID controller might be "helpfully"
"fixing" parity errors in a RAID 5 volume using garbage being returned
by a failing disk during periodic RAID scrubbing. An SSD might have a
buggy wear leveling algorithm that results in blocks being misplaced.
And so on.

Now, in those cases you might see some benefit from an OS-level
checksumming file system, but that won't save you from OS bugs.

It's the OS and the hardware's job to get the data Pg writes to disk
onto disk successfully and accurately, keep it there, and return it
unchanged on request. If they can't do that, there's nothing Pg can do
about it.

> But I do have one overriding question - since postgres is still running on the same hardware, wouldn't it rule out hardware as the primary suspect?

Absolutely not. As Tom Lane noted, such faults are generally intermittent.

For example: I had lots of "fun" years ago tracking down an issue caused
by RAID scrubbing on a defective 3Ware 8500-8 card. The card ran fine in
all my tests, and the system would remain in good condition for a week
or two, but then random file system corruption would start arising.
Files would be filled with garbage or with the contents of other files,
the file system structure would get damaged and need fsck, files would
vanish or turn up in lost+found, etc etc. It turned out that by default
the controller ran a weekly parity check - which was always failing due
to a defect with the controller, triggering a rebuild. The rebuild, due
to the same issue with the controller, would proceed to merrily mangle
the data on the array in the name of restoring parity.

3Ware replaced the controller and all was well.

Now, what's PostgreSQL going to do when it's run on hardware like that?
How can it protect itself?

It can't.

Common causes of intermittent corruption include:

- OS / file system bugs

- Buggy RAID drivers and cards, especially "fake raid" cards

- Physically defective or failing hardware RAID cards

- Defective or overheating memory / CPU, resulting in intermittent
memory corruption that can affect data written to or read from disk.
Doesn't always show up as crashing processes etc as well; such things
can be REALLY quirky.

- Some desktop hard disks, which are so desperate to ensure you don't
return them as defective that they'll do scary things to remap blocks.
"Meh, it was unreadable anyway, I'll just re-allocate it and return
zeroes instead of reporting an error"

Sometimes bugs will only arise in certain circumstances. A RAID
controller bug might only be triggered by a Western Digital "Green" hard
disk with a 1.0 firmware*. An issue with a 2.5" laptop SSD might only
arise when a write is committed to it immediately before it's powered
off as a laptop goes into sleep. A buggy disk might perform an
incomplete write of a block if power from the PSU momentarily drops
below optimal levels because someone turned on the microwave on the same
phase as the server. The list is endless.

What it comes down to, though, is that this issue manifests itself
first as some corrupt blocks in one of the database segments. There's
absolutely no information available about when they got corrupted or by
what part of the system. It could've even been anti-virus software on
the system "disinfecting" them from a suspected virus, ie something
totally outside the normal parts of the system Pg is concerned with. So,
unless an event is noticed that is associated with the corruption, or
some way to reproduce it is found, there's no way to tell whether any
given incident could be a rarely triggered Pg bug (ie: Pg writes wrong
data, writes garbage to files, etc) or whether it's something external
like hardware or interfering 3rd party software.

Make sense?

--
Craig Ringer

* For example, WD Caviar disks a few years ago used to spin down without
request from the OS as a power saving measure. This was Ok with most
OSes, but RAID cards tended to treat them as failed and drop them from
the array. Multiple disk failure array death quickly resulted. Yes, I
had some of those, too - I haven't been lucky with disks.



From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: postgres bee <postgres_bee(at)live(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-16 00:49:38
Message-ID: 4AB03622.5030708@cheapcomplexdevices.com
Lists: pgsql-bugs

Craig Ringer wrote:
> PostgreSQL has to trust the hardware and the OS to do their jobs. If the
> OS is, unbeknownst to PostgreSQL, flipping the high bit in any byte

Might not even be the OS - it could be the stars (through cosmic rays).

http://www.eetimes.com/news/98/1012news/ibm.html
'"This clearly indicates that because of cosmic rays,
for every 256 Mbytes of memory, you'll get one soft
error a month," said Tim Dell, senior design
engineer for IBM Microelectronics. '

> The RAID controller might be "helpfully" "fixing" parity errors
> in a RAID 5 volume using garbage being returned by a failing disk
> during periodic RAID scrubbing.

If your raid controller doesn't have ECC memory, and if IBM's
right about those soft error stats, it might be doing more
harm than good.
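
To put the quoted figure in proportion, here is a back-of-the-envelope
calculation. It assumes the rate of one soft error per 256 Mbytes per month
scales linearly with memory size and applies to memory without ECC. Both are
assumptions made for illustration, not claims from the article, and the
function name is hypothetical.

```python
MB_PER_MONTHLY_ERROR = 256  # quoted figure: one soft error per 256 Mbytes per month


def expected_soft_errors_per_month(memory_mb: float) -> float:
    """Scale the quoted rate linearly with the amount of memory (an assumption)."""
    return memory_mb / MB_PER_MONTHLY_ERROR


# Example: a machine with 4 GB (4096 Mbytes) of non-ECC memory.
errors = expected_soft_errors_per_month(4096)
print(f"{errors:.0f} expected soft errors per month")
```

Under those assumptions, 4096 Mbytes works out to 16 expected soft errors
per month.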


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: postgres bee <postgres_bee(at)live(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5055: Invalid page header error
Date: 2009-09-16 01:50:45
Message-ID: 26633.1253065845@sss.pgh.pa.us
Lists: pgsql-bugs

Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> [ long summary of all the weird and wonderful ways things can break ]
> ... unless an event is noticed that is associated with the corruption, or
> some way to reproduce it is found, there's no way to tell whether any
> given incident could be a rarely triggered Pg bug (ie: Pg writes wrong
> data, writes garbage to files, etc) or whether it's something external
> like hardware or interfering 3rd party software.

Sometimes you can get a good clue by examining the putatively damaged
blocks. For instance, we've seen cases where a block in a Postgres data
file was reported corrupt, and turned out to contain text out of the
system's mail spool. That's a pretty strong hint that either the
filesystem or the disk drive messed up and wrote a chunk of a file at
the wrong place. Isolated flipped bits would suggest memory problems
(per the cosmic-rays issue). And so on. Postgres bugs tend to have
fairly recognizable signatures too. But with no evidence to look at,
there's nothing we can do except speculate.

regards, tom lane
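
The kind of first-pass inspection described above could look something like
the sketch below. It is a crude heuristic under stated assumptions, not a
PostgreSQL tool: the function names, the labels, and the 90% threshold are
all hypothetical. A block of mostly printable text hints at foreign file
contents (such as mail spool data) written to the wrong place, while a
handful of differing bits against a known-good copy hints at memory
problems.

```python
import string

# Byte values that count as printable ASCII (letters, digits, punctuation,
# whitespace).
PRINTABLE = frozenset(string.printable.encode("ascii"))


def describe_block(block: bytes, text_threshold: float = 0.9) -> str:
    """Give a rough label for the contents of a suspect block."""
    if not block:
        return "empty"
    if not any(block):
        return "all zeros"
    printable = sum(b in PRINTABLE for b in block)
    if printable / len(block) >= text_threshold:
        return "mostly text"
    return "binary"


def count_flipped_bits(damaged: bytes, reference: bytes) -> int:
    """Count the bits that differ between a damaged block and a good copy."""
    if len(damaged) != len(reference):
        raise ValueError("blocks must be the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(damaged, reference))
```

Neither function can say what went wrong. They only narrow down which of
the signatures mentioned above a block resembles, and the second one needs a
good copy of the block (for example from a backup) to compare against.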