Protecting against unexpected zero-pages: proposal

From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Protecting against unexpected zero-pages: proposal
Date: 2010-11-06 10:42:40
Message-ID: AANLkTimt2xZDDUiRqMS3aTTRxVQY6ALZNhF5ou1_736w@mail.gmail.com
Lists: pgsql-hackers

A customer of ours is quite bothered about finding zero pages in an index after a system crash. The task now is to improve the diagnosability of such an issue and be able to definitively point to the source of the zero pages.

The proposed solution below has been vetted in-house at EnterpriseDB, and I am posting it here to see any possible problems we missed, and also to find out whether the community would be interested in incorporating this capability.

Background:
-----------
SUSE Linux, ATCA board, 4 dual-core CPUs => 8 cores, 24 GB RAM, 140 GB disk, PG 8.3.11. RAID-1 SAS, with SCSIinfo reporting that write-caching is disabled.

The corrupted index's file contents, based on hexdump:

    It has a total of 525 pages (cluster block size is 8K, per pg_controldata).
    Blocks 0 to 278 look sane.
    Blocks 279 to 518 are full of zeroes.
    Blocks 519 to 522 look sane.
    Block 523 is filled with zeroes.
    Block 524 looks sane.

The tail ends of blocks 278 and 522 have some non-zero data, meaning that those index pages have valid 'Special space' contents. Also, the heads of blocks 519 and 524 look sane. These two findings imply that the zeroing happened at 8K page boundaries. This is a standard ext3 FS with a 4K block size, so this raises the question of how we can ascertain that this was indeed a hardware/FS malfunction. And if it was a hardware/FS problem, why didn't we see zeroes at a 1/2 K boundary (generally the disk's sector size) or a 4K boundary (the default ext3 FS block size) that does not align with an 8K boundary?

The backup from before the crash does not have these zero-pages.

Disk Page Validity Check Using Magic Number
===========================================

Requirement:
------------
We have encountered quite a few zero pages in an index after a machine crash, causing this index to be unusable. Although REINDEX is an option, we have no way of telling whether these zero pages were caused by the hardware, the filesystem, or by Postgres. Analysis of the Postgres code shows that Postgres being the culprit is a very low probability; similarly, since our hardware is also considered to be of good quality, with hardware-level RAID-1 over 2 disks, it is difficult to consider the hardware to be the problem. The ext3 filesystem being used is also quite a time-tested piece of software. Hence it becomes very difficult to point fingers at any of these 3 components for this corruption.

Postgres is being deployed as a component of a carrier-grade platform, and it is required to run unattended as much as possible. There is a High Availability monitoring component that is tasked with performing switchover to a standby node in the event of any problem with the primary node. This HA component needs to perform regular checks on the health of all the other components, including Postgres, and take corrective actions.

With the zero pages comes the difficulty of ascertaining whether these are legitimate zero pages (since Postgres considers zero pages valid, perhaps left over from a previous file extension followed by a crash), or whether these zero pages are the result of an FS/hardware failure.

We are required to definitively differentiate between zero pages written by Postgres and zero pages caused by hardware failure. Obviously this is not possible by the very nature of the problem, so we explored a few ideas, including per-block checksums (in-block or in a checksum fork), S.M.A.R.T. monitoring of the disk drives, doing PageInit() before smgrextend() in ReadBuffer_common(), and an additional member in PageHeader for a magic number.

Following is the approach which we think is least invasive and does not threaten code breakage, yet provides definitive detection of corruption/data loss outside Postgres with the least performance penalty.

Implementation:
---------------

.) The basic idea is to have a magic number in every PageHeader before it is written to disk, and check for this magic number when performing page validity checks.

.) To avoid adding a new field to PageHeader, and any code breakage, we reuse an existing member of the structure.

.) We exploit the following facts and assumptions (see the sketch after this list for an illustration of the magic number):
    -) Relations/files are extended 8 KB (BLCKSZ) at a time.
    -) Every I/O unit contains a PageHeader structure (table/index/fork files), which in turn contains pd_lsn as its first member.
    -) Every newly written block is considered to be zero filled.
    -) PageIsNew() assumes that if pd_upper is 0 then the page is a zero (new) page.
    -) PageHeaderIsValid() allows zero-filled pages to be considered valid.
    -) Anyone wishing to use a new page has to do PageInit() on the page.
    -) PageInit() does a MemSet(0) on the whole page.
    -) XLogRecPtr = {x, 0} is considered invalid.
    -) XLogRecPtr = {x, ~((uint32) 0)} is not valid either (i.e. it points at the last byte of an xlog file (not segment)); we'll use this as the magic number.
       ... The above is my assumption, since it is not mentioned anywhere in the code. The XLogFileSize calculation seems to support this assumption.
       ... If this assumption doesn't hold good, then the previous form, {x, 0} with x > 0, can also be used to implement this magic number.
    -) There's only one implementation of the Storage Manager, i.e. md.c.
    -) smgr_extend() -> mdextend() is the only place where a relation is extended.
    -) Writing beyond EOF in a file causes the intermediate space to become a hole, and any read from such a hole returns zero-filled pages.
    -) Anybody trying to extend a file makes sure that there's no concurrent extension going on from somewhere else.
       ... This is ensured either by the implicit nature of the calling code, or by calling LockRelationForExtension().
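
To make the last few assumptions concrete, here is a minimal sketch of the magic value and two helpers. The names (EXTENDED_PAGE_MAGIC_*, PageSetExtendMagic(), PageHasExtendMagic()) and the particular choice of {0, ~((uint32) 0)} are purely illustrative, not part of any existing API:

#include "postgres.h"
#include "access/xlogdefs.h"    /* XLogRecPtr */
#include "storage/bufpage.h"    /* Page, PageHeader */

/* Illustrative magic LSN; never a real WAL position */
#define EXTENDED_PAGE_MAGIC_XLOGID      ((uint32) 0)
#define EXTENDED_PAGE_MAGIC_XRECOFF     (~((uint32) 0))

/* Stamp a freshly extended (all-zero) page with the magic number. */
static void
PageSetExtendMagic(Page page)
{
    PageHeader  p = (PageHeader) page;

    p->pd_lsn.xlogid = EXTENDED_PAGE_MAGIC_XLOGID;
    p->pd_lsn.xrecoff = EXTENDED_PAGE_MAGIC_XRECOFF;
}

/* Does this page carry the magic number written at extension time? */
static bool
PageHasExtendMagic(Page page)
{
    PageHeader  p = (PageHeader) page;

    return p->pd_lsn.xlogid == EXTENDED_PAGE_MAGIC_XLOGID &&
           p->pd_lsn.xrecoff == EXTENDED_PAGE_MAGIC_XRECOFF;
}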

.) In mdextend(), if the buffer being written is zero filled, then we write the magic number in that page's pd_lsn.
    ... This check can be optimized to check just sizeof(pd_lsn) worth of the buffer.
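
A rough sketch of this step, using the hypothetical helpers from the sketch above; the full-page scan shown here is the unoptimized form, and checking only the first sizeof(pd_lsn) bytes is the optimization mentioned:

/* Called from mdextend() just before the new block is written (sketch) */
static void
stamp_extend_page_if_zero(char *buffer)
{
    int     i;

    for (i = 0; i < BLCKSZ; i++)
    {
        if (buffer[i] != 0)
            return;             /* caller already initialized this page */
    }

    /* entirely zero: make sure it never reaches disk as all-zeroes */
    PageSetExtendMagic((Page) buffer);
}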

.) In mdextend(), if the buffer is being written beyond the current EOF, then we forcibly write the intermediate blocks too, and write the magic number in each of those.
    ... This needs an _mdnblocks() call and FileSeek(SEEK_END) + FileWrite() calls for every block in the hole.
    ... Creation of holes is assumed to be a very limited corner case, hence this performance hit is acceptable in these rare cases. Tests are being planned using a real application, to check how many times this occurs.
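
A sketch of the hole-filling loop follows, again with the illustrative helpers from above. Segment boundaries and the exact _mdnblocks()/FileSeek()/FileWrite() calling conventions (which differ across branches) are glossed over; this is only an outline of the idea:

/* Inside mdextend(), before writing block 'blocknum' (sketch only) */
BlockNumber curblocks = _mdnblocks(reln, forknum, v);
BlockNumber blk;
char        holebuf[BLCKSZ];

for (blk = curblocks; blk < blocknum; blk++)
{
    MemSet(holebuf, 0, BLCKSZ);
    PageSetExtendMagic((Page) holebuf);     /* no hole block stays all-zero */

    if (FileSeek(v->mdfd_vfd, (off_t) blk * BLCKSZ, SEEK_SET) < 0 ||
        FileWrite(v->mdfd_vfd, holebuf, BLCKSZ) != BLCKSZ)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not pre-fill block %u of relation", blk)));
}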

.) PageHeaderIsValid() needs to be modified to allow MagicNumber-followed-by-zeroes as a valid page (rather than a completely zero page).
    ... If a page is completely filled with zeroes, this confirms the fact that either the filesystem or the disk storage zeroed it, since Postgres never wrote zero pages to disk.
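
A sketch of how the "all-zeroes is valid" fallback in PageHeaderIsValid() could change, using the hypothetical PageHasExtendMagic() from the earlier sketch:

/* Replacement for the all-zeroes fallback in PageHeaderIsValid() (sketch) */
static bool
page_is_valid_extend_page(Page page)
{
    char   *pagebytes = (char *) page;
    int     i;

    /* a completely zero page is no longer accepted as valid */
    if (!PageHasExtendMagic(page))
        return false;

    /* everything after pd_lsn must still be zero */
    for (i = sizeof(XLogRecPtr); i < BLCKSZ; i++)
    {
        if (pagebytes[i] != 0)
            return false;
    }
    return true;
}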

.) PageInit() and PageIsNew() require no change.

.) XLByteLT(), XLByteLE() and XLByteEQ() may be changed to contain AssertMacro( !MagicNumber(a) && !MagicNumber(b) ).
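
For instance (sketch only; XLRecPtrIsExtendMagic() is a hypothetical predicate built from the illustrative magic values above, and the body of XLByteLT() is otherwise unchanged):

#define XLRecPtrIsExtendMagic(r) \
    ((r).xlogid == EXTENDED_PAGE_MAGIC_XLOGID && \
     (r).xrecoff == EXTENDED_PAGE_MAGIC_XRECOFF)

#define XLByteLT(a, b) \
    (AssertMacro(!XLRecPtrIsExtendMagic(a) && !XLRecPtrIsExtendMagic(b)), \
     ((a).xlogid < (b).xlogid || \
      ((a).xlogid == (b).xlogid && (a).xrecoff < (b).xrecoff)))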

.) I haven't analyzed the effects of this change on the recovery code, but I have a feeling that we might not need to change anything there.

.) We can create a contrib module (a standalone binary or a loadable module) that goes through each disk page, checks whether it is zero filled, and raises an alarm if it finds any such page.
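
A minimal sketch of what such a standalone checker could look like (name and usage are illustrative only): it walks a relation segment file in BLCKSZ-sized chunks and flags any block that is entirely zero, which under this proposal Postgres itself should never have written:

#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

static const char zerobuf[BLCKSZ];      /* implicitly all zeroes */

int
main(int argc, char **argv)
{
    FILE       *fp;
    char        buf[BLCKSZ];
    long        blkno = 0;
    int         found = 0;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL)
    {
        fprintf(stderr, "usage: zeropagecheck <relation-segment-file>\n");
        return 2;
    }

    while (fread(buf, 1, BLCKSZ, fp) == BLCKSZ)
    {
        if (memcmp(buf, zerobuf, BLCKSZ) == 0)
        {
            printf("block %ld is entirely zero\n", blkno);
            found = 1;
        }
        blkno++;
    }

    fclose(fp);
    return found;       /* non-zero exit status if any zero page was found */
}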

Thoughts welcome.
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh(dot)gurjeet(at){ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device
