Re: Database corruption: finding the bad block

From: Erik Jones <erik(at)myemma(dot)com>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: Postgres general mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: Database corruption: finding the bad block
Date: 2007-07-12 14:53:36
Message-ID: 3A298647-2011-4FA4-AB28-CD5B4BC1638A@myemma.com
Lists: pgsql-general

On Jul 12, 2007, at 8:09 AM, Csaba Nagy wrote:

> Hi all,
>
> I just had the following error on one of our databases:
>
> ERROR: could not access status of transaction 1038286848
> DETAIL: could not open file "pg_clog/03DE": No such file or directory
>
> I searched the mailing list and it looks like the usual suspect is
> disk page corruption. There are a few previous discussions about how
> to dump the suspect disk page, e.g.:
>
> http://archives.postgresql.org/pgsql-general/2006-10/msg01372.php
> http://groups.google.com/group/comp.databases.postgresql.hackers/browse_frm/thread/4988752a6939f45a/fd9f12468e86dd3?hl=en&lr=&ie=UTF-8&rnum=8&prev=/groups%3Fq%3Dpg_filedump%26hl%3Den%26lr%3D%26ie%3DUTF-8%26selm%3D20030922162322.E12708%2540quartz.newn.cam.ac.uk%26rnum%3D8#fd9f12468e86dd3
>
> You can probably find more searching for "ERROR: could not access
> status of transaction" or "pg_filedump".
>
> What I could not find was a simple and fast way to find the bad block.
> The error message itself is not useful in this regard, and the "binary
> search" method is anything but fast on a big table.
>
> So I wrote the following plpgsql function:
>
> create or replace function
> find_bad_block(p_TableName text)
> returns tid
> as $find_bad_block$
> declare
>   result tid;
>   crtRow record;
>   count bigint := 0;
> begin
>   -- scan the whole table, remembering the ctid of the last row read;
>   -- when the error is hit, that ctid tells us where we were
>   for crtRow in execute 'select ctid from ' || p_TableName loop
>     result := crtRow.ctid;
>     count := count + 1;
>     if count % 500000 = 0 then
>       raise notice 'rows processed: %', count;
>     end if;
>   end loop;
>   return result;
> exception
>   when OTHERS then
>     -- report the error and return the last ctid that could be read
>     raise notice '%: %', SQLSTATE, SQLERRM;
>     return result;
> end;
> $find_bad_block$
> language plpgsql;
>
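> Called with the table name (schema-qualified if needed; 'table_name'
> below is just a placeholder), the function is invoked like this:
>
> select find_bad_block('table_name');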
>
> This will spit out the error + the last valid block visited. If
> there's no bad block, you won't get the notice with the error, just
> the last block of the table... worked fine for me, resulting in:
>
> NOTICE: 58P01: could not access status of transaction 1038286848
> find_bad_block
> ----------------
> (657780,157)
> (1 row)
>
> Now to finding the file I should dump:
>
> select oid from pg_database where datname = 'dbname';
> oid
> -------
> 16389
> (1 row)
>
> select relfilenode from pg_class where relname='table_name';
> relfilenode
> -------------
> 20418
> (1 row)
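>
> (The two lookups can also be combined into one query; current_database()
> saves typing the database name, and 'table_name' is again a placeholder:
>
> select d.oid as db_oid, c.relfilenode
> from pg_database d, pg_class c
> where d.datname = current_database()
> and c.relname = 'table_name';
> )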
>
> The file is then 'db_path/base/16389/20418', or a collection of
> '20418.x' files if the table's data is more than 1 GB, each file being
> a 1 GB chunk of the table... so which one to dump?
>
> First calculate the block count in one chunk: 1024 * 1024 / block_size,
> where block_size is the block size in KB with which postgres was
> compiled. That's normally 8, but we have systems where it is set to 32.
> If you didn't change that yourself, it is likely 8.
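>
> (If in doubt, pg_controldata reports the block size - note it prints it
> in bytes, not KB:
>
> pg_controldata db_path | grep 'block size'
> )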
>
> So in my case the block count per chunk was: 1024 * 1024 / 8 = 131072.
>
> So the chunk number will be:
>
> blockId / blocks_per_chunk = 657780 / 131072 = 5
>
> So I should look in the file 'db_path/base/16389/20418.5'... but wait,
> the block id has to be made relative to the chunk file's start:
> chunk_block_id = block_id % blocks_per_chunk
>
> So the block id to use with pg_filedump is: 657780 % 131072 = 2420
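>
> (Or let the database do the arithmetic; the numbers below are the ones
> from my case, with blocks_per_chunk = 131072:
>
> select 657780 / 131072 as chunk, 657780 % 131072 as block_in_chunk;
>  chunk | block_in_chunk
> -------+----------------
>      5 |           2420
> (1 row)
> )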
>
> So my final command line was:
>
> pg_filedump -if -R 2420 db_path/base/16389/20418.5
>
> resulting in something like:
>
> [snip]
> <Data> ------
> Item 1 -- Length: 44 Offset: 8148 (0x1fd4) Flags: USED
> XMIN: 1048587 CMIN: 90130188 XMAX: 0 CMAX|XVAC: 1036845056
> Block Id: 655376 linp Index: 18451 Attributes: 1375 Size: 0
> infomask: 0x0200 (XMIN_INVALID)
> Error: Computed header length not equal to header size.
> Computed <28> Header: <0>
>
> 1fd4: 0b001000 0c475f05 00000000 0000cd3d .....G_........=
> 1fe4: 0a001000 13485f05 00020000 2b030000 .....H_.....+...
> 1ff4: 2d030000 00000000 01000000 -...........
>
> [snip]
>
> So I found my bad block, and the previous and following blocks looked OK.
>
> Now I want to fix just that one block even if the rows on it are lost,
> as the table data is not detail-critical (massive amounts of lost data
> would be critical, small loss is tolerable).
>
> Now to the actual fixing: from my searches, zeroing out the complete
> block and then reindexing the table seems to be the recommended
> solution if it is not possible to simply drop the table and restore
> from backup (in my case that is not possible - this error has been
> there since last October, all our backups have failed since then, and
> nobody checked the logs until now, when I tried to upgrade postgres
> via Slony and failed because of this error - rather telling for the
> importance of this DB).
>
> So, how to zero out the page?
>
> The following message shows a way to do it:
>
> http://www.webservertalk.com/archive142-2004-7-319751.html
>
> Basically set the 'zero_damaged_pages' setting to 'on':
>
> http://www.postgresql.org/docs/8.1/interactive/runtime-config-developer.html
>
> ... and select that page again. Unfortunately this did not work for
> me... it looks like when the page header is not corrupt and only
> individual tuples are damaged, 'zero_damaged_pages' doesn't kick in.
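>
> (For reference, what I tried boils down to something like this - superuser
> only, and dangerous, since any page with a damaged header is zeroed as
> soon as it is read:
>
> set zero_damaged_pages = on;
> select count(*) from table_name;
> set zero_damaged_pages = off;
> )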
>
> Next try:
>
> http://usr-share-man.org/man1/dd.html
>
> The start of my bad block is at: 8 * 2420 = 19360K, so I first dump
> the block for cross checking that I got the arguments right:
>
> dd ibs=1024 if=db_path/base/16389/20418.5 count=8 skip=19360 \
>    of=cross_check_file
>
> Then I used:
>
> pg_filedump -if -R 0 cross_check_file
>
> and diffed it against the original dump, to make absolutely sure it
> was the right block... call me paranoid, but I'm not a dd expert and I
> had to be sure. It matched.
>
> So now to the destructive part of the action:
>
> dd ibs=1024 obs=1024 if=/dev/zero of=db_path/base/16389/20418.5 \
>    count=8 seek=19360 conv=notrunc
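>
> (After the zeroing it is worth re-running pg_filedump on the same block
> to check that it really is all zeros now, and then reindexing the table,
> as the threads above recommend:
>
> pg_filedump -if -R 2420 db_path/base/16389/20418.5
> reindex table table_name;
> )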
>
> Ok, I admit that on the first try I didn't put the 'conv=notrunc'
> there... so despite all the former paranoia I still ended up with a
> truncated chunk. Luckily I remembered I have a WAL-log-based replica,
> so I recovered the rest of the truncated file from the replica's copy
> of the same file... this being an insert-only table, I guess I was
> lucky that this was an option. To my surprise, the same block on the
> replica was not mangled... I say to my surprise because on other
> occasions bad blocks have readily replicated over. In any case, if you
> have a WAL-logged replica you might be lucky and recover the corrupt
> block(s) from there (or just switch over, but that is risky too: you
> can't know for sure what state the replica is in, and that is actually
> harder to investigate than the master, as you can execute no SQL on
> the replica).
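>
> (If the replica's copy of the block turns out to be healthy, the same dd
> technique could copy it over the bad block instead of zeroing it -
> roughly, and untested; replica_path stands for the replica's data
> directory and good_block is just a temporary file:
>
> dd ibs=1024 if=replica_path/base/16389/20418.5 count=8 skip=19360 \
>    of=good_block
> dd ibs=1024 obs=1024 if=good_block of=db_path/base/16389/20418.5 \
>    count=8 seek=19360 conv=notrunc
> )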
>
> Anyway, that's it... it looks like the problem is gone, and the DB
> will be moved to another box running postgres 8.2.4 via Slony. I'm not
> sure how much of that table I mangled with my dd experiments; I hope
> others can learn from my mistakes...
>
> Cheers,
> Csaba.

Wow. Csaba, thanks for sharing. I'm definitely archiving this.

Erik Jones

Software Developer | Emma®
erik(at)myemma(dot)com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
