Re: Bad records in table

Lists: pgsql-general
From: "Peter Darley" <pdarley(at)kinesis-cem(dot)com>
To: "Pgsql-General" <pgsql-general(at)postgresql(dot)org>
Subject: Bad records in table
Date: 2002-07-02 15:15:01
Message-ID: NNEAICKPNOGDBHNCEDCPCEKPCKAA.pdarley@kinesis-cem.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Friends,
I have a table with bad data in it. Whenever the backend hits the tupples
with bad data I get:

Jul 2 07:44:18 db postgres[2652]: [22] DEBUG: server process (pid 2869)
was terminated by signal 11
Jul 2 07:44:18 db postgres[2652]: [23] DEBUG: terminating any other active
server processes
Jul 2 07:44:18 db postgres[2652]: [24] DEBUG: all server processes
terminated; reinitializing shared memory and semaphores
Jul 2 07:44:18 db postgres[2871]: [25] FATAL 1: The database system is
starting up
Jul 2 07:44:18 db postgres[2870]: [25] DEBUG: database system was
interrupted at 2002-07-02 07:42:30 PDT
Jul 2 07:44:18 db postgres[2870]: [26] DEBUG: checkpoint record is at
1/94E54768
Jul 2 07:44:18 db postgres[2870]: [27] DEBUG: redo record is at
1/94E54768; undo record is at 0/0; shutdown TRUE
Jul 2 07:44:18 db postgres[2870]: [28] DEBUG: next transaction id:
55508992; next oid: 13249044
Jul 2 07:44:18 db postgres[2870]: [29] DEBUG: database system was not
properly shut down; automatic recovery in progress
Jul 2 07:44:18 db postgres[2870]: [30] DEBUG: redo starts at 1/94E547A8
Jul 2 07:44:18 db postgres[2870]: [31] DEBUG: ReadRecord: record with zero
length at 1/94E547CC
Jul 2 07:44:18 db postgres[2870]: [32] DEBUG: redo done at 1/94E547A8
Jul 2 07:44:20 db postgres[2870]: [33] DEBUG: database system is ready

I can identify the bad records with a simple perl script that reads all the
records sequentially and lets me know which ones crash the back end. Now
that I know which ones they are, how can I get rid of them?

Once I get this all fixed, what kinds of tests should I do on the machine
to see if it's got some kind of hardware problem? It's RedHat Linux
2.4.9-31smp.

Thanks for your help!
Peter Darley


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Peter Darley" <pdarley(at)kinesis-cem(dot)com>
Cc: "Pgsql-General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Bad records in table
Date: 2002-07-02 19:00:07
Message-ID: 18430.1025636407@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Peter Darley" <pdarley(at)kinesis-cem(dot)com> writes:
> I can identify the bad records with a simple perl script that reads all the
> records sequentially and lets me know which ones crash the back end. Now
> that I know which ones they are, how can I get rid of them?

DELETE?

That might not work very well if the corruption is in toasted fields,
but you haven't given any details that would let us know (not even such
basic info as your PG version).

But before doing that I'd suggest eyeballing the bad data using a tool
such as pg_filedump (see http://sources.redhat.com/rhdb/tools.html).
Look at the "ctid" column of the bad tuples to see what part of the
file you need to dump (ctid is block number and item number). The
pattern of the corruption might give some clue what happened.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Peter Darley <pdarley(at)kinesis-cem(dot)com>
Cc: Pgsql-General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Bad records in table
Date: 2002-07-03 00:58:10
Message-ID: 20020703105810.A7153@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Jul 02, 2002 at 08:15:01AM -0700, Peter Darley wrote:
> I can identify the bad records with a simple perl script that reads all the
> records sequentially and lets me know which ones crash the back end. Now
> that I know which ones they are, how can I get rid of them?

If you know which tuple it is that kills it, you can delete it. That should
be sufficient. If you know the ctid or oid, they're good keys for this sort
of thing.

> Once I get this all fixed, what kinds of tests should I do on the machine
> to see if it's got some kind of hardware problem? It's RedHat Linux
> 2.4.9-31smp.

I strongly recommend memtest86[1]. It runs from a floppy and is very good at
picking up memory errors. After that check out your hard disk with whatever
program your HDD manufacturer provides. But memory is the most likely.

Just in case, open the case and check for dust buildup and loose cables.

[1] http://www.memtest86.com/

HTH,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.


From: "Peter Darley" <pdarley(at)kinesis-cem(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pgsql-General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Bad records in table
Date: 2002-07-03 20:58:16
Message-ID: NNEAICKPNOGDBHNCEDCPCEMFCKAA.pdarley@kinesis-cem.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tom,
Sorry about that. It's 7.2.1 on RedHat 7.2.
I wasn't able to delete; whenever the database touched those records (even
to do a delete) the backend would crash.
It turns out that my problems were much deeper than I thought. There was a
bad dimm in the machine wreaking havoc.
Thanks for the pointer to pg_filedump.
Thanks,
Peter Darley

-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org
[mailto:pgsql-general-owner(at)postgresql(dot)org]On Behalf Of Tom Lane
Sent: Tuesday, July 02, 2002 12:00 PM
To: Peter Darley
Cc: Pgsql-General
Subject: Re: [GENERAL] Bad records in table

"Peter Darley" <pdarley(at)kinesis-cem(dot)com> writes:
> I can identify the bad records with a simple perl script that reads all
the
> records sequentially and lets me know which ones crash the back end. Now
> that I know which ones they are, how can I get rid of them?

DELETE?

That might not work very well if the corruption is in toasted fields,
but you haven't given any details that would let us know (not even such
basic info as your PG version).

But before doing that I'd suggest eyeballing the bad data using a tool
such as pg_filedump (see http://sources.redhat.com/rhdb/tools.html).
Look at the "ctid" column of the bad tuples to see what part of the
file you need to dump (ctid is block number and item number). The
pattern of the corruption might give some clue what happened.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
message can get through to the mailing list cleanly


From: "Peter Darley" <pdarley(at)kinesis-cem(dot)com>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: "Pgsql-General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Bad records in table
Date: 2002-07-03 21:33:56
Message-ID: NNEAICKPNOGDBHNCEDCPKEMFCKAA.pdarley@kinesis-cem.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Martijn,
What I didn't think to mention in my original post is that I was unable to
delete the rows because the backend would crash.
It turned that the table was way beyond help (at least by someone at my
skill level), because there was some bad memory in the machine. Thanks for
the pointer to memtest86. It did the trick.
Thanks,
Peter Darley

-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org
[mailto:pgsql-general-owner(at)postgresql(dot)org]On Behalf Of Martijn van
Oosterhout
Sent: Tuesday, July 02, 2002 5:58 PM
To: Peter Darley
Cc: Pgsql-General
Subject: Re: [GENERAL] Bad records in table

On Tue, Jul 02, 2002 at 08:15:01AM -0700, Peter Darley wrote:
> I can identify the bad records with a simple perl script that reads all
the
> records sequentially and lets me know which ones crash the back end. Now
> that I know which ones they are, how can I get rid of them?

If you know which tuple it is that kills it, you can delete it. That should
be sufficient. If you know the ctid or oid, they're good keys for this sort
of thing.

> Once I get this all fixed, what kinds of tests should I do on the machine
> to see if it's got some kind of hardware problem? It's RedHat Linux
> 2.4.9-31smp.

I strongly recommend memtest86[1]. It runs from a floppy and is very good at
picking up memory errors. After that check out your hard disk with whatever
program your HDD manufacturer provides. But memory is the most likely.

Just in case, open the case and check for dust buildup and loose cables.

[1] http://www.memtest86.com/

HTH,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)