Re: could not access status of transaction

Lists: pgsql-general
From: Tom Duffey <tduffey(at)techbydesign(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: could not access status of transaction
Date: 2009-03-25 17:22:51
Message-ID: BF7973E2-F1CB-4E11-8076-D798CAA2C611@techbydesign.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi All,

One of our databases suffered a problem yesterday during a normal
update, something we have been doing for years. Near the end of the
process a foreign key constraint is rebuilt on a table containing
several hundred million rows. Rebuilding the constraint failed with
the following message:

ERROR: could not access status of transaction 4294918145
DETAIL: Could not open file "pg_clog/0FFF": No such file or directory.

Here's the table and constraint definitions:

CREATE TABLE point_history (
point_id integer NOT NULL,
value real NOT NULL,
status integer NOT NULL,
"timestamp" timestamp without time zone NOT NULL
);

ALTER TABLE point_history ADD CONSTRAINT point_history_pkey PRIMARY
KEY (point_id, "timestamp");
ALTER TABLE point_history ADD CONSTRAINT "$1" FOREIGN KEY (point_id)
REFERENCES point(id);

I read about this and and created the pg_clog/0FFF file, filling it
with 256K of zeroes and then vacuumed the database. Then I tried
rebuilding the constraint and received a foreign key violation:

DETAIL: Key (point_id)=(2) is not present in table "point".

The crappy thing about this is that there was no record in the
point_history table with point_id = 2:

db=> select * from point_history where point_id = 2;
point_id | value | status | timestamp
----------+-------+--------+-----------
(0 rows)

I scratched my head for a while and decided to reload the database
from a backup, which I'm still working on now. I'm wondering if
anyone has any thoughts or ideas about this? I found references to
similar problems but they were all for older versions of PostgreSQL.
When the problem occurred we were running 8.3.6 and are now running
8.3.7.

Tom

--
Tom Duffey <tduffey(at)techbydesign(dot)com>
Technology by Design :: http://techbydesign.com/
p: 414.431.0800


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tom Duffey <tduffey(at)techbydesign(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: could not access status of transaction
Date: 2009-03-26 02:02:21
Message-ID: 11462.1238032941@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tom Duffey <tduffey(at)techbydesign(dot)com> writes:
> One of our databases suffered a problem yesterday during a normal
> update, something we have been doing for years. Near the end of the
> process a foreign key constraint is rebuilt on a table containing
> several hundred million rows. Rebuilding the constraint failed with
> the following message:

> ERROR: could not access status of transaction 4294918145
> DETAIL: Could not open file "pg_clog/0FFF": No such file or directory.

This looks like a garden-variety data corruption problem to me.
Trashed rows tend to yield this type of error because the "xmin"
transaction ID is the first field that the server can check with
any amount of finesse. 4294918145 is FFFF4001 in hex, saith my
calculator, so it looks like a bunch of bits went to ones --- or
perhaps more likely, the row offset in the page header got clobbered
and we're looking at some bytes that never were a transaction ID
at all.

So I'd try looking around for flaky RAM, failing disks, loose cables,
that sort of thing ...

regards, tom lane


From: Tom Duffey <tduffey(at)techbydesign(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: could not access status of transaction
Date: 2009-03-26 04:05:03
Message-ID: 7FEB4211-4BE1-4DFA-8E77-64376BE8E1A5@techbydesign.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hi Tom,

On Mar 25, 2009, at 9:02 PM, Tom Lane wrote:

> Tom Duffey <tduffey(at)techbydesign(dot)com> writes:
>> One of our databases suffered a problem yesterday during a normal
>> update, something we have been doing for years. Near the end of the
>> process a foreign key constraint is rebuilt on a table containing
>> several hundred million rows. Rebuilding the constraint failed with
>> the following message:
>
>> ERROR: could not access status of transaction 4294918145
>> DETAIL: Could not open file "pg_clog/0FFF": No such file or
>> directory.
>
> This looks like a garden-variety data corruption problem to me.
> Trashed rows tend to yield this type of error because the "xmin"
> transaction ID is the first field that the server can check with
> any amount of finesse. 4294918145 is FFFF4001 in hex, saith my
> calculator, so it looks like a bunch of bits went to ones --- or
> perhaps more likely, the row offset in the page header got clobbered
> and we're looking at some bytes that never were a transaction ID
> at all.
>
> So I'd try looking around for flaky RAM, failing disks, loose cables,
> that sort of thing ...

Are you aware of any issues like this related to VMWare ESX? Our
PostgreSQL server is running in such an environment and I asked the
guys to review your email and they thought maybe this type of
corruption could happen when the virtual machine was moved from one
physical server to another, which we have done once or twice in the
past few months.

Tom

--
Tom Duffey <tduffey(at)techbydesign(dot)com>
Technology by Design :: http://techbydesign.com/
p: 414.431.0800