after using pg_resetxlog, db lost

Lists: pgsql-performance
From: "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: after using pg_resetxlog, db lost
Date: 2004-06-22 18:42:56
Message-ID: 644D07D3D59D8F408CD01AC2F833D8C62B9209@cisxa.cmc.int.ec.gc.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

The pg_resetxlog was run as root. It caused ownership problems of
pg_control and xlog files.
Now we have no access to the data now through psql. The data is still
there under /var/lib/pgsql/data/base/17347 (PWFPM_DEV DB name). But
there is no reference to 36 of our tables in pg_class. Also the 18
other tables that are reported in this database have no data in them.
Is there anyway to have the database resync or make it aware of the data
under /var/lib/pgsql/data/base/17347?
How can this problem be resolved?

There is actually 346 db files adding up to 134 GB in this database.

Below are error messages of when the database trying to be started. I
am not sure of the when pg_resetxlog was run. I suspect it was run to
get rid ot the "invalid primary checkpoint record".

The postgresql DB had an error trying to be started up.
The error was
Jun 22 13:17:53 murphy postgres[27430]: [4-1] LOG: invalid primary
checkpoint record
Jun 22 13:17:53 murphy postgres[27430]: [5-1] LOG: could not open file
"/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment 0):
No such file or directory
Jun 22 13:18:49 murphy postgres[28778]: [6-1] LOG: invalid secondary
checkpoint record
Jun 22 13:18:49 murphy postgres[28778]: [7-1] PANIC: could not locate a
valid checkpoint record

Jun 22 13:26:01 murphy postgres[30770]: [6-1] LOG: database system is
ready
Jun 22 13:26:02 murphy postgresql: Starting postgresql service:
succeeded
Jun 22 13:26:20 murphy postgres[30789]: [2-1] PANIC: could not access
status of transaction 553
Jun 22 13:26:20 murphy postgres[30789]: [2-2] DETAIL: could not open
file "/var/lib/pgsql/data/pg_clog/0000": No such file or directory
Jun 22 13:26:20 murphy postgres[30789]: [2-3] STATEMENT: COMMIT

and
Jun 22 13:26:20 murphy postgres[30791]: [10-1] LOG: redo starts at
0/2000050
Jun 22 13:26:20 murphy postgres[30791]: [11-1] LOG: file
"/var/lib/pgsql/data/pg_clog/0000" doesn't exist, reading as zeroes
Jun 22 13:26:20 murphy postgres[30791]: [12-1] LOG: record with zero
length at 0/2000E84
Jun 22 13:26:20 murphy postgres[30791]: [13-1] LOG: redo done at
0/2000E60
Jun 22 13:26:20 murphy postgres[30791]: [14-1] WARNING: xlog flush
request 213/7363F354 is not satisfied --- flushed only to 0/2000E84
Jun 22 13:26:20 murphy postgres[30791]: [14-2] CONTEXT: writing block
840074 of relation 17347/356768772
Jun 22 13:26:20 murphy postgres[30791]: [15-1] WARNING: xlog flush
request 213/58426648 is not satisfied --- flushed only to 0/2000E84

and
Jun 22 13:38:23 murphy postgres[1460]: [2-1] ERROR: xlog flush request
210/E757F150 is not satisfied --- flushed only to 0/2074CA0
Jun 22 13:38:23 murphy postgres[1460]: [2-2] CONTEXT: writing block
824605 of relation 17347/356768772

We are using a san for our storage device.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: after using pg_resetxlog, db lost
Date: 2004-06-22 19:36:03
Message-ID: 21115.1087932963@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca> writes:
> The pg_resetxlog was run as root. It caused ownership problems of
> pg_control and xlog files.
> Now we have no access to the data now through psql. The data is still
> there under /var/lib/pgsql/data/base/17347 (PWFPM_DEV DB name). But
> there is no reference to 36 of our tables in pg_class. Also the 18
> other tables that are reported in this database have no data in them.
> Is there anyway to have the database resync or make it aware of the data
> under /var/lib/pgsql/data/base/17347?
> How can this problem be resolved?

What this sounds like is that you reset the transaction counter along
with the xlog, so that those tables appear to have been created by
transactions "in the future". This could be repaired by doing
pg_resetxlog with a more appropriate initial transaction ID, but
figuring out what that value should be is not easy :-(

What I'd suggest is grabbing pg_filedump from
http://sources.redhat.com/rhdb/
and using it to look through pg_class (which will be file
$PGDATA/base/yourdbnumber/1259) to see the highest transaction ID
mentioned in any row of pg_class. Then pg_resetxlog with a value
a bit larger than that. Now you should be able to see all the rows
in pg_class ... but this doesn't get you out of the woods yet, unless
there are very-recently-created tables shown in pg_class. I'd suggest
next looking through whichever tables you know to be recently modified
to find the highest transaction ID mentioned in them, and finally doing
another pg_resetxlog with a value a few million greater than that. Then
you should be okay.

The reason you need to do this in two steps is that you'll need to look
at pg_class.relfilenode to get the file names of your recently-modified
tables. Do NOT modify the database in any way while you are running
with the intermediate transaction ID setting.

> Jun 22 13:38:23 murphy postgres[1460]: [2-1] ERROR: xlog flush request
> 210/E757F150 is not satisfied --- flushed only to 0/2074CA0

Looks like you also need a larger initial WAL offset in your
pg_resetxlog command. Unlike the case with transaction IDs, there's
no need to try to be somewhat accurate in the setting --- I'd just
use a number WAY beyond what you had, maybe like 10000/0.

Finally, the fact that all this happened suggests that you lost the
contents of pg_control (else pg_resetxlog would have picked up the right
values from it). Be very sure that you run pg_resetxlog under the same
locale settings (LC_COLLATE,LC_CTYPE) that you initially initdb'd with.
Otherwise you're likely to have nasty index-corruption problems later.

Good luck. Next time, don't let amateurs fool with pg_resetxlog (and
anyone who'd run it as root definitely doesn't know what they're doing).
It is a wizard's tool. Get knowledgeable advice from the PG lists
before you use it rather than after.

regards, tom lane


From: Richard Huxton <dev(at)archonet(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca>, pgsql-performance(at)postgresql(dot)org
Subject: Re: after using pg_resetxlog, db lost
Date: 2004-06-22 20:05:07
Message-ID: 40D890F3.8090208@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane wrote:
> "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca> writes:
>
>>The pg_resetxlog was run as root. It caused ownership problems of
>>pg_control and xlog files.
>>Now we have no access to the data now through psql. The data is still
>>there under /var/lib/pgsql/data/base/17347 (PWFPM_DEV DB name). But
>>there is no reference to 36 of our tables in pg_class. Also the 18
>>other tables that are reported in this database have no data in them.
>>Is there anyway to have the database resync or make it aware of the data
>>under /var/lib/pgsql/data/base/17347?
>>How can this problem be resolved?
>
>
> What this sounds like is that you reset the transaction counter along
> with the xlog, so that those tables appear to have been created by
> transactions "in the future". This could be repaired by doing
> pg_resetxlog with a more appropriate initial transaction ID, but
> figuring out what that value should be is not easy :-(

Tom - would there be any value in adding this to a pg_dump? I'm assuming
the numbers attached to tables etc are their OIDs anyway, so it might be
a useful reference in cases like this.

--
Richard Huxton
Archonet Ltd


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: "Shea,Dan [CIS]" <Dan(dot)Shea(at)ec(dot)gc(dot)ca>, pgsql-performance(at)postgresql(dot)org
Subject: Re: after using pg_resetxlog, db lost
Date: 2004-06-22 20:25:19
Message-ID: 21594.1087935919@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Richard Huxton <dev(at)archonet(dot)com> writes:
> Tom Lane wrote:
>> This could be repaired by doing
>> pg_resetxlog with a more appropriate initial transaction ID, but
>> figuring out what that value should be is not easy :-(

> Tom - would there be any value in adding this to a pg_dump?

Possibly. CVS tip pg_dump has been changed to not output OIDs by
default, as a result of gripes from people who wanted to be able to
"diff" dumps from different servers and not have the diff cluttered
by irrelevant OID differences. But a single header line showing
current XID and OID values doesn't seem like it would be a big problem.
We could put current timestamp there too, which was another recent topic
of discussion.

Bring it up on pghackers and see if anyone has an objection...

regards, tom lane