postmaster fails to start

Lists: pgsql-general
From: "Dweck Nir" <Nir(dot)Dweck(at)tadirantele(dot)com>
To: "postgreSQL mailing list (E-mail)" <pgsql-general(at)postgresql(dot)org>
Subject: postmaster fails to start
Date: 2005-05-25 08:08:54
Message-ID: 68382F2B929CEB4FAE828088C1BF067213950D@tbs-ex1.tadirantele.com

Hi,
I need urgent help.
I am using PostgreSQL version 8.0.1.
postmaster fails to start and the log file looks as follows:

LOG: database system was shut down at 2005-05-24 15:50:46 MSD
LOG: checkpoint record is at 1/8D117BE4
LOG: redo record is at 1/8D117BE4; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 3859443; next OID: 1904360
LOG: database system is ready
LOG: could not send data to client: Broken pipe
LOG: received smart shutdown request
LOG: checkpoints are occurring too frequently (18 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: database system was interrupted at 2005-05-24 16:07:50 MSD
LOG: checkpoint record is at 1/A50109AC
LOG: redo record is at 1/A500075C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 3859613; next OID: 1904360
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 1/A500075C
PANIC: btree_delete_page_redo: lost target page
LOG: startup process (PID 4409) was terminated by signal 6
LOG: aborting startup due to startup process failure
LOG: logger shutting down
LOG: database system was interrupted while in recovery at 2005-05-24 16:11:14 MSD
HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG: checkpoint record is at 1/A50109AC
LOG: redo record is at 1/A500075C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 3859613; next OID: 1904360
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 1/A500075C
PANIC: btree_delete_page_redo: lost target page
LOG: startup process (PID 4417) was terminated by signal 6
LOG: aborting startup due to startup process failure
LOG: logger shutting down

The sequence of events was as follows:
1) The computer was shut down without stopping the postmaster.
2) The postmaster was started, but it reported an error saying that another postmaster might be running, so the postmaster was started again.
3) since then each time I try to start the postmaster I get the same error.

To start the postmaster I use "pg_ctl start".
I am running Red Hat Linux 9, kernel 2.4.20-8.

Regards,

Nir Dweck
Computer engineer
Tadiran Telecom
18 Hasivim Street,
P.O.Box 7607
Petach-Tikva 49170 Israel
Tel: 972-3-9262807
Fax: 972-3-9262755
<mailto:Nir(dot)dweck(at)tadirantele(dot)com>


From: Richard Huxton <dev(at)archonet(dot)com>
To: Dweck Nir <Nir(dot)Dweck(at)tadirantele(dot)com>
Cc: "postgreSQL mailing list (E-mail)" <pgsql-general(at)postgresql(dot)org>
Subject: Re: postmaster fails to start
Date: 2005-05-25 08:50:36
Message-ID: 42943C5C.2040009@archonet.com

I've taken the liberty of rearranging your email slightly.

Dweck Nir wrote:
> The sequence of events was as follows: 1) computer was shut down
> without stopping postmaster.

OK - not good. Some crucial questions:
1. Do you have fsync enabled or disabled in the postgresql.conf file?
2. Do you know whether your drives are flushing write-cache properly?

> 2) postmaster was started, but because of an error that there might
> be another postmaster running, the postmaster was started again.

Was this just a matter of deleting the .pid file? And did you check that
there wasn't another postmaster running?
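
One way to make that check before removing a lock file, as a minimal sketch: it assumes the lock file is postmaster.pid inside the data directory with the server PID on its first line, and that the check runs on Linux. The function name is made up for illustration. A live PID is not proof that the process is a postmaster (PIDs get reused), and a None result is not proof that nothing is running, so confirm with ps as well.

```python
import os
from pathlib import Path


def live_pid_from_lock_file(data_dir):
    """Return the PID recorded in postmaster.pid if that process exists, else None."""
    pid_file = Path(data_dir) / "postmaster.pid"
    try:
        first_line = pid_file.read_text().splitlines()[0]
        pid = int(first_line.strip())
        if pid <= 0:
            raise ValueError("not a positive PID")
    except (FileNotFoundError, IndexError, ValueError):
        return None  # no lock file, or not in the format assumed here
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return None  # stale lock file: no process with that PID
    except PermissionError:
        return pid  # the process exists but belongs to another user
    return pid
```

If this returns a PID, look at that process before deleting anything; if it returns None, the lock file is probably stale.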

> 3) since then each time I try to start the postmaster I get the same
> error.

> LOG: redo starts at 1/A500075C
> PANIC: btree_delete_page_redo: lost target page
> LOG: startup process (PID 4409) was terminated by signal 6

OK - well, this error message is in backend/access/nbtree/nbtxlog.c
where it is replaying the write-ahead-log files for btrees (I'm no
hacker, I just searched the source for the error message and read the
comments).

So - it looks like you might have a corrupted WAL. That shouldn't be
possible if you were running with fsync enabled and drives that flushed
cache like they should, so I'm guessing that wasn't the case.

It might be possible to recover to a state before this point, but that's
not something I'm going to be able to advise on. There are two steps you
should take immediately though.

1. Take a file-backup of your entire data directory and keep it safe.
You might well be making repeated attempts to recover this.
2. Check your most recent database backup and restore it to another
machine - it may be quicker to restore that than fix your file corruption.
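
Step 1 could be done along these lines. This is only a sketch: the function name and the timestamped directory naming are illustrative, it assumes the postmaster is not running while the copy is made, and it assumes the destination has enough free space. shutil.copytree keeps permissions and timestamps but not file ownership, so run it as the postgres user or fix ownership afterwards; tar or cp -a would do the same job.

```python
import shutil
import time
from pathlib import Path


def snapshot_data_dir(data_dir, backup_root):
    """Copy the whole data directory into a new timestamped directory."""
    src = Path(data_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # symlinks=True copies symbolic links as links instead of following them.
    # copytree raises FileExistsError rather than overwrite an existing copy.
    shutil.copytree(src, dest, symlinks=True)
    return dest
```

Every recovery attempt should then work on a copy, never on the only surviving set of files.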

--
Richard Huxton
Archonet Ltd


From: Richard Huxton <dev(at)archonet(dot)com>
To: Dweck Nir <Nir(dot)Dweck(at)tadirantele(dot)com>
Cc: "postgreSQL mailing list (E-mail)" <pgsql-general(at)postgresql(dot)org>
Subject: Re: postmaster fails to start
Date: 2005-05-25 09:06:33
Message-ID: 42944019.7080402@archonet.com

Dweck Nir wrote:
> Hi,
> I need urgent help.
> I am using PostgreSQL version 8.0.1.
> postmaster fails to start and the log file looks as follows:
>
> LOG: database system was shut down at 2005-05-24 15:50:46 MSD
> LOG: checkpoint record is at 1/8D117BE4
> LOG: redo record is at 1/8D117BE4; undo record is at 0/0; shutdown TRUE
> LOG: next transaction ID: 3859443; next OID: 1904360

Actually, it might be possible to use the PITR system to restore up to
just before the error (the transaction-id above might be a good start
point).

You'll want to move your WAL files to a different directory so it looks
like they've been copied from another machine. See this section of the
manuals for details of how to set up the recovery. Take your time
reading it thoroughly.
http://www.postgresql.org/docs/8.0/static/backup-online.html

IMPORTANT - make sure you have a backup copy of the entire data
directory before trying this.

A warning - I've not tried this particular idea out, but as long as you
can partially replay the first WAL file, I don't see why it shouldn't work.
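
To pick out the transaction IDs mentioned above without reading the log by eye, something like the following sketch could help. It assumes the log lines have exactly the form shown in the original post; the function name is made up for illustration, and whether any of these values is a usable recovery target is just as untested as the rest of this idea.

```python
import re

# Matches lines such as: "LOG: next transaction ID: 3859443; next OID: 1904360"
NEXT_XID = re.compile(r"next transaction ID: (\d+); next OID: (\d+)")


def next_transaction_ids(log_text):
    """Return the distinct 'next transaction ID' values, in order of first appearance."""
    seen = []
    for match in NEXT_XID.finditer(log_text):
        xid = int(match.group(1))
        if xid not in seen:
            seen.append(xid)
    return seen


# Lines taken from the log in the original post.
sample = """\
LOG: checkpoint record is at 1/8D117BE4
LOG: next transaction ID: 3859443; next OID: 1904360
LOG: checkpoint record is at 1/A50109AC
LOG: next transaction ID: 3859613; next OID: 1904360
LOG: next transaction ID: 3859613; next OID: 1904360
"""
print(next_transaction_ids(sample))  # prints [3859443, 3859613]
```

The first value comes from the last clean shutdown, the second from the checkpoint that recovery keeps restarting from.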

--
Richard Huxton
Archonet Ltd


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Dweck Nir" <Nir(dot)Dweck(at)tadirantele(dot)com>
Cc: "postgreSQL mailing list (E-mail)" <pgsql-general(at)postgresql(dot)org>
Subject: Re: postmaster fails to start
Date: 2005-05-25 14:06:40
Message-ID: 11750.1117030000@sss.pgh.pa.us

"Dweck Nir" <Nir(dot)Dweck(at)tadirantele(dot)com> writes:
> LOG: database system was not properly shut down; automatic recovery in progress
> LOG: redo starts at 1/A500075C
> PANIC: btree_delete_page_redo: lost target page

This seems closely related to the problem discussed in this recent
thread:
http://archives.postgresql.org/pgsql-admin/2005-04/msg00008.php

regards, tom lane