Lists: | pgsql-bugs |
---|
From: | hubert depesz lubaczewski <depesz(at)depesz(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | 9.1 doesn't start when died mid-backup |
Date: | 2011-04-13 07:49:24 |
Message-ID: | 20110413074924.GA11318@depesz.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
hi
got new 9.1, straight from git.
compiled, installed.
did initdb, and then set config values using this script:
perl -pi -e '
s/\A \s* (?: [#] \s* )? listen_addresses \s* = \s*.*/listen_addresses = \047*\047/x;
s/\A \s* (?: [#] \s* )? log_destination \s* = \s*.*/log_destination = \047stderr\047/x;
s/\A \s* (?: [#] \s* )? logging_collector \s* = \s*.*/logging_collector = on/x;
s/\A \s* (?: [#] \s* )? log_min_duration_statement \s* = \s*.*/log_min_duration_statement = 0/x;
s/\A \s* (?: [#] \s* )? log_line_prefix \s* = \s*.*/log_line_prefix = \047\%m \%u\(at)\%d \%p \%r \047/x;
s/\A \s* (?: [#] \s* )? log_temp_files \s* = \s*.*/log_temp_files = 0/x;
s/\A \s* (?: [#] \s* )? (log_checkpoints|log_connections|log_disconnections|log_lock_waits) \s* = \s*.*/$1 = on/x;
s/\A \s* (?: [#] \s* )? wal_level \s* = \s*.*/wal_level = hot_standby/x;
s/\A \s* (?: [#] \s* )? archive_mode \s* = \s*.*/archive_mode = on/x;
s/\A \s* (?: [#] \s* )? archive_command \s* = \s*.*/archive_command = \047\/bin\/true\047/x;
' data/postgresql.conf
afterwards I did start pg.
then, i connected with psql, and issued:
$ select pg_start_backup('xx');
pg_start_backup
-----------------
0/2000020
(1 row)
$ create database pgdba;
CREATE DATABASE
$ create user postgres with superuser;
CREATE ROLE
$ create user depesz with superuser;
CREATE ROLE
$ create database depesz with owner depesz;
CREATE DATABASE
Afterwards, I exited psql, and simulated power-failure, with:
=$ killall -9 postgres; killall -9 postmaster
postmaster: no process found
All pg processes were killed.
When I then tried to start Pg, I got:
=$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
server starting
but it didn't start.
log looks like this:
=$ cat postgresql-2011-04-13_093859.log
2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was interrupted; last known up at 2011-04-13 09:38:43 CEST
2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was not properly shut down; automatic recovery in progress
2011-04-13 09:38:59.852 CEST @ 3446 LOG: redo starts at 0/2000020
2011-04-13 09:39:00.304 CEST @ 3446 LOG: record with zero length at 0/20018E8
2011-04-13 09:39:00.304 CEST @ 3446 LOG: redo done at 0/2001898
2011-04-13 09:39:00.304 CEST @ 3446 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02
2011-04-13 09:39:00.304 CEST @ 3446 FATAL: WAL ends before end of online backup
2011-04-13 09:39:00.304 CEST @ 3446 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.
2011-04-13 09:39:00.304 CEST @ 3444 LOG: startup process (PID 3446) exited with exit code 1
2011-04-13 09:39:00.304 CEST @ 3444 LOG: aborting startup due to startup process failure
interestingl7y backup_label got renamed to backup_label.old, and
contains:
START WAL LOCATION: 0/2000020 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/2000058
START TIME: 2011-04-13 09:38:19 CEST
LABEL: xx
Given the file was renamed, I retried:
=$ pg_ctl start
server starting
but again - it didn't work.
Log looks like this:
2011-04-13 09:44:06.844 CEST @ 3783 LOG: database system was interrupted while in recovery at 2011-04-13 09:38:59 CEST
2011-04-13 09:44:06.844 CEST @ 3783 HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
2011-04-13 09:44:06.845 CEST @ 3783 LOG: database system was not properly shut down; automatic recovery in progress
2011-04-13 09:44:06.877 CEST @ 3783 LOG: redo starts at 0/2000020
2011-04-13 09:44:07.316 CEST @ 3783 LOG: record with zero length at 0/20018E8
2011-04-13 09:44:07.316 CEST @ 3783 LOG: redo done at 0/2001898
2011-04-13 09:44:07.316 CEST @ 3783 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02
2011-04-13 09:44:07.316 CEST @ 3783 FATAL: WAL ends before end of online backup
2011-04-13 09:44:07.316 CEST @ 3783 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.
2011-04-13 09:44:07.316 CEST @ 3776 LOG: startup process (PID 3783) exited with exit code 1
2011-04-13 09:44:07.316 CEST @ 3776 LOG: aborting startup due to startup process failure
So, it looks to me that by killing postgres mid-backup I made it end up in some state that doesn't let it start again.
I *think* it might be related to this:
http://archives.postgresql.org/pgsql-hackers/2011-03/msg01490.php and more
specifically, with this:
http://archives.postgresql.org/pgsql-hackers/2011-03/msg01627.php
Best regards,
depesz
--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | depesz(at)depesz(dot)com |
Cc: | pgsql-bugs(at)postgresql(dot)org, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Subject: | Re: 9.1 doesn't start when died mid-backup |
Date: | 2011-04-13 11:18:30 |
Message-ID: | 4DA58686.1050501@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
On 13.04.2011 10:49, hubert depesz lubaczewski wrote:
> hi
> got new 9.1, straight from git.
> compiled, installed.
> did initdb, and then set config values using this script:
> perl -pi -e '
> s/\A \s* (?: [#] \s* )? listen_addresses \s* = \s*.*/listen_addresses = \047*\047/x;
> s/\A \s* (?: [#] \s* )? log_destination \s* = \s*.*/log_destination = \047stderr\047/x;
> s/\A \s* (?: [#] \s* )? logging_collector \s* = \s*.*/logging_collector = on/x;
> s/\A \s* (?: [#] \s* )? log_min_duration_statement \s* = \s*.*/log_min_duration_statement = 0/x;
> s/\A \s* (?: [#] \s* )? log_line_prefix \s* = \s*.*/log_line_prefix = \047\%m \%u\(at)\%d \%p \%r \047/x;
> s/\A \s* (?: [#] \s* )? log_temp_files \s* = \s*.*/log_temp_files = 0/x;
> s/\A \s* (?: [#] \s* )? (log_checkpoints|log_connections|log_disconnections|log_lock_waits) \s* = \s*.*/$1 = on/x;
> s/\A \s* (?: [#] \s* )? wal_level \s* = \s*.*/wal_level = hot_standby/x;
> s/\A \s* (?: [#] \s* )? archive_mode \s* = \s*.*/archive_mode = on/x;
> s/\A \s* (?: [#] \s* )? archive_command \s* = \s*.*/archive_command = \047\/bin\/true\047/x;
> ' data/postgresql.conf
>
> afterwards I did start pg.
>
> then, i connected with psql, and issued:
> $ select pg_start_backup('xx');
> pg_start_backup
> -----------------
> 0/2000020
> (1 row)
>
> $ create database pgdba;
> CREATE DATABASE
>
> $ create user postgres with superuser;
> CREATE ROLE
>
> $ create user depesz with superuser;
> CREATE ROLE
>
> $ create database depesz with owner depesz;
> CREATE DATABASE
>
> Afterwards, I exited psql, and simulated power-failure, with:
>
> =$ killall -9 postgres; killall -9 postmaster
> postmaster: no process found
>
> All pg processes were killed.
>
> When I then tried to start Pg, I got:
> =$ pg_ctl start
> pg_ctl: another server might be running; trying to start server anyway
> server starting
>
> but it didn't start.
>
> log looks like this:
> =$ cat postgresql-2011-04-13_093859.log
> 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was interrupted; last known up at 2011-04-13 09:38:43 CEST
> 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was not properly shut down; automatic recovery in progress
> 2011-04-13 09:38:59.852 CEST @ 3446 LOG: redo starts at 0/2000020
> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: record with zero length at 0/20018E8
> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: redo done at 0/2001898
> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02
> 2011-04-13 09:39:00.304 CEST @ 3446 FATAL: WAL ends before end of online backup
> 2011-04-13 09:39:00.304 CEST @ 3446 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.
> 2011-04-13 09:39:00.304 CEST @ 3444 LOG: startup process (PID 3446) exited with exit code 1
> 2011-04-13 09:39:00.304 CEST @ 3444 LOG: aborting startup due to startup process failure
>
> interestingl7y backup_label got renamed to backup_label.old, and
> contains:
> START WAL LOCATION: 0/2000020 (file 000000010000000000000002)
> CHECKPOINT LOCATION: 0/2000058
> START TIME: 2011-04-13 09:38:19 CEST
> LABEL: xx
>
> Given the file was renamed, I retried:
>
> =$ pg_ctl start
> server starting
>
> but again - it didn't work.
>
> Log looks like this:
>
> 2011-04-13 09:44:06.844 CEST @ 3783 LOG: database system was interrupted while in recovery at 2011-04-13 09:38:59 CEST
> 2011-04-13 09:44:06.844 CEST @ 3783 HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
> 2011-04-13 09:44:06.845 CEST @ 3783 LOG: database system was not properly shut down; automatic recovery in progress
> 2011-04-13 09:44:06.877 CEST @ 3783 LOG: redo starts at 0/2000020
> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: record with zero length at 0/20018E8
> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: redo done at 0/2001898
> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02
> 2011-04-13 09:44:07.316 CEST @ 3783 FATAL: WAL ends before end of online backup
> 2011-04-13 09:44:07.316 CEST @ 3783 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(), and all WAL up to that point must be available at recovery.
> 2011-04-13 09:44:07.316 CEST @ 3776 LOG: startup process (PID 3783) exited with exit code 1
> 2011-04-13 09:44:07.316 CEST @ 3776 LOG: aborting startup due to startup process failure
>
>
> So, it looks to me that by killing postgres mid-backup I made it end up in some state that doesn't let it start again.
>
> I *think* it might be related to this:
> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01490.php and more
> specifically, with this:
> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01627.php
Yep :-(. I'm thinking we need to revert that change, now that we know
there was a reason for only checking the end-of-WAL when doing archive
recovery after all. We can keep it as a warning, though. With a hint
explaining that if you're doing crash recovery from a database that
crashed while an online backup was in progress, it's ok, but if you're
recovering from an online backup, the backup is broken.
For backups taken with pg_basebackup in 9.1, we can add a flag to the
backup_label, indicating that the backup was taken with pg_basebackup.
For such backups, the above scenario really should not happen, and we
can still make it a hard error if it does.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | depesz(at)depesz(dot)com |
Cc: | pgsql-bugs(at)postgresql(dot)org, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Subject: | Re: 9.1 doesn't start when died mid-backup |
Date: | 2011-04-13 19:10:44 |
Message-ID: | 4DA5F534.2060206@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
On 13.04.2011 14:18, Heikki Linnakangas wrote:
> On 13.04.2011 10:49, hubert depesz lubaczewski wrote:
>> hi
>> got new 9.1, straight from git.
>> compiled, installed.
>> did initdb, and then set config values using this script:
>> perl -pi -e '
>> s/\A \s* (?: [#] \s* )? listen_addresses \s* = \s*.*/listen_addresses
>> = \047*\047/x;
>> s/\A \s* (?: [#] \s* )? log_destination \s* = \s*.*/log_destination =
>> \047stderr\047/x;
>> s/\A \s* (?: [#] \s* )? logging_collector \s* =
>> \s*.*/logging_collector = on/x;
>> s/\A \s* (?: [#] \s* )? log_min_duration_statement \s* =
>> \s*.*/log_min_duration_statement = 0/x;
>> s/\A \s* (?: [#] \s* )? log_line_prefix \s* = \s*.*/log_line_prefix =
>> \047\%m \%u\(at)\%d \%p \%r \047/x;
>> s/\A \s* (?: [#] \s* )? log_temp_files \s* = \s*.*/log_temp_files = 0/x;
>> s/\A \s* (?: [#] \s* )?
>> (log_checkpoints|log_connections|log_disconnections|log_lock_waits)
>> \s* = \s*.*/$1 = on/x;
>> s/\A \s* (?: [#] \s* )? wal_level \s* = \s*.*/wal_level = hot_standby/x;
>> s/\A \s* (?: [#] \s* )? archive_mode \s* = \s*.*/archive_mode = on/x;
>> s/\A \s* (?: [#] \s* )? archive_command \s* = \s*.*/archive_command =
>> \047\/bin\/true\047/x;
>> ' data/postgresql.conf
>>
>> afterwards I did start pg.
>>
>> then, i connected with psql, and issued:
>> $ select pg_start_backup('xx');
>> pg_start_backup
>> -----------------
>> 0/2000020
>> (1 row)
>>
>> $ create database pgdba;
>> CREATE DATABASE
>>
>> $ create user postgres with superuser;
>> CREATE ROLE
>>
>> $ create user depesz with superuser;
>> CREATE ROLE
>>
>> $ create database depesz with owner depesz;
>> CREATE DATABASE
>>
>> Afterwards, I exited psql, and simulated power-failure, with:
>>
>> =$ killall -9 postgres; killall -9 postmaster
>> postmaster: no process found
>>
>> All pg processes were killed.
>>
>> When I then tried to start Pg, I got:
>> =$ pg_ctl start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>>
>> but it didn't start.
>>
>> log looks like this:
>> =$ cat postgresql-2011-04-13_093859.log
>> 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was
>> interrupted; last known up at 2011-04-13 09:38:43 CEST
>> 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was not
>> properly shut down; automatic recovery in progress
>> 2011-04-13 09:38:59.852 CEST @ 3446 LOG: redo starts at 0/2000020
>> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: record with zero length at
>> 0/20018E8
>> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: redo done at 0/2001898
>> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: last completed transaction
>> was at log time 2011-04-13 09:38:43.563356+02
>> 2011-04-13 09:39:00.304 CEST @ 3446 FATAL: WAL ends before end of
>> online backup
>> 2011-04-13 09:39:00.304 CEST @ 3446 HINT: Online backup started with
>> pg_start_backup() must be ended with pg_stop_backup(), and all WAL up
>> to that point must be available at recovery.
>> 2011-04-13 09:39:00.304 CEST @ 3444 LOG: startup process (PID 3446)
>> exited with exit code 1
>> 2011-04-13 09:39:00.304 CEST @ 3444 LOG: aborting startup due to
>> startup process failure
>>
>> interestingl7y backup_label got renamed to backup_label.old, and
>> contains:
>> START WAL LOCATION: 0/2000020 (file 000000010000000000000002)
>> CHECKPOINT LOCATION: 0/2000058
>> START TIME: 2011-04-13 09:38:19 CEST
>> LABEL: xx
>>
>> Given the file was renamed, I retried:
>>
>> =$ pg_ctl start
>> server starting
>>
>> but again - it didn't work.
>>
>> Log looks like this:
>>
>> 2011-04-13 09:44:06.844 CEST @ 3783 LOG: database system was
>> interrupted while in recovery at 2011-04-13 09:38:59 CEST
>> 2011-04-13 09:44:06.844 CEST @ 3783 HINT: This probably means that
>> some data is corrupted and you will have to use the last backup for
>> recovery.
>> 2011-04-13 09:44:06.845 CEST @ 3783 LOG: database system was not
>> properly shut down; automatic recovery in progress
>> 2011-04-13 09:44:06.877 CEST @ 3783 LOG: redo starts at 0/2000020
>> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: record with zero length at
>> 0/20018E8
>> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: redo done at 0/2001898
>> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: last completed transaction
>> was at log time 2011-04-13 09:38:43.563356+02
>> 2011-04-13 09:44:07.316 CEST @ 3783 FATAL: WAL ends before end of
>> online backup
>> 2011-04-13 09:44:07.316 CEST @ 3783 HINT: Online backup started with
>> pg_start_backup() must be ended with pg_stop_backup(), and all WAL up
>> to that point must be available at recovery.
>> 2011-04-13 09:44:07.316 CEST @ 3776 LOG: startup process (PID 3783)
>> exited with exit code 1
>> 2011-04-13 09:44:07.316 CEST @ 3776 LOG: aborting startup due to
>> startup process failure
>>
>>
>> So, it looks to me that by killing postgres mid-backup I made it end
>> up in some state that doesn't let it start again.
>>
>> I *think* it might be related to this:
>> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01490.php and
>> more
>> specifically, with this:
>> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01627.php
>
> Yep :-(. I'm thinking we need to revert that change, now that we know
> there was a reason for only checking the end-of-WAL when doing archive
> recovery after all. We can keep it as a warning, though. With a hint
> explaining that if you're doing crash recovery from a database that
> crashed while an online backup was in progress, it's ok, but if you're
> recovering from an online backup, the backup is broken.
Committed a quick revertion of this for now.
> For backups taken with pg_basebackup in 9.1, we can add a flag to the
> backup_label, indicating that the backup was taken with pg_basebackup.
> For such backups, the above scenario really should not happen, and we
> can still make it a hard error if it does.
Added this to the open items list.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
Cc: | depesz(at)depesz(dot)com, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: 9.1 doesn't start when died mid-backup |
Date: | 2011-04-14 05:25:02 |
Message-ID: | BANLkTikVrTUL7aB=jWZRd6VFqsDD=WTJ_Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs |
On Wed, Apr 13, 2011 at 8:18 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Yep :-(. I'm thinking we need to revert that change, now that we know there
> was a reason for only checking the end-of-WAL when doing archive recovery
> after all. We can keep it as a warning, though. With a hint explaining that
> if you're doing crash recovery from a database that crashed while an online
> backup was in progress, it's ok, but if you're recovering from an online
> backup, the backup is broken.
Yeah, we cannot distinguish those two cases. I agree to emit such a warning.
> For backups taken with pg_basebackup in 9.1, we can add a flag to the
> backup_label, indicating that the backup was taken with pg_basebackup. For
> such backups, the above scenario really should not happen, and we can still
> make it a hard error if it does.
Agreed.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center