Re: 9.4 pg_control corruption

Lists: pgsql-hackers
From: Steve Singer <steve(at)ssinger(dot)info>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 9.4 pg_control corruption
Date: 2014-07-09 01:41:00
Message-ID: BLU436-SMTP2539162CBA275312AE14DFADC0F0@phx.gbl

I've encountered a corrupt pg_control file on my 9.4 development
cluster. I've mostly been using the cluster for changeset extraction /
slony testing.

This is a 9.4 (currently commit 6ad903d70a440e + a walsender change
discussed in another thread) but would have had the initdb done with an
earlier 9.4 snapshot.

/usr/local/pgsql94wal/bin$ ./pg_controldata ../data
WARNING: Calculated CRC checksum does not match value stored in file.
Either the file is corrupt, or it has a different layout than this program
is expecting. The results below are untrustworthy.

pg_control version number: 937
Catalog version number: 201405111
Database system identifier: 6014096177254975326
Database cluster state: in production
pg_control last modified: Tue 08 Jul 2014 06:15:58 PM EDT
Latest checkpoint location: 5/44DC5FC8
Prior checkpoint location: 5/44C58B88
Latest checkpoint's REDO location: 5/44DC5FC8
Latest checkpoint's REDO WAL file: 000000010000000500000044
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/1558590
Latest checkpoint's NextOID: 505898
Latest checkpoint's NextMultiXactId: 3285
Latest checkpoint's NextMultiOffset: 6569
Latest checkpoint's oldestXID: 1281
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Tue 08 Jul 2014 06:15:23 PM EDT
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: logical
Current wal_log_hints setting: off
Current max_connections setting: 200
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 65793
Date/time type storage: floating-point numbers
Float4 argument passing: by reference
Float8 argument passing: by reference
Data page checksum version: 2602751502
ssinger(at)ssinger-laptop:/usr/local/pgsql94wal/bin$

Before this, postgres crashed and seemed to have problems recovering. I
might have hit CTRL-C, but I didn't do anything drastic like issue a kill -9.

test1 [unknown] 2014-07-08 18:15:18.986 EDTFATAL: the database system
is in recovery mode
test1 [unknown] 2014-07-08 18:15:20.482 EDTWARNING: terminating
connection because of crash of another server process
test1 [unknown] 2014-07-08 18:15:20.482 EDTDETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
test1 [unknown] 2014-07-08 18:15:20.482 EDTHINT: In a moment you should
be able to reconnect to the database and repeat your command.
2014-07-08 18:15:20.483 EDTLOG: all server processes terminated;
reinitializing
2014-07-08 18:15:20.720 EDTLOG: database system was interrupted;
last known up at 2014-07-08 18:15:15 EDT
2014-07-08 18:15:20.865 EDTLOG: database system was not properly
shut down; automatic recovery in progress
2014-07-08 18:15:20.954 EDTLOG: redo starts at 5/41023848
2014-07-08 18:15:23.153 EDTLOG: unexpected pageaddr 4/D8DC6000 in
log segment 000000010000000500000044, offset 14442496
2014-07-08 18:15:23.153 EDTLOG: redo done at 5/44DC5F60
2014-07-08 18:15:23.153 EDTLOG: last completed transaction was at
log time 2014-07-08 18:15:17.874937-04
test2 [unknown] 2014-07-08 18:15:24.247 EDTFATAL: the database system
is in recovery mode
test2 [unknown] 2014-07-08 18:15:24.772 EDTFATAL: the database system
is in recovery mode
test2 [unknown] 2014-07-08 18:15:25.281 EDTFATAL: the database system
is in recovery mode
test1 [unknown] 2014-07-08 18:15:25.547 EDTFATAL: the database system
is in recovery mode
test2 [unknown] 2014-07-08 18:15:25.548 EDTFATAL: the database system
is in recovery mode
test3 [unknown] 2014-07-08 18:15:25.549 EDTFATAL: the database system
is in recovery mode
test4 [unknown] 2014-07-08 18:15:25.557 EDTFATAL: the database system
is in recovery mode
test5 [unknown] 2014-07-08 18:15:25.582 EDTFATAL: the database system
is in recovery mode
test2 [unknown] 2014-07-08 18:15:25.584 EDTFATAL: the database system
is in recovery mode
test1 [unknown] 2014-07-08 18:15:25.618 EDTFATAL: the database system
is in recovery mode
test2 [unknown] 2014-07-08 18:15:25.619 EDTFATAL: the database system
is in recovery mode
test3 [unknown] 2014-07-08 18:15:25.621 EDTFATAL: the database system
is in recovery mode
test4 [unknown] 2014-07-08 18:15:25.622 EDTFATAL: the database system
is in recovery mode
test5 [unknown] 2014-07-08 18:15:25.623 EDTFATAL: the database system
is in recovery mode
test1 [unknown] 2014-07-08 18:15:25.624 EDTFATAL: the database system
is in recovery mode
test1 [unknown] 2014-07-08 18:15:25.633 EDTFATAL: the database system
is in recovery mode
^C 2014-07-08 18:15:52.316 EDTLOG: received fast shutdown request

The core file in gdb shows
Core was generated by `postgres: autovacuum w'.
Program terminated with signal 6, Aborted.
#0 0x00007f18be8af295 in ?? ()
(gdb) where
#0 0x00007f18be8af295 in ?? ()
#1 0x00007f18be8b2438 in ?? ()
#2 0x0000000000000020 in ?? ()
#3 0x0000000000000000 in ?? ()

I can't rule out that the hardware on my laptop is misbehaving, but I
haven't noticed any other problems doing non-9.4 stuff.

Has anyone seen anything similar with 9.4? Is there anything specific I
should investigate? (I don't care about recovering the cluster.)
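
For readers following the recovery log above: the WAL segment name and the
byte offset in the "unexpected pageaddr" line can be derived from an LSN by
simple arithmetic. The Python sketch below is an illustration only, not
PostgreSQL code. The function names are made up for this note, and it
assumes the 9.3-and-later file naming scheme and the 16 MB segment size
shown in the pg_controldata output.

```python
# Illustrative sketch: map an LSN such as "5/44DC5FC8" to its WAL segment
# file name and to the byte offset inside that segment.
# Assumes 16 MB segments ("Bytes per WAL segment: 16777216") and the
# TTTTTTTTXXXXXXXXYYYYYYYY naming used by 9.3 and later.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert the textual 'hi/lo' hex form into a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline: int, lsn: int,
                  seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the segment file that contains the given LSN."""
    segno = lsn // seg_size
    segs_per_xlogid = 0x100000000 // seg_size  # 256 for 16 MB segments
    return "%08X%08X%08X" % (timeline,
                             segno // segs_per_xlogid,
                             segno % segs_per_xlogid)


def segment_offset(lsn: int, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Byte offset of the LSN within its segment file."""
    return lsn % seg_size


if __name__ == "__main__":
    redo = parse_lsn("5/44DC5FC8")
    print(wal_file_name(1, redo))
    print(segment_offset(parse_lsn("5/44DC6000")))
```

With the values from the report, 5/44DC5FC8 on timeline 1 maps to
000000010000000500000044, and the page starting at 5/44DC6000 sits at offset
14442496, which matches the log line. The page found there carried the
address 4/D8DC6000 instead, which is what one would typically expect when
replay reaches the not-yet-overwritten tail of a recycled segment.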


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 pg_control corruption
Date: 2014-07-09 02:14:31
Message-ID: 9965.1404872071@sss.pgh.pa.us

Steve Singer <steve(at)ssinger(dot)info> writes:
> I've encountered a corrupt pg_control file on my 9.4 development
> cluster. I've mostly been using the cluster for changeset extraction /
> slony testing.

> This is a 9.4 (currently commit 6ad903d70a440e + a walsender change
> discussed in another thread) but would have had the initdb done with an
> earlier 9.4 snapshot.

Somehow or other you missed the update to pg_control version number 942.
There's no obvious reason to think that this pg_control file is corrupt
on its own terms, but the pg_controldata version you're using expects
the 942 layout. The fact that the server wasn't complaining about this
suggests that you've not recompiled the server, or at least not xlog.c.
Possibly the odd failure to restart indicates that you have a partially
updated server executable?

regards, tom lane
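
The version number discussed above can be checked without pg_controldata.
The Python sketch below is a rough illustration, not an official tool. It
assumes that the control file (global/pg_control under the data directory)
begins with an 8-byte system identifier followed by the 4-byte pg_control
version and the 4-byte catalog version, all in the byte order of the machine
that wrote the file. The function names are invented for this note.

```python
# Illustrative sketch: read the leading fields of a pg_control file and
# compare the stored layout version with the one a binary expects.
# Assumed layout of the first 16 bytes (native byte order, no padding):
#   uint64 system_identifier
#   uint32 pg_control_version
#   uint32 catalog_version_no
import struct

HEADER_FORMAT = "=QII"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 16 bytes


def parse_control_header(data: bytes) -> dict:
    """Decode the first three fields from raw pg_control bytes."""
    if len(data) < HEADER_SIZE:
        raise ValueError("pg_control data is too short")
    sysid, control_version, catalog_version = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE])
    return {
        "system_identifier": sysid,
        "pg_control_version": control_version,
        "catalog_version_no": catalog_version,
    }


def read_control_header(path: str) -> dict:
    """Read and decode the header from a pg_control file on disk."""
    with open(path, "rb") as f:
        return parse_control_header(f.read(HEADER_SIZE))


def is_compatible(data: bytes, expected_version: int) -> bool:
    """True if the stored layout version matches what the binary expects."""
    return parse_control_header(data)["pg_control_version"] == expected_version
```

For the file in this report the stored version would come back as 937, while
the installed binaries expect 942. Fields near the start of the file still
decode sensibly, but anything after the point where the layout changed, and
the CRC, cannot be trusted, which fits the pg_controldata warning.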


From: Steve Singer <steve(at)ssinger(dot)info>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 pg_control corruption
Date: 2014-07-09 02:36:41
Message-ID: BLU436-SMTP1965BE086CFD8F8BC85F597DC0F0@phx.gbl

On 07/08/2014 10:14 PM, Tom Lane wrote:
> Steve Singer <steve(at)ssinger(dot)info> writes:
>> I've encountered a corrupt pg_control file on my 9.4 development
>> cluster. I've mostly been using the cluster for changeset extraction /
>> slony testing.
>> This is a 9.4 (currently commit 6ad903d70a440e + a walsender change
>> discussed in another thread) but would have had the initdb done with an
>> earlier 9.4 snapshot.
> Somehow or other you missed the update to pg_control version number 942.
> There's no obvious reason to think that this pg_control file is corrupt
> on its own terms, but the pg_controldata version you're using expects
> the 942 layout. The fact that the server wasn't complaining about this
> suggests that you've not recompiled the server, or at least not xlog.c.
> Possibly the odd failure to restart indicates that you have a partially
> updated server executable?

The server is complaining about that, it started to after the crash
(which is why I ran pg_controldata)

ssinger(at)ssinger-laptop:/usr/local/pgsql94wal/bin$ ./postgres -D ../data
2014-07-08 22:28:57.796 EDTFATAL: database files are incompatible
with server
2014-07-08 22:28:57.796 EDTDETAIL: The database cluster was
initialized with PG_CONTROL_VERSION 937, but the server was compiled
with PG_CONTROL_VERSION 942.
2014-07-08 22:28:57.796 EDTHINT: It looks like you need to initdb.
ssinger(at)ssinger-laptop:/usr/local/pgsql94wal/bin$

The server seemed fine (and it was 9.4, because I was using 9.4 features).
The server crashed.
The server performed crash recovery.
The server wouldn't start, and pg_controldata shows the attached
output.

I wasn't recompiling or reinstalling around this time either.

> regards, tom lane
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 pg_control corruption
Date: 2014-07-09 02:56:12
Message-ID: 10934.1404874572@sss.pgh.pa.us

Steve Singer <steve(at)ssinger(dot)info> writes:
> On 07/08/2014 10:14 PM, Tom Lane wrote:
>> There's no obvious reason to think that this pg_control file is corrupt
>> on its own terms, but the pg_controldata version you're using expects
>> the 942 layout. The fact that the server wasn't complaining about this
>> suggests that you've not recompiled the server, or at least not xlog.c.

> The server is complaining about that, it started to after the crash

Then you updated your sources, recompiled and reinstalled, but failed to
restart the server when you did that. Else it would have complained on
the spot.

If you had any valuable data in the installation, we could talk about how
to get it out; but since you didn't I'd suggest just re-initdb and move
on. I don't see anything unexpected here.

regards, tom lane
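
The situation described above, new binaries installed underneath a server
that was never restarted, can be spotted by comparing timestamps. The Python
sketch below is only a heuristic. It assumes that the third line of
postmaster.pid in the data directory holds the postmaster start time in
epoch seconds, and that a reinstall updates the modification time of the
postgres executable. The function names and the paths passed to them are
hypothetical.

```python
# Heuristic sketch: was the postgres executable replaced after the
# currently running postmaster started?
# Assumption: line 3 of postmaster.pid is the start time (epoch seconds).
import os


def postmaster_start_time(pid_file_text: str) -> int:
    """Extract the start timestamp from the text of postmaster.pid."""
    lines = pid_file_text.splitlines()
    if len(lines) < 3:
        raise ValueError("postmaster.pid has fewer than 3 lines")
    return int(lines[2].strip())


def binary_replaced_since_start(pid_file_text: str,
                                binary_mtime: float) -> bool:
    """True if the executable is newer than the running postmaster."""
    return binary_mtime > postmaster_start_time(pid_file_text)


def check_installation(data_dir: str, postgres_binary: str) -> bool:
    """Run the comparison against real files (paths supplied by caller)."""
    with open(os.path.join(data_dir, "postmaster.pid")) as f:
        text = f.read()
    return binary_replaced_since_start(text,
                                       os.stat(postgres_binary).st_mtime)
```

A True result only means a restart is overdue. It says nothing about whether
the on-disk format changed, so the pg_control version still has to be
compared separately.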


From: 李海龙 <hailong(dot)li(at)qunar(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "steve(at)ssinger(dot)info" <steve(at)ssinger(dot)info>
Subject: Re: 9.4 pg_control corruption
Date: 2014-07-27 12:21:56
Message-ID: 53D4EEE2.2050807@qunar.com


Hi, dear Steve and pgsql-hackers,

I've encountered a similar phenomenon with 9.4.

1. environment

1.1 OS version

postgres(at)lhl-Latitude-E5420:~$ cat /etc/issue
Ubuntu 13.10 \n \l

postgres(at)lhl-Latitude-E5420:~$ uname -av
Linux lhl-Latitude-E5420 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9
16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

1.2 PostgreSQL version

postgres(at)lhl-Latitude-E5420:~$ /opt/pg94/bin/pg_controldata --version
pg_controldata (PostgreSQL) 9.4beta2
postgres(at)lhl-Latitude-E5420:~$ /opt/pg94/bin/pg_config
BINDIR = /opt/pg94/bin
DOCDIR = /opt/pg94/share/doc/postgresql
HTMLDIR = /opt/pg94/share/doc/postgresql
INCLUDEDIR = /opt/pg94/include
PKGINCLUDEDIR = /opt/pg94/include/postgresql
INCLUDEDIR-SERVER = /opt/pg94/include/postgresql/server
LIBDIR = /opt/pg94/lib
PKGLIBDIR = /opt/pg94/lib/postgresql
LOCALEDIR = /opt/pg94/share/locale
MANDIR = /opt/pg94/share/man
SHAREDIR = /opt/pg94/share/postgresql
SYSCONFDIR = /opt/pg94/etc/postgresql
PGXS = /opt/pg94/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/opt/pg94' '--with-perl' '--with-libxml'
'--with-libxslt' '--with-ossp-uuid'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,--as-needed
-Wl,-rpath,'/opt/pg94/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lxslt -lxml2 -lz -lreadline -lrt -lcrypt
-ldl -lm
VERSION = PostgreSQL 9.4beta2

2. phenomenon

I have a PostgreSQL datadir named /export/pg94beta1_data/ which was
initialized with PostgreSQL 9.4beta1:

postgres(at)lhl-Latitude-E5420:~$ /opt/pg94/bin/pg_controldata
/export/pg94beta1_data/
WARNING: Calculated CRC checksum does not match value stored in file.
Either the file is corrupt, or it has a different layout than this program
is expecting. The results below are untrustworthy.

pg_control version number: 937
Catalog version number: 201405111
Database system identifier: 6014427290583411360
Database cluster state: in production
pg_control last modified: Sunday, 2014-07-27 16:36:50
Latest checkpoint location: 0/17462890
Prior checkpoint location: 0/17462828
Latest checkpoint's REDO location: 0/17462890
Latest checkpoint's REDO WAL file: 000000010000000000000017
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID: 0/1387
Latest checkpoint's NextOID: 22220
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 715
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Sunday, 2014-07-27 16:36:50
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: minimal
Current wal_log_hints setting: off
Current max_connections setting: 100
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 65793
Date/time type storage: floating-point numbers
Float4 argument passing: by reference
Float8 argument passing: by reference
Data page checksum version: 307500851

but the server reported the following when I started it with
PostgreSQL 9.4beta2:

postgres(at)lhl-Latitude-E5420:~$ /opt/pg94/bin/pg_ctl -D
/export/pg94beta1_data/ start
server starting
postgres(at)lhl-Latitude-E5420:~$ [ 2014-07-27 19:23:57.922 CST 27983
53d4e14d.6d4f 1 0]FATAL: database files are incompatible with server
[ 2014-07-27 19:23:57.922 CST 27983 53d4e14d.6d4f 2 0]DETAIL: The
database cluster was initialized with PG_CONTROL_VERSION 937, but the
server was compiled with PG_CONTROL_VERSION 942.
[ 2014-07-27 19:23:57.922 CST 27983 53d4e14d.6d4f 3 0]HINT: It looks
like you need to initdb.

I had always assumed that a PG_CONTROL_VERSION mismatch should not
occur when upgrading between small releases of PostgreSQL.

Is there some important difference in PostgreSQL 9.4?

Thanks

Best Regards!

On 2014-07-09 10:36, Steve Singer wrote:
> On 07/08/2014 10:14 PM, Tom Lane wrote:
>> Steve Singer <steve(at)ssinger(dot)info> writes:
>>> I've encountered a corrupt pg_control file on my 9.4 development
>>> cluster. I've mostly been using the cluster for changeset extraction /
>>> slony testing.
>>> This is a 9.4 (currently commit 6ad903d70a440e + a walsender change
>>> discussed in another thread) but would have had the initdb done with an
>>> earlier 9.4 snapshot.
>> Somehow or other you missed the update to pg_control version number 942.
>> There's no obvious reason to think that this pg_control file is corrupt
>> on its own terms, but the pg_controldata version you're using expects
>> the 942 layout. The fact that the server wasn't complaining about this
>> suggests that you've not recompiled the server, or at least not xlog.c.
>> Possibly the odd failure to restart indicates that you have a partially
>> updated server executable?
>
>
> The server is complaining about that, it started to after the crash
> (which is why I ran pg_controldata)
>
> ssinger(at)ssinger-laptop:/usr/local/pgsql94wal/bin$ ./postgres -D ../data
> 2014-07-08 22:28:57.796 EDTFATAL: database files are incompatible
> with server
> 2014-07-08 22:28:57.796 EDTDETAIL: The database cluster was
> initialized with PG_CONTROL_VERSION 937, but the server was compiled
> with PG_CONTROL_VERSION 942.
> 2014-07-08 22:28:57.796 EDTHINT: It looks like you need to initdb.
> ssinger(at)ssinger-laptop:/usr/local/pgsql94wal/bin$
>
>
> The server seemed fine (and it was 9.4 because I was using 9.4 features)
> The server crashed
> The server performed crash recovery
> The server wouldn't start and pg_controldata shows the attached
> output
>
> I wasn't recompiling or reinstalling around this time either.
>
>
>
>> regards, tom lane
>>
>>
>
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: 李海龙 <hailong(dot)li(at)qunar(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "steve(at)ssinger(dot)info" <steve(at)ssinger(dot)info>
Subject: Re: 9.4 pg_control corruption
Date: 2014-07-27 16:35:49
Message-ID: 24765.1406478949@sss.pgh.pa.us

李海龙 <hailong(dot)li(at)qunar(dot)com> writes:
> I have a PostgreSQL datadir named /export/pg94beta1_data/ which was
> initialized with PostgreSQL 9.4beta1,
> [ and 9.4beta2 won't start with it ]

This is expected; you need to initdb. Or use pg_upgrade to upgrade
the cluster. We had to change pg_control format post-beta1.

regards, tom lane


From: 李海龙 <hailong(dot)li(at)qunar(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "steve(at)ssinger(dot)info" <steve(at)ssinger(dot)info>
Subject: Re: 9.4 pg_control corruption
Date: 2014-07-27 16:56:02
Message-ID: 53D52F21.3090505@qunar.com

Understood!

Before I wrote my last email, I had initialized a new database with
PostgreSQL 9.4beta2 and restored the pg_dumpall output from
/export/pg94beta1_data/.

Thanks

Best Regards!

at 2014-07-28 00:35 +08, Tom Lane wrote:
> 李海龙 <hailong(dot)li(at)qunar(dot)com> writes:
>> I have a PostgreSQL datadir named /export/pg94beta1_data/ which was
>> initialized with PostgreSQL 9.4beta1,
>> [ and 9.4beta2 won't start with it ]
> This is expected; you need to initdb. Or use pg_upgrade to upgrade
> the cluster. We had to change pg_control format post-beta1.
>
> regards, tom lane
>
>


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, 李海龙 <hailong(dot)li(at)qunar(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "steve(at)ssinger(dot)info" <steve(at)ssinger(dot)info>
Subject: Re: 9.4 pg_control corruption
Date: 2014-07-27 18:15:21
Message-ID: 53D541B9.1030403@agliodbs.com

On 07/27/2014 09:35 AM, Tom Lane wrote:
> 李海龙 <hailong(dot)li(at)qunar(dot)com> writes:
>> I have a PostgreSQL datadir named /export/pg94beta1_data/ which was
>> initialized with PostgreSQL 9.4beta1,
>> [ and 9.4beta2 won't start with it ]
>
> This is expected; you need to initdb. Or use pg_upgrade to upgrade
> the cluster. We had to change pg_control format post-beta1.

Thank you for testing that though!

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com