'full_page_writes=off' , VACUUM and crashing streaming slaves...

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-06 17:32:00
Message-ID: F0D8FB08-5CD7-4905-9F9D-ECDB0E4ADD74@chittenden.org
Lists: pgsql-general

This is a master/slave streaming replication setup; both the master and the slave have full_page_writes disabled because the system is running on ZFS.
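For reference, the configuration amounts to something like the following sketch; apart from full_page_writes = off, the values shown are illustrative 9.2-era settings, not copied from the actual config:

```
# postgresql.conf (sketch; only full_page_writes = off is taken from this
# report -- the other values are illustrative 9.2-era settings)
full_page_writes = off        # torn-page protection left to ZFS copy-on-write
wal_level = hot_standby       # required for streaming replication in 9.2
max_wal_senders = 3           # one per slave, plus headroom
hot_standby = on              # on the slave, to allow read-only queries
```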

While the master was still being set up and data was being loaded, the system ran out of open file descriptors and Pg crashed a few times until this was realized and corrected. Streaming replication was set up and was correctly passing data to the slave:

repmgr(at)db01=# select * FROM pg_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+-----------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
68057 | 16384 | repmgr | walreceiver | 192.168.100.183 | | 60743 | 2012-10-06 16:41:36.963769+00 | streaming | 9E/E2C81210 | 9E/E2C81210 | 9E/E2C81210 | 9E/E2C80C68 | 0 | async
(1 row)

Once the host was up and running, I kicked off a manual VACUUM, at which point I see a bunch of the following errors on the master:

INFO: vacuuming "schma.tbl"
WARNING: relation "tbl" page 65305 is uninitialized --- fixing
WARNING: relation "tbl" page 70927 is uninitialized --- fixing
WARNING: relation "tbl" page 75286 is uninitialized --- fixing
WARNING: relation "tbl" page 86601 is uninitialized --- fixing
WARNING: relation "tbl" page 91434 is uninitialized --- fixing
WARNING: relation "tbl" page 91435 is uninitialized --- fixing
WARNING: relation "tbl" page 95639 is uninitialized --- fixing
WARNING: relation "tbl" page 105638 is uninitialized --- fixing
WARNING: relation "tbl" page 114550 is uninitialized --- fixing
WARNING: relation "tbl" page 118380 is uninitialized --- fixing
WARNING: relation "tbl" page 121478 is uninitialized --- fixing
WARNING: relation "tbl" page 122098 is uninitialized --- fixing
WARNING: relation "tbl" page 127195 is uninitialized --- fixing
WARNING: relation "tbl" page 152416 is uninitialized --- fixing
WARNING: relation "tbl" page 230192 is uninitialized --- fixing
WARNING: relation "tbl" page 241091 is uninitialized --- fixing
INFO: scanned index "tbl_pkey" to remove 447 row versions

and on the slave I see:

Oct 6 17:04:02 db02 postgres[4535]: [14-1] @ 4535 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:04:02 db02 postgres[4535]: [14-2] @ 4535 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:04:02 db02 postgres[4535]: [15-1] @ 4535 0: PANIC: WAL contains references to invalid pages
Oct 6 17:04:02 db02 postgres[4535]: [15-2] @ 4535 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:04:02 db02 postgres[4529]: [10-1] @ 4529 0: LOG: startup process (PID 4535) was terminated by signal 6: Abort trap
Oct 6 17:04:02 db02 postgres[4529]: [11-1] @ 4529 0: LOG: terminating any other active server processes
Oct 6 17:04:02 db02 postgres[8345]: [26-1] pgsql(at)repmgr 8345 0: WARNING: terminating connection because of crash of another server process
Oct 6 17:04:02 db02 postgres[8345]: [26-2] pgsql(at)repmgr 8345 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 6 17:04:02 db02 postgres[8345]: [26-3] pgsql(at)repmgr 8345 0: HINT: In a moment you should be able to reconnect to the database and repeat your command.

Restarting the slave, I see:

Oct 6 17:11:52 db02 postgres[15191]: [12-1] @ 15191 0: LOG: consistent recovery state reached at 9E/E3DDE2B0
Oct 6 17:11:52 db02 postgres[15185]: [9-1] @ 15185 0: LOG: database system is ready to accept read only connections
Oct 6 17:12:03 db02 postgres[15191]: [13-1] @ 15191 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:12:03 db02 postgres[15191]: [13-2] @ 15191 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:12:03 db02 postgres[15191]: [14-1] @ 15191 0: PANIC: WAL contains references to invalid pages
Oct 6 17:12:03 db02 postgres[15191]: [14-2] @ 15191 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:12:03 db02 postgres[15185]: [10-1] @ 15185 0: LOG: startup process (PID 15191) was terminated by signal 6: Abort trap
Oct 6 17:12:03 db02 postgres[15185]: [11-1] @ 15185 0: LOG: terminating any other active server processes

At this point I suspect full_page_writes=off is the culprit, violating assumptions in several places. I haven't tracked things down further, but I thought I'd at least bring this to people's attention so that the appropriate warnings can be put in place. The slave cannot recover past the corrupted WAL entries, however, and needs to be resynced with full_page_writes enabled (at least that's my assumption at present):

Oct 6 17:26:34 db02 postgres[19614]: [12-1] @ 19614 0: LOG: consistent recovery state reached at 9E/E3DDE2B0
Oct 6 17:26:34 db02 postgres[19608]: [11-1] @ 19608 0: LOG: database system is ready to accept read only connections
Oct 6 17:26:44 db02 postgres[19614]: [13-1] @ 19614 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:26:44 db02 postgres[19614]: [13-2] @ 19614 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:26:44 db02 postgres[19614]: [14-1] @ 19614 0: PANIC: WAL contains references to invalid pages
Oct 6 17:26:44 db02 postgres[19614]: [14-2] @ 19614 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:26:44 db02 postgres[19608]: [12-1] @ 19608 0: LOG: startup process (PID 19614) was terminated by signal 6: Abort trap
Oct 6 17:26:44 db02 postgres[19608]: [13-1] @ 19608 0: LOG: terminating any other active server processes

Cheers. I'll report back if I dig anything else up. -sc

[ plz CC as I'm not subscribed to the list ]

--
Sean Chittenden
sean(at)chittenden(dot)org


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-06 19:43:27
Message-ID: 7609.1349552607@sss.pgh.pa.us
Lists: pgsql-general

Sean Chittenden <sean(at)chittenden(dot)org> writes:
> While the master was still being setup and data was being loaded, the
> system ran out of open file descriptors and Pg crashed a few times
> until this was realized and corrected.

Could you provide a bit more detail than that? Usually the behavior we
see under out-of-FDs conditions is that everything else on the system
starts to fall over badly, but PG keeps soldiering along to some extent.
I would not expect data corruption to result, anyway.

If you've got the postmaster logs from this episode, it would be useful
to see what complaints got logged.

regards, tom lane


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-07 00:58:24
Message-ID: EFE82544-2AA5-4899-9FA3-F092D26BF3D4@chittenden.org
Lists: pgsql-general

>> While the master was still being setup and data was being loaded, the
>> system ran out of open file descriptors and Pg crashed a few times
>> until this was realized and corrected.
>
> Could you provide a bit more detail than that? Usually the behavior we
> see under out-of-FDs conditions is that everything else on the system
> starts to fall over badly, but PG keeps soldiering along to some extent.
> I would not expect data corruption to result, anyway.

Yup, I ran out of time earlier and wasn't able to dig them out, but I just recovered them. It looks like there are two error modes that I ran into as a result of running out of available FDs.

> If you've got the postmaster logs from this episode, it would be useful
> to see what complaints got logged.

These failures are against 9.2.1, btw.

The first crash scenario:

Oct 5 15:00:24 db01 postgres[75852]: [6449-2] javafail(at)dbcluster 75852 0: STATEMENT: SELECT /* query */ FROM tbl AS this_ WHERE this_.user_id=$1
Oct 5 15:00:24 db01 postgres[75852]: [6456-1] javafail(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20013": Too many open files in system
[snip - lots of could not seek to end of file errors. How does seek(2) consume a file descriptor??? ]
Oct 5 15:00:24 db01 postgres[75852]: [6480-1] javafail(at)dbcluster 75852 0: ERROR: could not open file "base/16387/20275": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6480-2] javafail(at)dbcluster 75852 0: STATEMENT: SELECT /* some other query */ FROM tbl2 AS this_ WHERE this_.user_id=$1 limit $2
Oct 5 15:00:25 db01 postgres[75852]: [6481-1] javafail(at)dbcluster 75852 0: ERROR: current transaction is aborted, commands ignored until end of transaction block
Oct 5 15:00:25 db01 postgres[75852]: [6481-2] javafail(at)dbcluster 75852 0: STATEMENT: SELECT 1
Oct 5 15:00:25 db01 postgres[76648]: [5944-1] javafail(at)dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:00:25 db01 postgres[76649]: [5944-1] javafail(at)dbcluster 76649 0: FATAL: pipe() failed: Too many open files in system
[ snip - this makes sense, pipe(2)'s able to return ENFILE ]
Oct 5 15:00:25 db01 postgres[76407]: [5894-1] javafail(at)dbcluster 76407 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76312]: [6295-1] javafail(at)dbcluster 76312 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76331]: [8612-1] javafail(at)dbcluster 76331 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76407]: [5894-2] javafail(at)dbcluster 76407 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:00:25 db01 postgres[76312]: [6295-2] javafail(at)dbcluster 76312 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:00:25 db01 postgres[76315]: [7317-1] javafail(at)dbcluster 76315 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76407]: [5894-3] javafail(at)dbcluster 76407 0: HINT: In a moment you should be able to reconnect to the database and repeat your command.
Oct 5 15:00:25 db01 postgres[76302]: [10249-1] javafail(at)dbcluster 76302 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76331]: [8612-2] javafail(at)dbcluster 76331 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:00:25 db01 postgres[76286]: [7471-1] javafail(at)dbcluster 76286 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76307]: [7628-1] javafail(at)dbcluster 76307 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76285]: [7034-1] javafail(at)dbcluster 76285 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76331]: [8612-3] javafail(at)dbcluster 76331 0: HINT: In a moment you should be able to reconnect to the database and repeat your command.
[snip]
Oct 5 15:00:28 db01 postgres[76875]: [5949-1] javafail(at)dbcluster 76875 0: FATAL: the database system is in recovery mode
[snip]
Oct 5 15:01:28 db01 postgres[76888]: [5951-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5953-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5955-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5957-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5959-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
[snip]
Oct 5 15:01:31 db01 postgres[76888]: [6503-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:31 db01 postgres[77571]: [5978-1] @ 77571 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77572]: [5979-1] javafail(at)dbcluster 77572 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77572]: [5980-1] javafail(at)dbcluster 77572 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:01:31 db01 postgres[77572]: [5980-2] javafail(at)dbcluster 77572 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:01:31 db01 postgres[77572]: [5980-3] javafail(at)dbcluster 77572 0: HINT: In a moment you should be able to reconnect to the database and repeat your command.
Oct 5 15:01:31 db01 postgres[77544]: [7490-1] javafail(at)dbcluster 77544 0: WARNING: terminating connection because of crash of another server process
[snip]
Oct 5 15:01:34 db01 postgres[77811]: [5983-1] javafail(at)dbcluster 77811 0: FATAL: the database system is in recovery mode
Oct 5 15:02:20 db01 postgres[78473]: [5987-1] javafail(at)dbcluster 78473 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:02:20 db01 postgres[78466]: [5989-1] javafail(at)dbcluster 78466 0: WARNING: terminating connection because of crash of another server process

What's odd to me is not the failure scenarios that come from a system running out of FDs (though seek(2) consuming an FD seems odd), but that it's still possible for a VACUUM on the master to clean up a bogus or partial page write and have the slave crash when the WAL entry is shipped over.

Are there any other bits of info that you want me to hold on to, extract, or things you want me to test? -sc

--
Sean Chittenden
sean(at)chittenden(dot)org


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-07 16:08:36
Message-ID: 12579.1349626116@sss.pgh.pa.us
Lists: pgsql-general

Sean Chittenden <sean(at)chittenden(dot)org> writes:
>> If you've got the postmaster logs from this episode, it would be useful
>> to see what complaints got logged.

> The first crash scenario:

> Oct 5 15:00:24 db01 postgres[75852]: [6449-2] javafail(at)dbcluster 75852 0: STATEMENT: SELECT /* query */ FROM tbl AS this_ WHERE this_.user_id=$1
> Oct 5 15:00:24 db01 postgres[75852]: [6456-1] javafail(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20013": Too many open files in system
> [snip - lots of could not seek to end of file errors. How does seek(2) consume a file descriptor??? ]

It doesn't, but FileSeek() might need to do an open if the file wasn't
currently open. This isn't that surprising.
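[ Editor's note: a tiny model of this, my own sketch rather than PostgreSQL's fd.c: the server keeps a table of "virtual" file descriptors and lazily reopens files on access, so a seek on a currently-closed virtual FD must open() first and can therefore fail with ENFILE. ]

```python
import os

class VFD:
    """Hypothetical model of a virtual file descriptor: the kernel FD is
    acquired lazily, so even a seek may trigger an open()."""
    def __init__(self, path):
        self.path = path
        self.fd = None              # not yet backed by a kernel FD

    def _access(self):
        if self.fd is None:         # lazy reopen; this open() can hit ENFILE
            self.fd = os.open(self.path, os.O_RDONLY)
        return self.fd

    def seek_end(self):
        # the "seek to end of file" that errored in the logs above
        return os.lseek(self._access(), 0, os.SEEK_END)
```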

> Oct 5 15:00:25 db01 postgres[76648]: [5944-1] javafail(at)dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system

This message must be coming from initSelfPipe(), and after poking around
a bit I think the failure must be occurring while a new backend is
attempting to do "OwnLatch(&MyProc->procLatch)" in InitProcess. The
reason the postmaster treats this as a crash is that the new backend
just armed the dead-man switch (MarkPostmasterChildActive) but it exits
without doing ProcKill which would disarm it. So this is just an
order-of-operations bug in InitProcess: we're assuming that it can't
fail before reaching "on_shmem_exit(ProcKill, 0)", and the latch
additions broke that. (Though looking at it, assuming that the
PGSemaphoreReset call cannot fail seems a tad risky too.)
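[ Editor's note: to make the ordering concrete, here is a toy model (my own sketch, not PostgreSQL code; the comments name the functions referred to above) of why a failure between arming the dead-man switch and registering the cleanup hook reads as a crash to the postmaster. ]

```python
class Backend:
    """Model of the InitProcess order-of-operations bug: the dead-man
    switch is armed before the hook that disarms it is registered."""
    def __init__(self):
        self.deadman_armed = False
        self.cleanup_registered = False

    def init_process(self, fail_early):
        self.deadman_armed = True          # MarkPostmasterChildActive
        if fail_early:                     # e.g. OwnLatch's pipe() hits ENFILE
            return False                   # FATAL exit before cleanup exists
        self.cleanup_registered = True     # on_shmem_exit(ProcKill, 0)
        return True

    def exit(self):
        if self.cleanup_registered:
            self.deadman_armed = False     # ProcKill disarms the switch

b = Backend()
if not b.init_process(fail_early=True):
    b.exit()
# switch is still armed, so the postmaster treats this exit as a crash
print("crash" if b.deadman_armed else "clean")   # prints "crash"
```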

So that explains the crashes, but it doesn't (directly) explain why you
had data corruption.

I think the uninitialized pages are showing up because you had crashes
in the midst of relation-extension operations, ie, some other backend
had successfully done an smgrextend but hadn't yet laid down any valid
data in the new page. However, this theory would not explain more than
one uninitialized page per crash, and your previous message seems to
show rather a lot of uninitialized pages. How many pipe-failed crashes
did you have?

> What's odd to me is not the failure scenarios that come from a system running out of FDs (though seek(2)'ing consuming an FD seems odd), it's more that it's still possible for a master DB's VACUUM to clean up a bogus or partial page write, and have the slave crash when the WAL entry is shipped over.

It looks to me like vacuumlazy.c doesn't bother to emit a WAL record at
all when fixing an all-zeroes heap page. I'm not sure if that's a
problem or not. The first actual use of such a page ought to result in
re-init'ing it anyway (cf XLOG_HEAP_INIT_PAGE logic in heap_insert),
so right offhand I don't see a path from this to the slave-side failures
you saw. (But on the other hand I'm only firing on one cylinder today
because of a head cold, so maybe I'm just missing it.)

Do the slave-side failures correspond to pages that were reported as
"fixed" on the master?

regards, tom lane


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-07 22:19:41
Message-ID: A4E49A32-EF2B-49E7-BB56-024539464E76@chittenden.org
Lists: pgsql-general

> So that explains the crashes, but it doesn't (directly) explain why you
> had data corruption.

The first crash message:

Oct 5 15:00:21 db01 postgres[75852]: [6406-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/18770": Too many open files in system
Oct 5 15:00:22 db01 postgres[75852]: [6413-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20259": Too many open files in system
Oct 5 15:00:23 db01 postgres[75852]: [6424-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20218": Too many open files in system
Oct 5 15:00:23 db01 postgres[75852]: [6431-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20218": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6442-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20218": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6449-1] failjava(at)dbcluster 75852 0: ERROR: could not open file "base/16387/20302": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6456-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20013": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6463-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20259": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6470-1] failjava(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20259": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6480-1] failjava(at)dbcluster 75852 0: ERROR: could not open file "base/16387/20275": Too many open files in system
Oct 5 15:00:25 db01 postgres[75852]: [6481-1] failjava(at)dbcluster 75852 0: ERROR: current transaction is aborted, commands ignored until end of transaction block
Oct 5 15:00:25 db01 postgres[76648]: [5944-1] failjava(at)dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:00:25 db01 postgres[76649]: [5944-1] failjava(at)dbcluster 76649 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:00:25 db01 postgres[76407]: [5894-1] failjava(at)dbcluster 76407 0: WARNING: terminating connection because of crash of another server process

For the sake of completeness, I tried to correlate the above relations with the tables that had problems, and there is zero correlation between the seek(2) errors and the tables where VACUUM reported uninitialized pages.

nspname | relname | relnamespace | relfilenode | relkind | oid
-----------------+-----------------+--------------+-------------+---------+-------
failjava_master | tbl_F | 16388 | 16402 | r | 16402
failjava_db1 | tbl_D | 16417 | 18473 | r | 18473
failjava_db1 | tbl_E | 16417 | 20179 | r | 20179
failjava_db1 | tbl_C | 16417 | 20187 | r | 20187
failjava_db1 | tbl_B | 16417 | 20200 | r | 20200
failjava_db1 | tbl_A | 16417 | 41731 | r | 19533

But I'm wondering if it died a second time before the system had finished its recovery:

Oct 5 15:01:31 db01 postgres[76888]: [6503-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:31 db01 postgres[77571]: [5978-1] @ 77571 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77572]: [5979-1] failjava(at)dbcluster 77572 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77572]: [5980-1] failjava(at)dbcluster 77572 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:01:31 db01 postgres[77544]: [7490-1] failjava(at)dbcluster 77544 0: WARNING: terminating connection because of crash of another server process

Unfortunately the logs that I have for the first set of crashes are limited to WARNING and above and don't include INFO-level messages (I do have those later on; due to various IRL constraints - the hardware failure of the intended host - the provisioning of the master was a bit of a cannonball dash).

> I think the uninitialized pages are showing up because you had crashes
> in the midst of relation-extension operations, ie, some other backend
> had successfully done an smgrextend but hadn't yet laid down any valid
> data in the new page. However, this theory would not explain more than
> one uninitialized page per crash, and your previous message seems to
> show rather a lot of uninitialized pages. How many pipe-failed crashes
> did you have?

Assuming ~1min per recovery, it looks like around 38.

Oct 5 15:00:25 db01 postgres[76648]: [5944-1] failjava(at)dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:00:25 db01 postgres[76649]: [5944-1] failjava(at)dbcluster 76649 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77571]: [5978-1] @ 77571 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77572]: [5979-1] failjava(at)dbcluster 77572 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:02:20 db01 postgres[78473]: [5987-1] failjava(at)dbcluster 78473 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:03:47 db01 postgres[79408]: [6005-1] failjava(at)dbcluster 79408 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:05:06 db01 postgres[80440]: [6112-1] failjava(at)dbcluster 80440 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:06:09 db01 postgres[81276]: [6117-1] @ 81276 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:07:04 db01 postgres[82102]: [6131-1] failjava(at)dbcluster 82102 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:08:19 db01 postgres[82336]: [6136-1] @ 82336 0: FATAL: could not open archive status directory "pg_xlog/archive_status": Too many open files in system
Oct 5 15:08:20 db01 postgres[83084]: [6161-1] @ 83084 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:08:22 db01 postgres[83091]: [6184-1] failjava(at)dbcluster 83091 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:09:36 db01 postgres[84096]: [6218-1] failjava(at)dbcluster 84096 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:10:50 db01 postgres[85064]: [6226-1] failjava(at)dbcluster 85064 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:12:05 db01 postgres[86106]: [6233-1] failjava(at)dbcluster 86106 0: FATAL: could not open relation mapping file "global/pg_filenode.map": Too many open files in system
Oct 5 15:12:05 db01 postgres[86107]: [6233-1] failjava(at)dbcluster 86107 0: FATAL: could not open relation mapping file "global/pg_filenode.map": Too many open files in system
Oct 5 15:12:05 db01 postgres[86108]: [6233-1] failjava(at)dbcluster 86108 0: FATAL: could not open relation mapping file "global/pg_filenode.map": Too many open files in system
Oct 5 15:12:05 db01 postgres[86109]: [6233-1] failjava(at)dbcluster 86109 0: FATAL: could not open relation mapping file "global/pg_filenode.map": Too many open files in system
Oct 5 15:12:05 db01 postgres[86105]: [6234-1] failjava(at)dbcluster 86105 0: FATAL: could not open file "base/16387/PG_VERSION": Too many open files in system
Oct 5 15:12:08 db01 postgres[86119]: [6252-1] failjava(at)dbcluster 86119 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:13:36 db01 postgres[87126]: [6295-1] failjava(at)dbcluster 87126 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:14:23 db01 postgres[88122]: [6364-1] failjava(at)dbcluster 88122 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:15:31 db01 postgres[88370]: [6369-1] @ 88370 0: FATAL: could not open archive status directory "pg_xlog/archive_status": Too many open files in system
Oct 5 15:15:31 db01 postgres[89057]: [6398-1] @ 89057 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:15:34 db01 postgres[89067]: [6426-1] failjava(at)dbcluster 89067 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:16:38 db01 postgres[90080]: [6430-1] @ 90080 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:17:19 db01 postgres[90941]: [6446-1] failjava(at)dbcluster 90941 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:18:43 db01 postgres[91898]: [6484-1] failjava(at)dbcluster 91898 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:20:02 db01 postgres[93068]: [6514-1] failjava(at)dbcluster 93068 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:22:03 db01 postgres[94275]: [6519-1] failjava(at)dbcluster 94275 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:22:03 db01 postgres[94276]: [6519-1] failjava(at)dbcluster 94276 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:23:39 db01 postgres[95392]: [6525-1] failjava(at)dbcluster 95392 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:24:52 db01 postgres[96371]: [6533-1] failjava(at)dbcluster 96371 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:26:22 db01 postgres[97337]: [6570-1] failjava(at)dbcluster 97337 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:27:48 db01 postgres[98419]: [6607-1] failjava(at)dbcluster 98419 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:29:02 db01 postgres[99379]: [6706-1] failjava(at)dbcluster 99379 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:30:20 db01 postgres[467]: [6719-1] failjava(at)dbcluster 467 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:32:08 db01 postgres[1479]: [6769-1] failjava(at)dbcluster 1479 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:33:28 db01 postgres[2714]: [6776-1] failjava(at)dbcluster 2714 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:34:18 db01 postgres[3600]: [6870-1] failjava(at)dbcluster 3600 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:36:07 db01 postgres[4608]: [6896-1] failjava(at)dbcluster 4608 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:37:18 db01 postgres[5652]: [6944-1] failjava(at)dbcluster 5652 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:38:56 db01 postgres[6640]: [7018-1] failjava(at)dbcluster 6640 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:39:46 db01 postgres[7680]: [7025-1] failjava(at)dbcluster 7680 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:41:08 db01 postgres[8609]: [7081-1] failjava(at)dbcluster 8609 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:42:20 db01 postgres[9546]: [7095-1] failjava(at)dbcluster 9546 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:43:53 db01 postgres[10468]: [7127-1] failjava(at)dbcluster 10468 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:45:32 db01 postgres[11634]: [7159-1] failjava(at)dbcluster 11634 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:47:10 db01 postgres[12721]: [7184-1] failjava(at)dbcluster 12721 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:48:47 db01 postgres[13831]: [7194-1] failjava(at)dbcluster 13831 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:49:47 db01 postgres[14912]: [7217-1] failjava(at)dbcluster 14912 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:51:32 db01 postgres[15928]: [7223-1] failjava(at)dbcluster 15928 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:52:34 db01 postgres[16954]: [7231-1] failjava(at)dbcluster 16954 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:53:49 db01 postgres[17963]: [7247-1] failjava(at)dbcluster 17963 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:55:36 db01 postgres[19309]: [7269-1] failjava(at)dbcluster 19309 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:57:08 db01 postgres[20273]: [7286-1] failjava(at)dbcluster 20273 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:58:34 db01 postgres[20616]: [7291-1] @ 20616 0: FATAL: could not open archive status directory "pg_xlog/archive_status": Too many open files in system
Oct 5 15:58:34 db01 postgres[21404]: [7293-1] @ 21404 0: FATAL: could not open archive status directory "pg_xlog/archive_status": Too many open files in system
Oct 5 15:58:47 db01 postgres[21442]: [7294-1] @ 21442 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:58:49 db01 postgres[21450]: [7318-1] failjava(at)dbcluster 21450 0: FATAL: pipe() failed: Too many open files in system
[ Once I set up monitoring, I became aware that the app (in a 100% idle state) was consuming ~80K fds because the app's initialization procedure touches nearly all 500 tables across its ~2K connections ]
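[ Editor's note: as a sanity check on the "around 38" figure, a throwaway script along these lines could tally distinct crash events from a syslog excerpt like the one above; the 60-second gap threshold and the timestamp parsing are assumptions, not taken from the report. ]

```python
from datetime import datetime

def count_crash_events(lines, gap_seconds=60):
    """Count distinct crash events: FATAL log lines whose timestamps fall
    within gap_seconds of each other are assumed to belong to the same
    crash/recovery cycle."""
    crashes = 0
    last = None
    for line in lines:
        if "FATAL" not in line:
            continue
        # syslog timestamp like "Oct 5 15:00:25" (year not recorded)
        ts = datetime.strptime(" ".join(line.split()[:3]), "%b %d %H:%M:%S")
        if last is None or (ts - last).total_seconds() > gap_seconds:
            crashes += 1
        last = ts
    return crashes
```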

<rant>It would be exceptionally nice if JDBC grew a brain and were written intelligently enough that pgbouncer could be used in transaction mode, thereby allowing exceptionally poorly written apps and database connection code that establish 5-20 connections per thread to be parked in front of pgbouncer instead of needing to talk directly to a backend process. Use of duplicate names for prepared statements, or preparing statements outside of a transaction, really ties the hands of folks who are attempting to insert additional layers into the stack to scale up applications that are effectively in an immutable and broken state of disrepair. <pipedream>A more realistic but disgusting solution would be to teach pgbouncer to intercept all prepared statements from a given client and on-the-fly rewrite the names to something that acts as a unique cookie and could be reused across backends if a given backend hasn't seen a unique prepared statement before.</pipedream></rant>

I'm trying to determine that now. I actually only included a small snippet of the uninitialized pages. It looks like there are 28 total uninitialized pages that VACUUM discovered:

[ autovacuum appears to have caught these two, however the slave wasn't turned up at this point ]
Oct 5 19:42:44 db01 postgres[37748]: [8908-1] @ 37748 0: WARNING: relation "tbl_E" page 4263 is uninitialized --- fixing
Oct 5 19:43:03 db01 postgres[37324]: [8908-1] @ 37324 0: WARNING: relation "tbl_B" page 2628 is uninitialized --- fixing

[ And the remaining 26 were caught via an explicitly executed `vacuumdb -az` ]
Oct 6 17:03:24 db01 postgres[73864]: [8911-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_F" page 65722 is uninitialized --- fixing
Oct 6 17:03:43 db01 postgres[73864]: [8912-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_D" page 3563 is uninitialized --- fixing
Oct 6 17:03:43 db01 postgres[73864]: [8913-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_D" page 3564 is uninitialized --- fixing
Oct 6 17:03:43 db01 postgres[73864]: [8914-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_D" page 3565 is uninitialized --- fixing
Oct 6 17:03:49 db01 postgres[73864]: [8915-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 65305 is uninitialized --- fixing
Oct 6 17:03:50 db01 postgres[73864]: [8916-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 70927 is uninitialized --- fixing
Oct 6 17:03:50 db01 postgres[73864]: [8917-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 75286 is uninitialized --- fixing
Oct 6 17:03:50 db01 postgres[73864]: [8918-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 86601 is uninitialized --- fixing
Oct 6 17:03:50 db01 postgres[73864]: [8919-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 91434 is uninitialized --- fixing
Oct 6 17:03:50 db01 postgres[73864]: [8920-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 91435 is uninitialized --- fixing
Oct 6 17:03:51 db01 postgres[73864]: [8921-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 95639 is uninitialized --- fixing
Oct 6 17:03:51 db01 postgres[73864]: [8922-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 105638 is uninitialized --- fixing
Oct 6 17:03:51 db01 postgres[73864]: [8923-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 114550 is uninitialized --- fixing
Oct 6 17:03:51 db01 postgres[73864]: [8924-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 118380 is uninitialized --- fixing
Oct 6 17:03:51 db01 postgres[73864]: [8925-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 121478 is uninitialized --- fixing
Oct 6 17:03:51 db01 postgres[73864]: [8926-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 122098 is uninitialized --- fixing
Oct 6 17:03:52 db01 postgres[73864]: [8927-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 127195 is uninitialized --- fixing
Oct 6 17:03:53 db01 postgres[73864]: [8928-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 152416 is uninitialized --- fixing
Oct 6 17:03:55 db01 postgres[73864]: [8929-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 230192 is uninitialized --- fixing
Oct 6 17:03:56 db01 postgres[73864]: [8930-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_C" page 241091 is uninitialized --- fixing
Oct 6 17:20:31 db01 postgres[73864]: [8931-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_A" page 32379133 is uninitialized --- fixing
Oct 6 17:20:32 db01 postgres[73864]: [8932-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_A" page 32385954 is uninitialized --- fixing
Oct 6 17:20:32 db01 postgres[73864]: [8933-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_A" page 32393740 is uninitialized --- fixing
Oct 6 17:20:32 db01 postgres[73864]: [8934-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_A" page 32394599 is uninitialized --- fixing
Oct 6 17:20:32 db01 postgres[73864]: [8935-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_A" page 32402510 is uninitialized --- fixing
Oct 6 17:20:33 db01 postgres[73864]: [8936-1] pgsql(at)dbcluster 73864 0: WARNING: relation "tbl_A" page 32409328 is uninitialized --- fixing
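For anyone tallying these at home, a quick count over syslog is all it takes (a sketch; two sample lines are inlined here in place of the real log file):

```shell
# Count VACUUM's uninitialized-page fixes. The printf stands in for
# reading the real /var/log/messages.
printf '%s\n' \
  'Oct  6 17:03:50 db01 postgres[73864]: WARNING:  relation "tbl_C" page 70927 is uninitialized --- fixing' \
  'Oct  6 17:20:33 db01 postgres[73864]: WARNING:  relation "tbl_A" page 32409328 is uninitialized --- fixing' \
  | grep -c 'is uninitialized --- fixing'
# prints 2
```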

>> What's odd to me is not the failure scenarios that come from a system running out of FDs (though seek(2)'ing consuming an FD seems odd), it's more that it's still possible for a master DB's VACUUM to clean up a bogus or partial page write, and have the slave crash when the WAL entry is shipped over.
>
> It looks to me like vacuumlazy.c doesn't bother to emit a WAL record at
> all when fixing an all-zeroes heap page. I'm not sure if that's a
> problem or not. The first actual use of such a page ought to result in
> re-init'ing it anyway (cf XLOG_HEAP_INIT_PAGE logic in heap_insert),
> so right offhand I don't see a path from this to the slave-side failures
> you saw. (But on the other hand I'm only firing on one cylinder today
> because of a head cold, so maybe I'm just missing it.)

(Good luck w/ the recovery)

> Do the slave-side failures correspond to pages that were reported as
> "fixed" on the master?

It doesn't appear so.

Oct 6 17:12:03 db02 postgres[15191]: [13-1] @ 15191 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized

I do see a few other odd or related messages on the slave, however:

[ from a recovery ]
Oct 6 07:14:50 db02 postgres[11972]: [12-1] @ 11972 0: LOG: record with zero length at 9E/699842C0
Oct 6 07:14:50 db02 postgres[11972]: [13-1] @ 11972 0: LOG: redo done at 9E/69984290
Oct 6 07:14:50 db02 postgres[11972]: [14-1] @ 11972 0: LOG: last completed transaction was at log time 2012-10-06 07:12:11.87694+00
Oct 6 07:14:50 db02 postgres[11972]: [15-1] @ 11972 0: LOG: checkpoint starting: end-of-recovery immediate

[ Best to let sleeping dragons lie: I shut the slave down for the night in case there was some form of feedback that could be sent back and hang the master ]
Oct 6 07:51:32 db02 postgres[15505]: [16-1] @ 15505 0: LOG: database system is shut down

[ I think this is from trying to connect to an IPv6 addr for localhost, but IPv6 has been disabled. Including this because the verbosity level seems excessive. ]
Oct 6 16:36:06 db02 postgres[4529]: [7-1] @ 4529 0: LOG: could not create socket for statistics collector: Protocol not supported
Oct 6 16:36:06 db02 postgres[4529]: [8-1] @ 4529 0: LOG: trying another address for the statistics collector

[ Here the database noted the bogus page, but it was still able to start up ]
Oct 6 16:36:06 db02 postgres[4535]: [9-1] @ 4535 0: LOG: database system was interrupted; last known up at 2012-10-06 08:07:41 UTC
Oct 6 16:36:06 db02 postgres[4535]: [10-1] @ 4535 0: LOG: entering standby mode
Oct 6 16:36:06 db02 postgres[4535]: [11-1] @ 4535 0: LOG: unexpected pageaddr 88/EE000000 in log file 158, segment 110, offset 0

[ Including this as a small nit: there are two lines emitted for the following log entry, which seems incorrect ]
Oct 6 16:41:31 db02 postgres[6416]: [9-1] @ 6416 0: FATAL: could not connect to the primary server: fe_sendauth: no password supplied
Oct 6 16:41:31 db02 postgres[6416]: [9-2]

Oct 6 16:41:36 db02 postgres[6445]: [9-1] @ 6445 0: LOG: streaming replication successfully connected to primary
Oct 6 16:41:37 db02 postgres[4535]: [12-1] @ 4535 0: LOG: redo starts at 9E/6E000020
Oct 6 16:41:37 db02 postgres[4535]: [13-1] @ 4535 0: LOG: consistent recovery state reached at 9E/6E003E10
Oct 6 16:41:37 db02 postgres[4529]: [9-1] @ 4529 0: LOG: database system is ready to accept read only connections

[ Here's what the slave sees on a crash ]
Oct 6 17:09:17 db02 postgres[15561]: [10-1] pgsql(at)postgres 15561 0: FATAL: the database system is starting up
Oct 6 17:11:52 db02 postgres[15191]: [12-1] @ 15191 0: LOG: consistent recovery state reached at 9E/E3DDE2B0
Oct 6 17:11:52 db02 postgres[15185]: [9-1] @ 15185 0: LOG: database system is ready to accept read only connections
Oct 6 17:12:03 db02 postgres[15191]: [13-1] @ 15191 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:12:03 db02 postgres[15191]: [13-2] @ 15191 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:12:03 db02 postgres[15191]: [14-1] @ 15191 0: PANIC: WAL contains references to invalid pages
Oct 6 17:12:03 db02 postgres[15191]: [14-2] @ 15191 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:12:03 db02 postgres[15185]: [10-1] @ 15185 0: LOG: startup process (PID 15191) was terminated by signal 6: Abort trap
[ /me scratches head for a bit ]
Oct 6 17:21:34 db02 postgres[19614]: [9-1] @ 19614 0: LOG: database system was interrupted while in recovery at log time 2012-10-06 08:07:24 UTC
Oct 6 17:21:34 db02 postgres[19614]: [9-2] @ 19614 0: HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
Oct 6 17:21:34 db02 postgres[19614]: [10-1] @ 19614 0: LOG: entering standby mode
Oct 6 17:21:34 db02 postgres[19614]: [11-1] @ 19614 0: LOG: redo starts at 9E/6E000020

[ This is the point that I came up with my hypothesis and enabled full_page_writes ]
Oct 6 17:24:14 db02 postgres[19608]: [9-1] @ 19608 0: LOG: received SIGHUP, reloading configuration files
Oct 6 17:24:14 db02 postgres[19608]: [10-1] @ 19608 0: LOG: parameter "full_page_writes" changed to "on"
Oct 6 17:26:34 db02 postgres[19614]: [12-1] @ 19614 0: LOG: consistent recovery state reached at 9E/E3DDE2B0
Oct 6 17:26:34 db02 postgres[19608]: [11-1] @ 19608 0: LOG: database system is ready to accept read only connections
Oct 6 17:26:44 db02 postgres[19614]: [13-1] @ 19614 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:26:44 db02 postgres[19614]: [13-2] @ 19614 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:26:44 db02 postgres[19614]: [14-1] @ 19614 0: PANIC: WAL contains references to invalid pages
Oct 6 17:26:44 db02 postgres[19614]: [14-2] @ 19614 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:26:44 db02 postgres[19608]: [12-1] @ 19608 0: LOG: startup process (PID 19614) was terminated by signal 6: Abort trap
Oct 6 17:26:44 db02 postgres[19608]: [13-1] @ 19608 0: LOG: terminating any other active server processes
[ Given things were still panic'ing on the same page of data, I enabled full_page_writes on the master, performed a VACUUM and CHECKPOINT, and began resync'ing data to the slave ]

[ And here I started up the slave again ]
Oct 7 00:32:00 db02 postgres[66904]: [9-1] @ 66904 0: LOG: database system was interrupted; last known up at 2012-10-06 17:48:10 UTC
Oct 7 00:32:00 db02 postgres[66904]: [10-1] @ 66904 0: LOG: entering standby mode
Oct 7 00:32:00 db02 postgres[66904]: [11-1] @ 66904 0: LOG: unexpected pageaddr 89/B9000000 in log file 159, segment 55, offset 0
Oct 7 00:32:00 db02 postgres[66906]: [9-1] [unknown](at)[unknown] 66906 0: LOG: connection received: host=[local]
Oct 7 00:32:00 db02 postgres[66906]: [10-1] pgsql(at)postgres 66906 0: FATAL: the database system is starting up
Oct 7 00:32:00 db02 postgres[66905]: [9-1] @ 66905 0: LOG: streaming replication successfully connected to primary
Oct 7 00:32:01 db02 postgres[66913]: [9-1] [unknown](at)[unknown] 66913 0: LOG: connection received: host=[local]
Oct 7 00:32:01 db02 postgres[66913]: [10-1] pgsql(at)postgres 66913 0: FATAL: the database system is starting up
Oct 7 00:32:01 db02 postgres[66904]: [12-1] @ 66904 0: LOG: redo starts at 9F/36000020
Oct 7 00:32:01 db02 postgres[66904]: [13-1] @ 66904 0: LOG: consistent recovery state reached at 9F/36003FD8
Oct 7 00:32:02 db02 postgres[66897]: [9-1] @ 66897 0: LOG: database system is ready to accept read only connections

[ At this point I initiated a VACUUM and rebuilt a 400GB table using pg_reorg(1) to evict all WAL files that would contain possibly poisonous data ]

And that's about all I know at this point in time. Things have been up and running successfully and without issue ever since. I'm tempted to pg_reorg(1) all of the tables in order to guarantee that they are 100% intact, since I can't easily VACUUM FULL them (several of the largest, most frequently used tables are several TB in size). LMK if there's anything else you're interested in seeing or having outlined.

Again, the master recovered gracefully, didn't require any manual intervention other than increasing kern.maxfiles, and hasn't skipped a beat since kern.maxfiles was increased. Having the slaves crap out when the master executed a VACUUM, however, is of interest. When running the manual VACUUM, I had windows open to both db01 and db02, and db02's backend panic'ing was clearly linked to the VACUUM warnings on db01.

In the regression testing environments, is it possible to enable a test mode that runs with full_page_writes disabled, in order to push on the boundaries of that configuration? The IO bandwidth savings are substantial for small mutations, and for folks lucky enough to run on ZFS (i.e. not ufs, ext*), it'd be really cool to be able to realize those savings. As is, it appears that standalone instances of Pg are safe with full_page_writes disabled, but that slaves are still making assumptions about receiving full-page WAL entries in some corner cases. My observations lead me to believe that replication works fine with full_page_writes disabled until the master cleans up a zero-filled or damaged page.
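Something like the following is what I had in mind (a sketch; I haven't verified that every in-tree Makefile plumbs EXTRA_REGRESS_OPTS through to pg_regress, so treat the make invocation as an assumption):

```shell
# Sketch: run the regression suite against a temp cluster whose config
# disables full page writes.
cat > /tmp/fpw_off.conf <<'EOF'
full_page_writes = off
EOF
# Hypothetical invocation; verify your version passes this through:
#   make check EXTRA_REGRESS_OPTS='--temp-config=/tmp/fpw_off.conf'
```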

For the archives (if anyone has made it this far down the note), here's what I did to operationally "fix" things:

1) On the master, re-enabled full_page_writes and HUP'ed Pg to reload the setting.
2) Fired off a VACUUM ANALYZE to generate some WAL data. This step isn't strictly necessary, but I wanted to make sure there was something to CHECKPOINT.
3) Issued a CHECKPOINT and waited until it completed.
4) Re-synced data to the slave (via rsync(1) or whatever your tool of choice is).
5) Restarted the slave.
6) Verified that the slave is in fact replicating via pg_stat_replication.
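The steps above, as a dry-run shell sketch (paths, hostnames, and the config-edit mechanism are all assumptions; swap the echo wrapper for real execution):

```shell
# Dry-run sketch of the recovery steps; "run" only echoes, so nothing
# here touches a live cluster. $PGDATA and "slave" are hypothetical.
run() { echo "+ $*"; }

# 1) re-enable full_page_writes in postgresql.conf, then reload
run 'sed -i.bak "s/^full_page_writes.*/full_page_writes = on/" $PGDATA/postgresql.conf'
run pg_ctl reload -D '$PGDATA'
# 2) generate WAL so there is something to checkpoint
run psql -c 'VACUUM ANALYZE'
# 3) flush it
run psql -c 'CHECKPOINT'
# 4) re-sync the slave's data directory
run rsync -a --delete '$PGDATA/' slave:'$PGDATA/'
# 5) restart the slave
run ssh slave pg_ctl restart -D '$PGDATA'
# 6) confirm replication
run psql -c 'SELECT * FROM pg_stat_replication'
```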

-sc

--
Sean Chittenden
sean(at)chittenden(dot)org


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-11 03:03:26
Message-ID: 57B4F1DB-D8E1-4195-86C4-F66E61915B84@chittenden.org
Lists: pgsql-general

>> Oct 5 15:00:25 db01 postgres[76648]: [5944-1] javafail(at)dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system
>
> This message must be coming from initSelfPipe(), and after poking around
> a bit I think the failure must be occurring while a new backend is
> attempting to do "OwnLatch(&MyProc->procLatch)" in InitProcess. The
> reason the postmaster treats this as a crash is that the new backend
> just armed the dead-man switch (MarkPostmasterChildActive) but it exits
> without doing ProcKill which would disarm it. So this is just an
> order-of-operations bug in InitProcess: we're assuming that it can't
> fail before reaching "on_shmem_exit(ProcKill, 0)", and the latch
> additions broke that. (Though looking at it, assuming that the
> PGSemaphoreReset call cannot fail seems a tad risky too.)
>
> So that explains the crashes, but it doesn't (directly) explain why you
> had data corruption.

I've since been able to repeat this with full_page_writes=on and have seen identical corruption, so I don't think this is full_page_writes-related anymore. I can also confirm that the master database crashed just once, because it required manual intervention to bring back up (a stale backup label was sitting in the data dir[1]).

The slaves died during the post-crash VACUUM just as before. I'm completing the VACUUM now and am seeing multiple warnings from it:

WARNING: relation "tbl_a" page 2115352 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115353 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115354 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115355 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115356 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115357 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115358 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115359 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115360 is uninitialized --- fixing

On one of the slaves:

Oct 11 00:17:36 db02 postgres[66904]: [21-1] @ 66904 0: WARNING: page 120547 of relation base/16387/20196 is uninitialized
Oct 11 00:17:36 db02 postgres[66904]: [21-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed 0
Oct 11 00:17:36 db02 postgres[66904]: [22-1] @ 66904 0: PANIC: WAL contains references to invalid pages
Oct 11 00:17:36 db02 postgres[66904]: [22-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed 0
Oct 11 00:17:36 db02 postgres[66897]: [10-1] @ 66897 0: LOG: startup process (PID 66904) was terminated by signal 6: Abort trap
Oct 11 00:17:36 db02 postgres[66897]: [11-1] @ 66897 0: LOG: terminating any other active server processes

Hopefully this is helpful information. -sc

[1] The backup label was residual from the initial sync and wasn't cleaned up for some reason during a pg_stop_backup(). For now I'm chalking this up as a bug in repmgr even though repmgr completed cloning the slave successfully (supposedly).

START WAL LOCATION: 9F/36000020 (file 000000010000009F00000036)
CHECKPOINT LOCATION: 9F/37E9D6D8
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2012-10-06 17:48:10 UTC
LABEL: repmgr_standby_clone_1349545601

--
Sean Chittenden
sean(at)chittenden(dot)org