'full_page_writes=off' , VACUUM and crashing streaming slaves...

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: pgsql-general(at)postgresql(dot)org
Subject: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-06 17:32:00
Message-ID: F0D8FB08-5CD7-4905-9F9D-ECDB0E4ADD74@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Master/slave replication, the master and slave both have full_page_writes disabled because the system is running on ZFS.

While the master was still being setup and data was being loaded, the system ran out of open file descriptors and Pg crashed a few times until this was realized and corrected. Streaming replication was setup and was correctly passing data to the slave:

repmgr(at)db01=# select * FROM pg_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+-----------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
68057 | 16384 | repmgr | walreceiver | 192.168.100.183 | | 60743 | 2012-10-06 16:41:36.963769+00 | streaming | 9E/E2C81210 | 9E/E2C81210 | 9E/E2C81210 | 9E/E2C80C68 | 0 | async
(1 row)

Once the host was up and running, I kicked off a manual VACUUM, at which point I see a bunch of the following errors on the master:

INFO: vacuuming "schma.tbl"
WARNING: relation "tbl" page 65305 is uninitialized --- fixing
WARNING: relation "tbl" page 70927 is uninitialized --- fixing
WARNING: relation "tbl" page 75286 is uninitialized --- fixing
WARNING: relation "tbl" page 86601 is uninitialized --- fixing
WARNING: relation "tbl" page 91434 is uninitialized --- fixing
WARNING: relation "tbl" page 91435 is uninitialized --- fixing
WARNING: relation "tbl" page 95639 is uninitialized --- fixing
WARNING: relation "tbl" page 105638 is uninitialized --- fixing
WARNING: relation "tbl" page 114550 is uninitialized --- fixing
WARNING: relation "tbl" page 118380 is uninitialized --- fixing
WARNING: relation "tbl" page 121478 is uninitialized --- fixing
WARNING: relation "tbl" page 122098 is uninitialized --- fixing
WARNING: relation "tbl" page 127195 is uninitialized --- fixing
WARNING: relation "tbl" page 152416 is uninitialized --- fixing
WARNING: relation "tbl" page 230192 is uninitialized --- fixing
WARNING: relation "tbl" page 241091 is uninitialized --- fixing
INFO: scanned index "tbl_pkey" to remove 447 row versions

and on the slave I see:

Oct 6 17:04:02 db02 postgres[4535]: [14-1] @ 4535 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:04:02 db02 postgres[4535]: [14-2] @ 4535 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:04:02 db02 postgres[4535]: [15-1] @ 4535 0: PANIC: WAL contains references to invalid pages
Oct 6 17:04:02 db02 postgres[4535]: [15-2] @ 4535 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:04:02 db02 postgres[4529]: [10-1] @ 4529 0: LOG: startup process (PID 4535) was terminated by signal 6: Abort trap
Oct 6 17:04:02 db02 postgres[4529]: [11-1] @ 4529 0: LOG: terminating any other active server processes
Oct 6 17:04:02 db02 postgres[8345]: [26-1] pgsql(at)repmgr 8345 0: WARNING: terminating connection because of crash of another server process
Oct 6 17:04:02 db02 postgres[8345]: [26-2] pgsql(at)repmgr 8345 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 6 17:04:02 db02 postgres[8345]: [26-3] pgsql(at)repmgr 8345 0: HINT: In a moment you should be able to reconnect to the database and repeat your command.

Restarting the slave, I see:

Oct 6 17:11:52 db02 postgres[15191]: [12-1] @ 15191 0: LOG: consistent recovery state reached at 9E/E3DDE2B0
Oct 6 17:11:52 db02 postgres[15185]: [9-1] @ 15185 0: LOG: database system is ready to accept read only connections
Oct 6 17:12:03 db02 postgres[15191]: [13-1] @ 15191 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:12:03 db02 postgres[15191]: [13-2] @ 15191 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:12:03 db02 postgres[15191]: [14-1] @ 15191 0: PANIC: WAL contains references to invalid pages
Oct 6 17:12:03 db02 postgres[15191]: [14-2] @ 15191 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:12:03 db02 postgres[15185]: [10-1] @ 15185 0: LOG: startup process (PID 15191) was terminated by signal 6: Abort trap
Oct 6 17:12:03 db02 postgres[15185]: [11-1] @ 15185 0: LOG: terminating any other active server processes

At this point in time, I'm suspecting full_page_writes=off is the culprit and it's violating logic in several places. I haven't tracked things down further, but I thought I'd at least bring this to people's attention so that the appropriate warnings can be put in place. The slave can not be recovered with the corrupted WAL entries, however, and they need to be resynced with full_page_writes enabled (at least that's my assumption at present):

Oct 6 17:26:34 db02 postgres[19614]: [12-1] @ 19614 0: LOG: consistent recovery state reached at 9E/E3DDE2B0
Oct 6 17:26:34 db02 postgres[19608]: [11-1] @ 19608 0: LOG: database system is ready to accept read only connections
Oct 6 17:26:44 db02 postgres[19614]: [13-1] @ 19614 0: WARNING: page 2943 of relation base/16387/20196 is uninitialized
Oct 6 17:26:44 db02 postgres[19614]: [13-2] @ 19614 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:26:44 db02 postgres[19614]: [14-1] @ 19614 0: PANIC: WAL contains references to invalid pages
Oct 6 17:26:44 db02 postgres[19614]: [14-2] @ 19614 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 2944, lastBlockVacuumed 2403
Oct 6 17:26:44 db02 postgres[19608]: [12-1] @ 19608 0: LOG: startup process (PID 19614) was terminated by signal 6: Abort trap
Oct 6 17:26:44 db02 postgres[19608]: [13-1] @ 19608 0: LOG: terminating any other active server processes

Cheers. I'll report back if I dig anything else up. -sc

[ plz CC as I'm not subscribed to the list ]]

--
Sean Chittenden
sean(at)chittenden(dot)org

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Edson Richter 2012-10-06 19:41:11 Help with query timeout
Previous Message Johnny Tan 2012-10-06 12:24:16 Re: [Pgbouncer-general] Again, problem with pgbouncer