Re: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Lonni J Friedman <netllama(at)gmail(dot)com>
To: Dan Kogan <dan(at)iqtell(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Date: 2013-06-23 02:08:43
Message-ID: CAP=oouHD5UEoTuJTrWPFZgQ5fr3PKvb7h0dp+dL5boWMumyuRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Assuming that you still have $PGDATA from the broken instance (such
that you can reproduce the crash again), there might be a way to debug
it further. I'd guess that something like bad RAM or storage could
cause an index to get corrupted in this fashion, but the fact that
you're using AWS makes that less likely. Someone far more
knowledgeable than I will need to provide guidance on how to debug
this though.

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan <dan(at)iqtell(dot)com> wrote:
> Re-seeding the standby with a full base backup does seem to make the error go away.
> The standby started, caught up and has been working for about 2 hours.
>
> The file in the error message was an index. We rebuilt it just in case.
> Is there any way to debug the issue at this point?
>
>
>
> -----Original Message-----
> From: Lonni J Friedman [mailto:netllama(at)gmail(dot)com]
> Sent: Saturday, June 22, 2013 4:11 PM
> To: Dan Kogan
> Cc: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages
>
> Looks like some kind of data corruption. Question is whether it came from the master, or was created by the standby. If you re-seed the standby with a full (base) backup, does the problem go away?
>
> On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan(at)iqtell(dot)com> wrote:
>> Hello,
>>
>>
>>
>> Today our standby instance stopped working with this error in the log:
>>
>>
>>
>> 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING: page 158130 of
>> relation
>> pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
>>
>> 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT: xlog redo vacuum:
>> rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>>
>> 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC: WAL contains
>> references to invalid pages
>>
>> 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT: xlog redo vacuum:
>> rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>>
>> 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG: startup process (PID
>> 8367) was terminated by signal 6: Aborted
>>
>> 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG: terminating any other
>> active server processes
>>
>>
>>
>> After re-start the same exact error occurred.
>>
>>
>>
>> We thought that maybe we hit this bug -
>> http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html.
>>
>> However, there is nothing in our log about sub-transactions, so it
>> didn't seem the same to us.
>>
>>
>>
>> Any advice on how to further debug this so we can avoid this in the
>> future is appreciated.
>>
>>
>>
>> Environment:
>>
>>
>>
>> AWS, High I/O instance (hi1.4xlarge), 60GB RAM
>>
>>
>>
>> Software and settings:
>>
>>
>>
>> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc
>> (Ubuntu/Linaro
>> 4.5.2-8ubuntu4) 4.5.2, 64-bit
>>
>>
>>
>> archive_command rsync -a %p
>> slave:/var/lib/postgresql/replication_load/%f
>>
>> archive_mode on
>>
>> autovacuum_freeze_max_age 1000000000
>>
>> autovacuum_max_workers 6
>>
>> checkpoint_completion_target 0.9
>>
>> checkpoint_segments 128
>>
>> checkpoint_timeout 30min
>>
>> default_text_search_config pg_catalog.english
>>
>> hot_standby on
>>
>> lc_messages en_US.UTF-8
>>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message itishree sukla 2013-06-23 05:32:51 Re: File System backup
Previous Message Dan Kogan 2013-06-22 23:17:11 Re: Standby stopped working after PANIC: WAL contains references to invalid pages