Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Maciek Sakrejda <maciek(at)heroku(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #7883: "PANIC: WAL contains references to invalid pages" on replica recovery
Date: 2013-02-21 12:04:09
Message-ID: 51260D39.5020505@vmware.com
Lists: pgsql-bugs

On 19.02.2013 00:19, Maciek Sakrejda wrote:
> On Mon, Feb 18, 2013 at 12:57 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>
>> On 16.02.2013 01:49, Daniel Farina wrote:
>>
>>> I guess that means Ubuntu (and probably Debian?) libpq-dev breaks
>>> PG_VERSION_NUM for PGXS=1.
>>>
>>
>> That obviously needs to be fixed in debian. Meanwhile, Maciek, I'd suggest
>> that you build PostgreSQL from sources, install it to some temporary
>> location, and then build xlogdump against that.
>
> That worked, thanks. I have a working xlogdump. Any pointers as to what I
> should look for? This is the contents of the pg_xlog directory:
>
> total 49160
> -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:00 000000010000003C00000093
> -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:47 000000010000003C00000094
> -rw------- 1 udrehggpif7kft postgres 16777216 Feb 15 00:49 000000020000003C00000093
> -rw------- 1 udrehggpif7kft postgres 56 Feb 15 00:49 00000002.history
> drwx------ 2 udrehggpif7kft postgres 4096 Feb 15 00:49 archive_status

I'd like to see the contents of the WAL, starting from the last
checkpoint, up to the point where failover happened. In particular, any
actions on the relation base/16385/16430, which caused the error.
pg_controldata output on the base backup would also be interesting, as well
as the contents of the backup_label file.
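
As a rough sketch of how a saved xlogdump text dump could be narrowed to
that relation, the snippet below keeps only the lines that mention the
database/relfilenode pair. It assumes each record line contains the pair in
the form 16385/16430 (possibly preceded by a tablespace OID); the real
xlogdump output format differs between versions, so the pattern, the
function name and the sample lines are all illustrative assumptions.

```python
import re


def records_for_relation(lines, db_oid, relfilenode):
    """Return dump lines that mention db_oid/relfilenode as a whole token."""
    # Assumption: the relation appears as "<db_oid>/<relfilenode>", possibly
    # preceded by a tablespace OID, e.g. "1663/16385/16430".
    # The lookarounds stop 16430 from matching inside 164301 and so on.
    pattern = re.compile(r"(?<!\d)%d/%d(?!\d)" % (db_oid, relfilenode))
    return [line for line in lines if pattern.search(line)]


# Hypothetical sample lines, NOT real xlogdump output.
sample = [
    "rmgr: Heap  insert rel 1663/16385/16430 blk 12",
    "rmgr: Heap  insert rel 1663/16385/164301 blk 3",
    "rmgr: XLOG  checkpoint",
]
hits = records_for_relation(sample, 16385, 16430)
for line in hits:
    print(line)
```

Only the first sample line is kept; the second refers to a different
relfilenode that merely shares a prefix.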

How long did the standby run between the base backup and the failover?
How many WAL segments?
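
To count segments, the file names in the quoted listing can be decoded: a
WAL segment name is 24 hex digits, 8 each for the timeline, the log id and
the segment number. The sketch below is illustrative only. The helper names
are made up, and the default of 255 segments per log id is an assumption
that should hold for 16 MB segments on releases before 9.3; pass a
different value if it does not apply.

```python
def parse_wal_name(name):
    """Split a 24-hex-digit WAL segment file name into (timeline, log, seg)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits: %r" % name)
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def segments_between(first, last, segs_per_log=255):
    """Count segments from first to last, inclusive, ignoring the timeline."""
    # Assumption: segs_per_log=255 (16 MB segments, pre-9.3 numbering).
    _, log1, seg1 = parse_wal_name(first)
    _, log2, seg2 = parse_wal_name(last)
    return (log2 * segs_per_log + seg2) - (log1 * segs_per_log + seg1) + 1


# File names taken from the pg_xlog listing quoted above.
print(parse_wal_name("000000020000003C00000093"))   # (2, 60, 147)
print(segments_between("000000010000003C00000093",
                       "000000010000003C00000094"))  # 2
```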

One more thing you could try to narrow down the error: restore from the
base backup, and let it run up to the point of failover, but shut it
down just before the failover with "pg_ctl stop -m fast". That should
create a restartpoint, at the latest checkpoint record. Then restart,
and perform failover. If it still throws the same error, we know that
the WAL record that touched the page that doesn't exist was after the
last checkpoint.
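
The sequence above could be scripted roughly as follows. This is a sketch
under assumptions, not a tested procedure: the data directory path is a
placeholder, and it assumes failover is triggered with "pg_ctl promote"
(a trigger file would work differently). By default it only prints the
commands; nothing is executed unless dry_run is turned off.

```python
import subprocess


def narrowing_steps(datadir):
    """Return the pg_ctl invocations for the stop / restart / failover test."""
    return [
        # Fast shutdown of the standby just before the failover point.
        ["pg_ctl", "-D", datadir, "stop", "-m", "fast"],
        # Restart the standby.
        ["pg_ctl", "-D", datadir, "start"],
        # Assumption: failover is performed with "pg_ctl promote".
        ["pg_ctl", "-D", datadir, "promote"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# "/path/to/standby" is a hypothetical placeholder for the data directory.
steps = narrowing_steps("/path/to/standby")
run_steps(steps)
```

If the PANIC still appears after the promote step, the offending WAL record
lies after the last checkpoint, as described above.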

- Heikki
