Assertion failure in walreceiver

Lists: pgsql-hackers
From: Greg Stark <stark(at)mit(dot)edu>
To: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Assertion failure in walreceiver
Date: 2010-02-23 14:46:40
Message-ID: 407d949e1002230646v37673e01s52269d75084011ba@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I tried to set up a simple master/slave setup and immediately ran into
this assertion failure. The slave is just a cold copy of the database
immediately after initdb. The first WAL segment hasn't been archived
yet. It sees that the first archive fail hasn't been archived yet,
starts up walreceiver but it looks like the start point hasn't been
initialized yet because it hasn't processed any checkpoint WAL records
yet.

$ /usr/local/pgsql/bin/postgres -D /var/tmp/pg85/s/
LOG: database system was shut down at 2010-02-23 14:30:08 GMT
cp: cannot stat `/var/tmp/pg85/w/000000010000000000000000': No such
file or directory
TRAP: FailedAssertion("!(startpoint.xlogid != 0 || startpoint.xrecoff
!= 0)", File: "libpqwalreceiver.c", Line: 87)
LOG: WAL receiver process (PID 3536) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
LOG: startup process (PID 3534) exited with exit code 2
LOG: aborting startup due to startup process failure

--
greg


From: Greg Stark <stark(at)mit(dot)edu>
To: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-24 18:18:15
Message-ID: 407d949e1002241018n73682466vb87fd788b5ddc9fd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Did anyone see this? This seems like a pretty grave problem in
streaming replication.

On Tue, Feb 23, 2010 at 2:46 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> I tried to set up a simple master/slave setup and immediately ran into
> this assertion failure. The slave is just a cold copy of the database
> immediately after initdb. The first WAL segment hasn't been archived
> yet. It sees that the first archive fail hasn't been archived yet,
> starts up walreceiver but it looks like the start point hasn't been
> initialized yet because it hasn't processed any checkpoint WAL records
> yet.
>
> $ /usr/local/pgsql/bin/postgres -D /var/tmp/pg85/s/
> LOG:  database system was shut down at 2010-02-23 14:30:08 GMT
> cp: cannot stat `/var/tmp/pg85/w/000000010000000000000000': No such
> file or directory
> TRAP: FailedAssertion("!(startpoint.xlogid != 0 || startpoint.xrecoff
> != 0)", File: "libpqwalreceiver.c", Line: 87)
> LOG:  WAL receiver process (PID 3536) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> LOG:  startup process (PID 3534) exited with exit code 2
> LOG:  aborting startup due to startup process failure
>
>
> --
> greg
>

--
greg


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-24 20:29:54
Message-ID: 4B858C42.4080501@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:
> I tried to set up a simple master/slave setup and immediately ran into
> this assertion failure. The slave is just a cold copy of the database
> immediately after initdb. The first WAL segment hasn't been archived
> yet. It sees that the first archive fail hasn't been archived yet,
> starts up walreceiver but it looks like the start point hasn't been
> initialized yet because it hasn't processed any checkpoint WAL records
> yet.
>
> $ /usr/local/pgsql/bin/postgres -D /var/tmp/pg85/s/
> LOG: database system was shut down at 2010-02-23 14:30:08 GMT
> cp: cannot stat `/var/tmp/pg85/w/000000010000000000000000': No such
> file or directory
> TRAP: FailedAssertion("!(startpoint.xlogid != 0 || startpoint.xrecoff
> != 0)", File: "libpqwalreceiver.c", Line: 87)
> LOG: WAL receiver process (PID 3536) was terminated by signal 6: Aborted
> LOG: terminating any other active server processes
> LOG: startup process (PID 3534) exited with exit code 2
> LOG: aborting startup due to startup process failure

Hmm, interesting corner case. That assertion is checking that the
streaming starting point is initialized correctly, but 0/0 is actually a
valid value when you start streaming from the very first WAL segment.

I guess we can just remove the assertion. If an incorrect zero value
finds its way there due to some bug, nothing particularly bad will
happen; you'll just get an error because the WAL segment most likely
doesn't exist in the primary anymore.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-25 02:33:02
Message-ID: 3f0b79eb1002241833g31a1f28fy82dce25f9ad1cc7c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 5:29 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Hmm, interesting corner case. That assertion is checking that the
> streaming starting point is initialized correctly, but 0/0 is actually a
> valid value when you start streaming from the very first WAL segment.
>
> I guess we can just remove the assertion. If an incorrect zero value
> finds its way there due to some bug, nothing particularly bad will
> happen; you'll just get an error because the WAL segment most likely
> doesn't exist in the primary anymore.

And we need to remove the following code from XLogSend(). This code is
useless now since the initial CopyData message is guaranteed to arrive
at the primary before doing XLogSend().

> /*
> * Invalid position means that we have not yet received the initial
> * CopyData message from the slave that indicates where to start the
> * streaming.
> */
> if (sentPtr.xlogid == 0 &&
> sentPtr.xrecoff == 0)
> return true;

I have one question. Do we support starting an archive recovery and
standby server from a cold backup (not a base backup taken by online
backup)? Though I think they would work and be very useful, I'm not
sure they are safe.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-25 07:31:45
Message-ID: 4B862761.9000703@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Fujii Masao wrote:
> And we need to remove the following code from XLogSend(). This code is
> useless now since the initial CopyData message is guaranteed to arrive
> at the primary before doing XLogSend().
>
>> /*
>> * Invalid position means that we have not yet received the initial
>> * CopyData message from the slave that indicates where to start the
>> * streaming.
>> */
>> if (sentPtr.xlogid == 0 &&
>> sentPtr.xrecoff == 0)
>> return true;

Yeah, good catch.

Committed removal of that and the assertion. You still can't use a copy
of the data directory taken right after initdb, because the initial
checkpoint record has the flag set indicating that archiving is not
enabled. While we're at it, the error message doesn't seem right:

FATAL: recovery connections cannot start because the
recovery_connections parameter is disabled on the WAL source server

recovery_connections is on by default, the real problem is that
archive_command and max_wal_senders are disabled.

> I have one question. Do we support starting an archive recovery and
> standby server from a cold backup (not a base backup taken by online
> backup)? Though I think they would work and be very useful, I'm not
> sure they are safe.

I don't see why not. We support that in PITR, streaming replication is
just another way of getting the logs to the server.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-25 08:04:05
Message-ID: 3f0b79eb1002250004h72a9535ck40943826dc449cf7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 4:31 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> I have one question. Do we support starting an archive recovery and
>> standby server from a cold backup (not a base backup taken by online
>> backup)? Though I think they would work and be very useful, I'm not
>> sure they are safe.
>
> I don't see why not. We support that in PITR, streaming replication is
> just another way of getting the logs to the server.

I thought that because, AFAIK, there is no document about initial startup
of PITR from a cold backup. But I'd be happy if it's supported. After
failover happens, previous primary server might be able to become standby
from its old data without taking a new backup from new primary.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Greg Stark <stark(at)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-25 13:41:59
Message-ID: 407d949e1002250541w6f8d50f4sceeda9640c53313f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 7:31 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Committed removal of that and the assertion. You still can't use a copy
> of the data directory taken right after initdb, because the initial
> checkpoint record has the flag set indicating that archiving is not
> enabled. While we're at it, the error message doesn't seem right:
>
> FATAL:  recovery connections cannot start because the
> recovery_connections parameter is disabled on the WAL source server
>
> recovery_connections is on by default, the real problem is that
> archive_command and max_wal_senders are disabled.

Perhaps we need to put these flags in a record on startup. If they're
not set on the checkpoint you start at you check if the next record is
a shutdown and it starts up with the flags set.

I'm not sure that's exactly right as I've never looked at the wal
sequence on shutdown and startup. But it seems like a problem if you
want to start replication, find that archive_mode needs to be set so
you restart your database with it set but then still can't start up
the slave until a checkpoint happens on the master.

--
greg


From: Greg Stark <stark(at)mit(dot)edu>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-26 03:11:11
Message-ID: 407d949e1002251911y4998255av84931a1b033f5e49@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 7:31 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> While we're at it, the error message doesn't seem right:
>
> FATAL:  recovery connections cannot start because the
> recovery_connections parameter is disabled on the WAL source server
>
> recovery_connections is on by default, the real problem is that
> archive_command and max_wal_senders are disabled.

So do I understand this right, if you have archive_mode disabled and
try to start a slave you get this error. Then when you shut down your
master and set archive_mode on and bring it up and try again you'll
*still* get this message because the last checkpoint will be the final
shutdown checkpoint where archive_mode was still disabled?

--
greg


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-26 14:31:57
Message-ID: 4B87DB5D.6020305@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:
> So do I understand this right, if you have archive_mode disabled and
> try to start a slave you get this error. Then when you shut down your
> master and set archive_mode on and bring it up and try again you'll
> *still* get this message because the last checkpoint will be the final
> shutdown checkpoint where archive_mode was still disabled?

Right.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, "<pgsql-hackers(at)postgresql(dot)org>" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Assertion failure in walreceiver
Date: 2010-02-26 14:35:13
Message-ID: 4B87DC21.6040202@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:
> On Thu, Feb 25, 2010 at 7:31 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Committed removal of that and the assertion. You still can't use a copy
>> of the data directory taken right after initdb, because the initial
>> checkpoint record has the flag set indicating that archiving is not
>> enabled. While we're at it, the error message doesn't seem right:
>>
>> FATAL: recovery connections cannot start because the
>> recovery_connections parameter is disabled on the WAL source server
>>
>> recovery_connections is on by default, the real problem is that
>> archive_command and max_wal_senders are disabled.
>
> Perhaps we need to put these flags in a record on startup. If they're
> not set on the checkpoint you start at you check if the next record is
> a shutdown and it starts up with the flags set.

Seems reasonable, though if you're unlucky the startup record goes into
the next WAL segment, which might not exist on the standby yet. So I
don't think you can "peek ahead" to see if the record exists, but you
could downgrade the fatal error to a warning, and refrain from opening
for read-only connections until you see the startup record.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com