Re: Errors with physical replication

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: greigwise(at)comcast(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Errors with physical replication
Date: 2018-05-23 01:34:09
Message-ID: 20180523.103409.61588279.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-general

Hello.

At Mon, 21 May 2018 05:18:57 -0700 (MST), greigwise <greigwise(at)comcast(dot)net> wrote in <1526905137308-0(dot)post(at)n3(dot)nabble(dot)com>
> Hello.
>
> We are on Postgresql version 9.6.6. We have 2 EC2 instances in different
> Amazon regions and we are doing physical replication via VPN. It all seems
> to work just fine most of the time. I'm noticing in the logs that we have
> recurring errors (maybe 10 or 12 times per day) that look like this:

<following is digested>

> 2018-05-17 06:36:14 UTC 5af0599f.210d LOG: invalid resource manager ID 49
> 2018-05-17 06:36:14 UTC 5afd22de.7ac4 LOG: started streaming WAL from
> 2018-05-17 07:20:17 UTC 5afd22de.7ac4 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly

> Or some that also look like this:
>
> 2018-05-17 07:20:17 UTC 5af0599f.210d LOG: record with incorrect prev-link
> 2018-05-17 07:20:18 UTC 5afd2d31.1889 LOG: started streaming WAL from
> 2018-05-17 08:03:28 UTC 5afd2d31.1889 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly

> And some like this:
>
> 2018-05-17 23:00:13 UTC 5afd63ec.26fc LOG: invalid magic number 0000 in
> log segment 00000001000003850000003C, offset 10436608
> 2018-05-17 23:00:14 UTC 5afe097d.49aa LOG: started streaming WAL from
> primary at 385/3C000000 on timeline 1

Your replication connection seems quite unstable and is disconnected
frequently. After a disconnection, you will see several kinds of "I
found a broken record in my WAL file" messages; they are the cues for
the standby to switch back to streaming. This in itself is normal
PostgreSQL behavior, with one known exception.
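If you want to confirm that the standby has re-attached after such a
disconnection, one option on 9.6 is to query the walreceiver statistics
view on the replica (the column names below are from the 9.6-era
catalog; later versions rename some of them):

```sql
-- Run on the standby. A row with status = 'streaming' means the
-- walreceiver has successfully reconnected to the primary.
SELECT status, received_lsn, last_msg_receipt_time
FROM pg_stat_wal_receiver;
```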

> Then, like maybe once every couple months or so, we have a crash with logs
> looking like this:

> 2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING:
> terminating connection because of crash of another server process

I think these lines follow an error message like "FATAL: invalid
memory alloc request size 3075129344". This is also a kind of "broken
record", but it is known to lead the standby to crash. It is
discussed here:

> [bug fix] Cascaded standby cannot start after a clean shutdown

https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD(at)G01JPEXMBYT05#0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05

> When this last error occurs, the recovery is to go on the replica and remove
> all the WAL logs from the pg_xlog director and then restart Postgresql.
> Everything seems to recover and come up fine. I've done some tests
> comparing counts between the replica and the primary and everything seems
> synced just fine from all I can tell.

Those are the right recovery steps, as far as I can tell from the
attached log messages.
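For reference, that recovery procedure can be sketched roughly as
below. The data directory path is an assumption for illustration, not
taken from your report, and the commands must only ever be run on the
replica:

```shell
# Illustrative sketch only -- adjust PGDATA to your installation.
PGDATA=/var/lib/postgresql/9.6/main

pg_ctl -D "$PGDATA" stop -m fast       # stop the replica
rm -f "$PGDATA"/pg_xlog/0000*          # remove WAL segments (leave archive_status/ alone)
pg_ctl -D "$PGDATA" start              # restart; the standby re-syncs via streaming
```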

> So, a couple of questions. 1) Should I be worried that my replica is
> corrupt in some way or given that everything *seems* ok, is it reasonable to
> believe that things are working correctly in spite of these errors being
> reported. 2) Is there something I should configure differently to avoid
> some of these errors?

It doesn't seem worth worrying about from the viewpoint of data
integrity, but if the walsender/walreceiver timeouts fire too
frequently, you might need to increase them for better stability.
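Concretely, the relevant settings are wal_sender_timeout on the primary
and wal_receiver_timeout on the standby. The values below are only an
illustration of raising them above the 60-second default, not a tuned
recommendation:

```
# postgresql.conf on the primary
wal_sender_timeout = 120s      # default is 60s

# postgresql.conf on the standby
wal_receiver_timeout = 120s    # default is 60s
```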

> Thanks in advance for any help.
>
> Greig Wise

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
