From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: greigwise(at)comcast(dot)net
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Errors with physical replication
Date: 2018-05-23 01:34:09
Message-ID: 20180523.103409.61588279.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-general
Hello.
At Mon, 21 May 2018 05:18:57 -0700 (MST), greigwise <greigwise(at)comcast(dot)net> wrote in <1526905137308-0(dot)post(at)n3(dot)nabble(dot)com>
> Hello.
>
> We are on Postgresql version 9.6.6. We have 2 EC2 instances in different
> Amazon regions and we are doing physical replication via VPN. It all seems
> to work just fine most of the time. I'm noticing in the logs that we have
> recurring erros (maybe 10 or 12 times per day) that look like this:
<following is digested>
> 2018-05-17 06:36:14 UTC 5af0599f.210d LOG: invalid resource manager ID 49
> 2018-05-17 06:36:14 UTC 5afd22de.7ac4 LOG: started streaming WAL from
> 2018-05-17 07:20:17 UTC 5afd22de.7ac4 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly
> Or some that also look like this:
>
> 2018-05-17 07:20:17 UTC 5af0599f.210d LOG: record with incorrect prev-link
> 2018-05-17 07:20:18 UTC 5afd2d31.1889 LOG: started streaming WAL from
> 2018-05-17 08:03:28 UTC 5afd2d31.1889 FATAL: could not receive data from
> WAL stream: server closed the connection unexpectedly
> And some like this:
>
> 2018-05-17 23:00:13 UTC 5afd63ec.26fc LOG: invalid magic number 0000 in
> log segment 00000001000003850000003C, offset 10436608
> 2018-05-17 23:00:14 UTC 5afe097d.49aa LOG: started streaming WAL from
> primary at 385/3C000000 on timeline 1
Your replication connection seems quite unstable and gets disconnected
frequently. After a disconnection, you will see several kinds of "I
found a broken record in my WAL file" messages; they are the cue for
the standby to switch back to streaming. This in itself is normal
PostgreSQL behavior, with one known exception.
> Then, like maybe once every couple months or so, we have a crash with logs
> looking like this:
> 2018-05-17 08:03:28 UTC hireology 5af47b75.2670 hireology WARNING:
> terminating connection because of crash of another server process
I think those lines follow an error message like "FATAL: invalid
memory alloc request size 3075129344". This is also a kind of "broken
record", but it is known to lead the standby to crash. It is
discussed here:
> [bug fix] Cascaded standby cannot start after a clean shutdown
> When this last error occurs, the recovery is to go on the replica and remove
> all the WAL logs from the pg_xlog directory and then restart Postgresql.
> Everything seems to recover and come up fine. I've done some tests
> comparing counts between the replica and the primary and everything seems
> synced just fine from all I can tell.
Those are the right recovery steps, as far as I can tell from the
attached log messages.
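For reference, the procedure you describe can be sketched like this
(the PGDATA path is only an assumption for a typical 9.6 install;
adjust it for your system):

```shell
# Sketch of the recovery procedure described above, for a 9.6 standby.
# PGDATA below is an assumption -- use your actual data directory.
PGDATA=/var/lib/postgresql/9.6/main

# 1. Stop the standby.
pg_ctl -D "$PGDATA" stop -m fast

# 2. Remove the accumulated WAL segments from pg_xlog (the 9.6 name;
#    the directory was renamed to pg_wal in version 10). The standby
#    re-fetches what it needs from the primary once streaming resumes.
rm -f "$PGDATA"/pg_xlog/0*

# 3. Restart; recovery resumes from the last restartpoint.
pg_ctl -D "$PGDATA" start
```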
> So, a couple of questions. 1) Should I be worried that my replica is
> corrupt in some way or given that everything *seems* ok, is it reasonable to
> believe that things are working correctly in spite of these errors being
> reported. 2) Is there something I should configure differently to avoid
> some of these errors?
It doesn't seem worth worrying about from the viewpoint of data
integrity, but if the walsender/walreceiver timeouts are firing too
frequently, you might need to increase them for better stability.
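Concretely, these are the settings I mean (a sketch only; the 5min
values are illustrative, not a recommendation -- both timeouts default
to 60s in 9.6):

```
# postgresql.conf on the primary: how long a walsender waits for a
# response before dropping the standby connection.
wal_sender_timeout = 5min            # default 60s

# postgresql.conf on the standby: how long the walreceiver waits for
# data before giving up on the primary connection.
wal_receiver_timeout = 5min          # default 60s

# Optional, on the standby: how long to wait before retrying to
# retrieve WAL after a failed attempt.
wal_retrieve_retry_interval = 5s     # default 5s
```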
> Thanks in advance for any help.
>
> Greig Wise
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center