Re: WIP: Failover Slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-24 10:02:09
Message-ID: CAMsr+YEacvQiHb1HG1KVwGb2QUBSmku5FeaFT4RFgttoG4a1Rw@mail.gmail.com
Lists: pgsql-hackers

On 24 February 2016 at 03:53, Oleksii Kliukin <alexk(at)hintbits(dot)com> wrote:

>
> I found the following issue when shutting down a master with a connected
> replica that uses a physical failover slot:
>
> 2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07
> CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
> 2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07
> CET,,0,DEBUG,00000,"archived transaction log file
> ""000000010000000000000003""",,,,,,,,,""
> 2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07
> CET,,0,PANIC,XX000,"concurrent transaction log activity while database
> system is shutting down",,,,,,,,,""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07
> CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal
> 6: Abort trap",,,,,,,,,""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07
> CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,
>
>
Odd that I didn't see that in my testing. Thanks very much for this. I
concur with your explanation.

> Basically, the issue is that CreateCheckPoint calls
> CheckpointReplicationSlots, which currently produces WAL, and this violates
> the assumption at line xlog.c:8492
>
> if (shutdown && checkPoint.redo != ProcLastRecPtr)
>     ereport(PANIC,
>             (errmsg("concurrent transaction log activity while database system is shutting down")));
>

Interesting problem.

It might be reasonably harmless to omit writing WAL for failover slots
during a shutdown checkpoint. We're using WAL to move data to the replicas
but we don't really need it for local redo and correctness on the master.
The trouble is that we do of course redo failover slot updates on the
master and we don't really want a slot to go backwards vs its on-disk state
before a crash. That's not too harmful, but it could lead to us losing a
slot's catalog_xmin increase, so the slot thinks catalog rows are still
readable when they could actually have been vacuumed away.

CheckpointReplicationSlots notes that:

* This needn't actually be part of a checkpoint, but it's a convenient
* location.

... and I suspect the answer there is simply to move the slot checkpoint to
occur prior to the WAL checkpoint rather than during it. I'll investigate.
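
To illustrate why the ordering matters, here is a minimal toy model in C, not
PostgreSQL code. It assumes a simplified picture of the shutdown check: the
redo pointer is taken as the next WAL insert position, and the checkpoint
record must then start exactly there. The type name XLogRecPtr mirrors
PostgreSQL's, but every toy_* function and the counter-based WAL are
hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's type of the same name */

/* Toy WAL: just a counter for the next insert position (assumption). */
static XLogRecPtr next_insert_pos = 0;

/* Insert one toy WAL record and return the position where it starts. */
static XLogRecPtr
toy_xlog_insert(void)
{
    XLogRecPtr start = next_insert_pos;

    next_insert_pos += 1;
    return start;
}

/* Toy slot checkpoint: writes WAL only when a failover slot is dirty. */
static void
toy_checkpoint_slots(bool failover_slot_dirty)
{
    if (failover_slot_dirty)
        (void) toy_xlog_insert();
}

/*
 * Toy shutdown checkpoint. Returns true if the "no concurrent WAL activity"
 * check passes, false where the real server would PANIC.
 *
 * slots_first = false models the current ordering (slot checkpoint inside
 * the WAL checkpoint); slots_first = true models the proposed ordering.
 */
static bool
toy_shutdown_checkpoint(bool slots_first, bool failover_slot_dirty)
{
    XLogRecPtr redo;
    XLogRecPtr ckpt_start;

    if (slots_first)
        toy_checkpoint_slots(failover_slot_dirty);

    /* The redo pointer is fixed here; nothing should write WAL after this. */
    redo = next_insert_pos;

    if (!slots_first)
        toy_checkpoint_slots(failover_slot_dirty);

    /* The checkpoint record itself is the only WAL expected after redo. */
    ckpt_start = toy_xlog_insert();
    assert(redo <= ckpt_start);     /* WAL positions only move forward */

    return redo == ckpt_start;
}
```

In this model, toy_shutdown_checkpoint(false, true) fails the check because
the slot record lands between the redo pointer and the checkpoint record,
while toy_shutdown_checkpoint(true, true) passes. Whether the real fix is
this simple is exactly what needs investigating.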

I really want to focus on the first patch, timeline following for logical
slots. That part is much less invasive and is useful stand-alone. I'll move
it to a separate CF entry and post it to a separate thread as I think it
needs consideration independently of failover slots.

(BTW, the slot docs promise that slots will replay a change exactly once,
but this is not correct and the client must keep track of replay position.
I'll post a patch to correct it separately).
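
As a rough illustration of what "keep track of replay position" could mean on
the client side, here is a minimal sketch in C. It is not libpq or any real
client API: ToyClient and toy_client_receive are hypothetical names, and it
assumes changes arrive tagged with a monotonically increasing commit LSN that
the client stores durably alongside the data it applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's type of the same name */

/* Hypothetical client-side state; a real client would persist this. */
typedef struct ToyClient
{
    XLogRecPtr last_applied_lsn;    /* highest commit LSN already applied */
    int        applied_count;       /* number of changes actually applied */
} ToyClient;

/*
 * Handle one change from the slot. Returns true if the change was applied,
 * false if it was a replay of something already applied and was skipped.
 */
static bool
toy_client_receive(ToyClient *client, XLogRecPtr commit_lsn)
{
    assert(client != NULL);

    /* The slot may send a change again after a restart; skip duplicates. */
    if (commit_lsn <= client->last_applied_lsn)
        return false;

    client->applied_count++;
    client->last_applied_lsn = commit_lsn;
    return true;
}
```

The point of the sketch is only that deduplication is the client's job: the
slot guarantees changes are not lost, not that each arrives exactly once.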

> There are a couple of incorrect comments
>

Thanks, will amend.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
