Re: WIP: Failover Slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-23 10:01:42
Message-ID: CAMsr+YGoaAVhOC2ja+pZbHs3Zux5W4G5JtDMZ-XbYDTRmM-0YA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 22 February 2016 at 23:39, Oleksii Kliukin <alexk(at)hintbits(dot)com> wrote:

> What it’s doing is calling pg_basebackup first to initialize the replica,
> and that actually failed with:
>
> _basebackup: unexpected termination of replication stream: ERROR:
> requested WAL segment 000000010000000000000000 has already been removed
>
> The segment name definitely looks bogus to me.
>
> The actual command causing the failure was an attempt to clone the replica
> using pg_basebackup, turning on xlog streaming:
>
> pg_basebackup --pgdata data/postgres1 --xlog-method=stream
> --dbname="host=localhost port=5432 user=replicator”
>
> I checked the same command against the git master without the patches
> applied and could not reproduce this problem there.
>

That's a bug. In testing whether we need to return a lower LSN for minimum
WAL for BASE_BACKUP it failed to properly test for InvalidXLogRecPtr . Good
catch.

> On the code level, I have no comments on 0001, it’s well documented and I
> have no questions about the approach, although I might be not too
> knowledgable to judge the specifics of the implementation.
>

The first patch is the most important IMO, and the one I think needs the
most thought since it's ... well, timelines aren't simple.

> slots.c:294
> elog(LOG, "persistency is %i", (int)slot->data.persistency);
>
> Should be changed to DEBUG?
>

That's an escapee log statement I thought I'd already rebased out. Well
spotted, fixed.

>
> slot.c:468
> Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?
>

That's an editing error on my part that I'll reverse. Since the prototype
declares (void) it doesn't matter, but it's a pointless change. Fixed.

> walsender.c: 1509 at PhysicalConfirmReceivedLocation
>
> I’ve noticed a comment stating that we don’t need to call
> ReplicationSlotSave(), but that pre-dated the WAL-logging of replication
> slot changes. Don’t we need to call it now, the same way it’s done for
> the logical slots in logical.c:at LogicalConfirmReceivedLocation?
>

No, it's safe here. All we must ensure is that a slot is advanced on the
replica when it's advanced on the master. For physical slots even that's a
weak requirement, we just have to stop them from falling *too* far behind
and causing too much xlog retention. For logical slots we should ensure we
advance the slot on the replica before any vacuum activity that might
remove catalog tuples still needed by that slot gets replayed. Basically
the difference is that logical slots keep track of the catalog xmin too, so
they have (slightly) stricter requirements.

This patch doesn't touch either of those functions except for
renaming ReplicationSlotsComputeRequiredLSN
to ReplicationSlotsUpdateRequiredLSN . Which, by the way, I really don't
like doing, but I couldn't figure out a name to give the function that
computes-and-returns the required LSN that wouldn't be even more confusing
in the face of having a ReplicationSlotsComputeRequiredLSN function as
well. Ideas welcome.

Updated patch

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2016-02-23 10:30:01 Re: WIP: Failover Slots
Previous Message Kyotaro HORIGUCHI 2016-02-23 08:44:44 Re: Support for N synchronous standby servers - take 2