WIP: Failover Slots

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP: Failover Slots
Date: 2016-01-02 00:50:15
Message-ID: CANP8+j+qu7NV6QyaN4QgHo9kCD9fg0HhbwFFLpRuFA_6niXZ9g@mail.gmail.com
Lists: pgsql-hackers

Failover Slots

If Logical Decoding is taking place from a master, then if we fail over and
create a new master we would like to continue the Logical Decoding from the
new master. To facilitate this, we introduce the concept of “Failover
Slots”, which are slots whose updates generate WAL, so that any downstream
standby can continue to produce logical decoding output from the named
plugin.

In the current patch, any slot defined on a master will generate WAL,
leading to a pending-slot being present on all standby nodes. When a
standby is promoted, the slot becomes usable and will have the properties
as of the last fsync on the master.

Failover slots are not accessible until promotion of the standby. Logical
slots from a standby look like a significant step beyond this and will not
be part of this patch.

The internal design is fairly clear, using a new rmgr. No real problems
have emerged so far.

Patch is WIP, posted for comment, so you can see where I'm going.

I'm expecting to have a working version including timeline following for
9.6.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
failover_slots.v2.patch application/octet-stream 25.3 KB

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-20 13:02:54
Message-ID: CAMsr+YFsAbkfqDQ=qQi4pYARfoa5MZ1zgDRSEkH-oOR5Eh=WxA@mail.gmail.com
Lists: pgsql-hackers

On 2 January 2016 at 08:50, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> Patch is WIP, posted for comment, so you can see where I'm going.
>

I've applied this on a branch of master and posted it, with some comment
editorialization, as
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots . The tree
will be subject to rebasing.

At present the patch does not appear to work. No slots are visible in the
replica's pg_replication_slots before or after promotion and no slot
information is written to the xlog according to pg_xlogdump:

$ ~/pg/96/bin/pg_xlogdump -r ReplicationSlot 000000010000000000000001 000000010000000000000003
pg_xlogdump: FATAL: error in WAL record at 0/301DDE0: invalid record length at 0/301DE10

so it's very much a WIP. I've read through it and think the idea makes
sense, it's just still missing some pieces...

So. Initial review comments.

This looks pretty incomplete:

+	if (found_duplicate)
+	{
+		LWLockRelease(ReplicationSlotAllocationLock);
+
+		/*
+		 * Do something nasty to the sinful duplicants, but
+		 * take with locking.
+		 */
+
+		LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+	}

... and I'm not sure I understand how the possibility of a duplicate slot
can arise in the first place, since you cannot create a slot on a read
replica. This seems unnecessary.

I'm not sure I understand why, in ReplicationSlotRedoCreate, it's
especially desirable to prevent blocking iteration over
pg_replication_slots or acquiring a slot. When redoing a slot create, isn't
that exactly what we should do? This looks like it's been copied & pasted
verbatim from CheckPointReplicationSlots. There it makes sense, since the
slots may be in active use. During redo it seems reasonable to just
acquire ReplicationSlotControlLock.
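
To illustrate, here's a minimal sketch of the simpler locking I have in
mind for redo - this is my sketch, not the patch's code, and it assumes the
patch's ReplicationSlotRedoCreate() is handed the WAL-logged persistent
data; on-disk persistence, spinlock setup and error handling are omitted:

static void
ReplicationSlotRedoCreate(ReplicationSlotPersistentData *xlrec)
{
	int		i;

	/* During redo there are no competing slot writers, so one lock is enough. */
	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[i];

		if (!slot->in_use)
		{
			/* Copy the logged persistent state into the free shmem slot. */
			memcpy(&slot->data, xlrec, sizeof(ReplicationSlotPersistentData));
			slot->in_use = true;
			break;
		}
	}
	LWLockRelease(ReplicationSlotControlLock);
}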

I'm not a fan of the ReplicationSlotInWAL typedef
for ReplicationSlotPersistentData. Especially as it's only used in the redo
functions but *not* when xlog is written out. I'd like to just replace it.

Purely for convenient testing there's a shell script in the tree -
https://github.com/2ndQuadrant/postgres/blob/dev/failover-slots/failover-slot-test.sh
.
Assuming a patched 9.6 in $HOME/pg/96 it'll do a run-through of the patch.
I'll attempt to convert this to use the new test infrastructure, but needed
something ASAP for development. Posted in case it's useful to others.

Now it's time to look into why this doesn't seem to be generating any xlog
when by rights it seems like it should, and also into exactly when we purge
existing slots on startup / promotion of a read replica.

TL;DR: this doesn't work yet, working on it.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-20 14:59:15
Message-ID: CAMsr+YEmoZhyGgtB+BK7s08BzzEfuBC_2Z7C5JtfFkf-NyLXLw@mail.gmail.com
Lists: pgsql-hackers

On 20 January 2016 at 21:02, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> TL;DR: this doesn't work yet, working on it.
>

Nothing is logged on slot creation because ReplicationSlot->data.failover
is never true. Once that's fixed by - for now - making all slots failover
slots, there's a crash in XLogInsert because of the use of reserved bits in
the XLogInsert info argument. Fix pushed.

I also noticed that slot drops seem to be logged whether or not the
slot is a failover slot. Pushed a fix for that.

The WAL writing is now working. I've made improvements to the rmgr xlogdump
support to make it clearer what's written.

Slots are still not visible on the replica so there's work to do tracing
redo, promotion, slot handling after starting from a basebackup, etc. The
patch is still very much WIP.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-21 16:31:03
Message-ID: CA+TgmoZam8QYYsP=u44pQiF+wqMVzWZrTsYWvQ_Okju65SsxxQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Failover Slots
> In the current patch, any slot defined on a master will generate WAL,
> leading to a pending-slot being present on all standby nodes. When a standby
> is promoted, the slot becomes usable and will have the properties as of the
> last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional. I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-21 19:03:27
Message-ID: CANP8+j+Rgz5t-Yttba1Djv56rfhs8y+Wt8YcEPaK8RYZ-V+voQ@mail.gmail.com
Lists: pgsql-hackers

On 21 January 2016 at 16:31, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Failover Slots
> > In the current patch, any slot defined on a master will generate WAL,
> > leading to a pending-slot being present on all standby nodes. When a
> standby
> > is promoted, the slot becomes usable and will have the properties as of
> the
> > last fsync on the master.
>
> No objection to the concept, but I think the new behavior needs to be
> optional. I am guessing that you are thinking along similar lines,
> but it's not explicitly called out here.
>

I was unsure myself; but making them optional seems reasonable.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-22 07:46:44
Message-ID: CAMsr+YHYwGyk4j3C17PZAMzcH6J01AvnZ2EMj6D+MvP_P87Uzg@mail.gmail.com
Lists: pgsql-hackers

On 22 January 2016 at 00:31, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Failover Slots
> > In the current patch, any slot defined on a master will generate WAL,
> > leading to a pending-slot being present on all standby nodes. When a
> standby
> > is promoted, the slot becomes usable and will have the properties as of
> the
> > last fsync on the master.
>
> No objection to the concept, but I think the new behavior needs to be
> optional. I am guessing that you are thinking along similar lines,
> but it's not explicitly called out here.

Yeah, I think that's the idea too. For one thing we'll want to allow
non-failover slots to continue to be usable from a streaming replica, but
we must ERROR if anyone attempts to attach to and replay from a failover
slot via a replica since we can't write WAL there. Both kinds are needed.

There's a 'failover' bool member in the slot persistent data for that
reason. It's not (yet) exposed via the UI.
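
For clarity, the persistent data would carry that flag roughly like this -
a sketch based on the existing 9.6 ReplicationSlotPersistentData, with the
field comments mine; no claim that this is the patch's exact layout:

typedef struct ReplicationSlotPersistentData
{
	NameData	name;			/* slot identifier */
	Oid			database;		/* database the slot is bound to, or InvalidOid */
	ReplicationSlotPersistency persistency;
	bool		failover;		/* WAL-log this slot so standbys can take over? */
	TransactionId xmin;			/* oldest txid the slot needs retained */
	TransactionId catalog_xmin;	/* oldest txid affecting the catalogs */
	XLogRecPtr	restart_lsn;	/* oldest WAL needed to restart decoding */
	XLogRecPtr	confirmed_flush;	/* client-confirmed position (logical slots) */
	NameData	plugin;			/* output plugin, logical slots only */
} ReplicationSlotPersistentData;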

I presume we'll want to:

* add a new default-false argument is_failover_slot or similar to
pg_create_logical_replication_slot and pg_create_physical_replication_slot

* Add a new optional flag argument FAILOVER to CREATE_REPLICATION_SLOT in
both its LOGICAL and PHYSICAL forms.

... and will be adding that to this patch, barring syntax objections etc.

It's also going to be necessary to handle what happens when a new failover
slot (physical or logical) is created on the master where it conflicts with
the name of a non-failover physical slot that was created on the replica.
In this case I am inclined to terminate any walsender backend for the
replica's slot with a conflict with recovery, remove its slot and replace
it with a failover slot. The failover slot does not permit replay while in
recovery so if the booted-off client reconnects it'll get an ERROR saying
it can't replay from a failover slot. It should be pretty clear to the
admin what's happening between the conflict with recovery and the failover
slot error. There could still be an issue if the client persistently keeps
retrying and successfully reconnects after replica promotion but I don't
see that much can be done about that. The documentation will need to
address the need to try to avoid name conflicts between slots created on
replicas and failover slots on the master.

I'll be working on those docs, on the name conflict handling, and on the
syntax during my coming flight. Toddler permitting.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-22 16:40:24
Message-ID: CA+Tgmoa2Jc2f9SOKoSr+U5LNP6d_rVqYx6ec-2msaTaPCWeARA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 22, 2016 at 2:46 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> It's also going to be necessary to handle what happens when a new failover
> slot (physical or logical) is created on the master where it conflicts with
> the name of a non-failover physical slot that was created on the replica. In
> this case I am inclined to terminate any walsender backend for the replica's
> slot with a conflict with recovery, remove its slot and replace it with a
> failover slot. The failover slot does not permit replay while in recovery so
> if the booted-off client reconnects it'll get an ERROR saying it can't
> replay from a failover slot. It should be pretty clear to the admin what's
> happening between the conflict with recovery and the failover slot error.
> There could still be an issue if the client persistently keeps retrying and
> successfully reconnects after replica promotion but I don't see that much
> can be done about that. The documentation will need to address the need to
> try to avoid name conflicts between slots created on replicas and failover
> slots on the master.

That's not going to win any design-of-the-year awards, but maybe it's
acceptable. It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want. Then again, I'm also not sure it isn't what we want.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-22 16:51:53
Message-ID: 20160122165153.GA1276@awork2.anarazel.de
Lists: pgsql-hackers

On 2016-01-22 11:40:24 -0500, Robert Haas wrote:
> It occurred to me to wonder if it might be better to
> propagate logical slots partially or entirely outside the WAL stream,
> because with this design you will end up with the logical slots on
> every replica, including cascaded replicas, and I'm not sure that's
> what we want. Then again, I'm also not sure it isn't what we want.

Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-25 14:25:43
Message-ID: CAMsr+YFhJL=NZ_A6phjDOHt=V42N_k+W3TU6R-8M1_D-Z1aBuA@mail.gmail.com
Lists: pgsql-hackers

On 23 January 2016 at 00:40, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> It occurred to me to wonder if it might be better to
> propagate logical slots partially or entirely outside the WAL stream,
> because with this design you will end up with the logical slots on
> every replica, including cascaded replicas, and I'm not sure that's
> what we want. Then again, I'm also not sure it isn't what we want.
>

I think it's the most sensible default if there's only going to be one
choice to start with. It's consistent with what we do elsewhere with
replicas so there won't be any surprises. Failover slots are a fairly
simple feature that IMO just makes slots behave more like you might expect
them to do in the first place.

I'm pretty hesitant to start making cascaded replicas different to each
other just for slots. There are lots of other things where variation
between replicas would be lovely - the most obvious of which is omitting
some databases from some replicas. Right now we have a single data stream,
WAL, that goes to every replica. If we're going to change that I'd really
like to address it in a way that'll meet future needs like selective
physical replication too. I also think we'd want to deal with the problem
of identifying and labeling nodes to do a good job of selective replication
of slots.

I'd like to get failover slots in place for 9.6 since they're fairly
self-contained and meet an immediate need: allowing replication using slots
(physical or logical) to follow a failover event.

After that I want to add logical decoding support for slot
creation/drop/advance. So a logical decoding plugin can mirror logical slot
state on another node. It wouldn't be useful for physical slots, of course,
but it'd allow failover between logical replicas where we can do cool
things like replicate just a subset of DBs/tables/etc. (A mapping of
upstream to downstream LSNs would be required, but hopefully not too hard).
That's post-9.6 work and separate to failover slots, though dependent on
them for the ability to decode slot changes from WAL.

Down the track I'd very much like to be less dependent on forcing
everything through WAL with a single timeline. I agree with Andres that
being able to use failover slots on replicas would be good, and that it's
not possible when we use WAL to track the info. I just think it's a massive
increase in complexity and scope and I'd really like to be able to follow
failover the simple way first, then enhance it.

Nothing stops us changing failover slots in 9.7+ from using WAL to some
other mechanism that we design carefully at that time. We can extend the
walsender command set for physical rep at a major version bump with no
major issues, including adding to the streaming rep protocol. There's lots
to figure out though, including how to maintain slot changes in a strict
ordering with WAL, how to store and read the info, etc.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-01-25 19:34:31
Message-ID: CAMsr+YFXqkAUVF8p6KxkYX1thAddWifm5SeW_8xTJJF8wtC4PA@mail.gmail.com
Lists: pgsql-hackers

On 23 January 2016 at 00:51, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Not propagating them through the WAL also has the rather large advantage
> of not barring the way to using such slots on standbys.
>

Yeah. So you could have a read-replica that has a slot and it has child
nodes you can fail over to, but you don't have to have the slot on the
master.

I don't personally find that to be a particularly compelling thing that
says "we must have this" ... but maybe I'm not seeing the full
significance/advantages.

> I think it's technically quite possible to maintain the required
> resources on multiple nodes. The question is how would you configure on
> which nodes the resources need to be maintained? I can't come up with a
> satisfying scheme...
>

That's part of it. Also the mechanism by which we actually replicate them -
protocol additions for the walsender protocol, how to reliably send
something that doesn't have an LSN, etc. It might be fairly simple, I
haven't thought about it deeply, but I'd rather not go there until the
basics are in place.

BTW, I'm keeping a working tree at
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots . Subject
to rebasing, history not clean. It has a test script in it that'll go away
before patch posting.

Current state needs work to ensure that on-disk and in-memory
representations are kept in sync, but is getting there.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Re: WIP: Failover Slots
Date: 2016-01-26 16:09:13
Message-ID: CAMsr+YF+wm=5X3miDEGFmwxH1a8KcLyRq-F8oGDV2A-GhmfTPQ@mail.gmail.com
Lists: pgsql-hackers

Hi all

Here's v2 of the failover slots patch. It replicates a logical slot to a
physical streaming replica downstream, keeping the slots in sync. After the
downstream is promoted a client can replay from the logical slot.

UI to allow creation of non-failover slots is pending.

There's more testing to do to cover all the corners: drop slots, drop and
re-create, name conflicts between downstream !failover slots and upstream
failover slots, etc.

There's also a known bug where WAL isn't correctly retained for a slot
that was created before a basebackup; I'll fix that in a revision shortly.

I'm interested in ideas on how to better test this.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Implement-failover-slots.patch text/x-patch 32.9 KB

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Re: WIP: Failover Slots
Date: 2016-01-27 09:39:46
Message-ID: CAMsr+YFLiz3kOZ0FPYDPjU=O-uuh+qet=74UMWDwOL6NvoPRLg@mail.gmail.com
Lists: pgsql-hackers

Hi all

Here's v3 of failover slots.

It doesn't add the UI yet, but it's now functionally complete except for
timeline following for logical slots, and I have a plan for that.

Attachment Content-Type Size
failover-slots-v3.patch text/x-patch 35.0 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-02-04 03:16:04
Message-ID: CA+Tgmobo3ucC+baYjGukyBMnRDXkpaQHFDO_2At_CC-6vjM-eg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 22, 2016 at 11:51 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think it's technically quite possible to maintain the required
> resources on multiple nodes. The question is how would you configure on
> which nodes the resources need to be maintained? I can't come up with a
> satisfying scheme...

For this to work, I feel like the nodes need names, and a directory
that tells them how to reach each other.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Re: WIP: Failover Slots
Date: 2016-02-15 05:52:03
Message-ID: CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com
Lists: pgsql-hackers

Hi all

I've attached the completed failover slots patch.

There were quite a few details to address, so this is split into a
three-patch series. I've also attached the test script I've been using.

We need failover slots to allow a logical decoding client to follow
physical failover, allowing HA "above" the decoding client.

For reviewers there's some additional explanation for why a few of the
changes are made the way they are on the wiki:
https://wiki.postgresql.org/wiki/Failover_slots . See "Patch notes".

The tagged tree for this submission is at
https://github.com/2ndQuadrant/postgres/tree/failover-slots-v4 .

I intend to backport this to 9.4 and 9.5 (though of course not for mainline
submission!).

Patch 1: add support for timeline following in logical decoding.

This is necessary to make failover slots useful. Otherwise decoding from a
slot will fail after failover because the server tries to read WAL with
ThisTimeLineID but the needed archives are on a historical timeline. While
the smallest part of the patch series, this was the most complex.

Patch 2: Failover slots core

* Add WAL logging and redo for failover slots

* copy pg_replslot/ in pg_basebackup

* drop non-failover slots on archive recovery startup

* expand the amount of WAL copied by pg_basebackup so failover slots are
usable after restore

* if a failover slot is created on the primary with the same name as an
existing non-failover slot on replica(s), kill any client connected to the
replica's slot and drop the replica's slot during redo

* Adds a new backup label entry MIN FAILOVER SLOT LSN to generated backup
label files if failover slots are present. This allows utilities like
pgbarman, omnipitr, etc. to know to retain more WAL to preserve the
function of failover slots.

* Return a lower LSN from pg_start_backup() and BASE_BACKUP
if needed to ensure that tools copy the extra WAL required by failover
slots during a base backup.

Relies on timeline following for logical decoding slots to be useful.

Does not add UI (function arguments, walsender syntax, changes to views,
etc) to expose failover slots to users. They can only be used by extensions
that call ReplicationSlotCreate directly.
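
As a rough illustration, an extension might do something like the following
from a backend; note the extra failover argument to ReplicationSlotCreate()
is an assumption based on the description above, not a settled signature:

	/* Hypothetical extension code: create a persistent physical failover slot. */
	ReplicationSlotCreate("my_failover_slot",
						  false,			/* not database-specific */
						  RS_PERSISTENT,
						  true);			/* failover: WAL-log the slot */
	ReplicationSlotRelease();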

Patch 3: User interface for failover slots

The 3rd patch adds the UI to expose failover slots to the user:

- A 'failover' boolean argument, default false, to
pg_create_physical_replication_slot(...) and
pg_create_logical_replication_slot(...)
- A new FAILOVER option to CREATE_REPLICATION_SLOT on the walsender
protocol
- A new 'failover' boolean column in pg_catalog.pg_replication_slots
- SGML documentation changes for the new options and for failover slots in
general

Limited tests are also added in this patch since not much of this is really
testable by pg_regress. I've attached my local test script in case it's of
interest/use to anyone.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Allow-logical-slots-to-follow-timeline-switches.patch text/x-patch 19.1 KB
0002-Allow-replication-slots-to-follow-failover.patch text/x-patch 43.7 KB
0003-Add-the-UI-and-documentation-for-failover-slots.patch text/x-patch 35.8 KB
failover-slot-test application/octet-stream 10.3 KB

From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-15 17:23:01
Message-ID: CALLjQTSCHvcsF6y7=ZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw@mail.gmail.com
Lists: pgsql-hackers

Hi,

here is my code level review:

0001:
This one looks ok except for broken indentation in the new notes in
xlogreader.c and .h. It's maybe slightly overdocumented but given the
complicated way the timeline reading works it's probably warranted.

0002:
+	/*
+	 * No way to wait for the process since it's not a child
+	 * of ours and there's no latch to set, so poll.
+	 *
+	 * We're checking this without any locks held, but
+	 * we'll recheck when we attempt to drop the slot.
+	 */
+	while (slot->in_use && slot->active_pid == active_pid
+		   && max_sleep_micros > 0)
+	{
+		usleep(micros_per_sleep);
+		max_sleep_micros -= micros_per_sleep;
+	}

Not sure I buy this, what about postmaster crashes and fast shutdown
requests etc. Also you do usleep for 10s which is quite long. I'd
prefer the classic wait for latch with timeout and pg crash check
here. And even if we go with usleep, then it should be 1s not 10s and
pg_usleep instead of usleep.
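
Something roughly like this is what I mean (just a sketch of the
suggestion, with a placeholder 1s timeout; not code from the patch):

	while (slot->in_use && slot->active_pid == active_pid)
	{
		int		rc;

		/* Sleep until signalled, a 1s timeout, or postmaster death. */
		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
					   1000L);

		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);

		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}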

0003:
There are a lot of documentation improvements here that are not related
to failover slots or timeline following; it might be a good idea to
split those into a separate patch as they are separately useful IMHO.

Other than that it looks good to me.

About the other things discussed in this thread: yes, it makes sense
in certain situations to handle this outside of WAL, and that does
require notions of nodes, etc. That being said, the timeline
following is needed even if this is handled outside of WAL. And once
timeline following is in, the slots can be handled by the replication
solution itself, which is good. But I think the failover slots are
still a good thing to have - they provide an HA solution for anything
that uses slots, and that includes physical replication as well. If a
specific logical replication solution wants to handle this itself
outside of WAL for some reason, it can create a non-failover slot, so
in my opinion we ideally need both types of slots (and that's exactly
what this patch gives us).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-16 08:11:56
Message-ID: CAMsr+YH-ZyzsL8D7pUFbm5u9nXqdHx40CJ9ueVy-Sd3xp_ObZg@mail.gmail.com
Lists: pgsql-hackers

On 16 February 2016 at 01:23, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:

> Hi,
>
> here is my code level review:
>
> 0001:
> This one looks ok except for broken indentation in the new notes in
> xlogreader.c and .h.

I don't see the broken indentation. Not sure what you mean.

> +	while (slot->in_use && slot->active_pid == active_pid
> +		   && max_sleep_micros > 0)
> +	{
> +		usleep(micros_per_sleep);
> +		max_sleep_micros -= micros_per_sleep;
> +	}
>
> Not sure I buy this, what about postmaster crashes and fast shutdown
> requests etc.

Yeah. I was thinking - incorrectly - that I couldn't use a latch during
recovery.

Revision attached. There was a file missing from the patch too.

> 0003:
> There is a lot of documentation improvements here that are not related
> to failover slots or timeline following, it might be good idea to
> split those into separate patch as they are separately useful IMHO.
>

Yeah, probably worth doing. We'll see how this patch goes.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Allow-logical-slots-to-follow-timeline-switches.patch text/x-patch 19.1 KB
0002-Allow-replication-slots-to-follow-failover.patch text/x-patch 48.9 KB
0003-Add-the-UI-and-documentation-for-failover-slots.patch text/x-patch 35.4 KB

From: Oleksii Kliukin <alexk(at)hintbits(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-22 15:39:57
Message-ID: FA68178E-F0D1-47F6-9791-8A3E2136C119@hintbits.com
Lists: pgsql-hackers

Hi,

> On 16 Feb 2016, at 09:11, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
>
>
> Revision attached. There was a file missing from the patch too.
>

All attached patches apply normally. I only took a look at the first two, but also tried to run Patroni with the modified version to check whether basic replication works.

What it’s doing is calling pg_basebackup first to initialize the replica, and that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR: requested WAL segment 000000010000000000000000 has already been removed

The segment name definitely looks bogus to me.

The actual command causing the failure was an attempt to clone the replica using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream --dbname="host=localhost port=5432 user=replicator"

I checked the same command against the git master without the patches applied and could not reproduce this problem there.

On the code level, I have no comments on 0001; it’s well documented and I have no questions about the approach, although I might not be knowledgeable enough to judge the specifics of the implementation.

On the 0002, there are a few rough edges:

slots.c:294
elog(LOG, "persistency is %i", (int)slot->data.persistency);

Should be changed to DEBUG?

slot.c:468
Why did you drop "void" as a parameter type of ReplicationSlotDropAcquired?

walsender.c:1509 at PhysicalConfirmReceivedLocation

I’ve noticed a comment stating that we don’t need to call ReplicationSlotSave(), but that pre-dated the WAL-logging of replication slot changes. Don’t we need to call it now, the same way it’s done for the logical slots in logical.c at LogicalConfirmReceivedLocation?

Kind regards,
--
Oleksii


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-23 10:01:42
Message-ID: CAMsr+YGoaAVhOC2ja+pZbHs3Zux5W4G5JtDMZ-XbYDTRmM-0YA@mail.gmail.com
Lists: pgsql-hackers

On 22 February 2016 at 23:39, Oleksii Kliukin <alexk(at)hintbits(dot)com> wrote:

> What it’s doing is calling pg_basebackup first to initialize the replica,
> and that actually failed with:
>
> _basebackup: unexpected termination of replication stream: ERROR:
> requested WAL segment 000000010000000000000000 has already been removed
>
> The segment name definitely looks bogus to me.
>
> The actual command causing the failure was an attempt to clone the replica
> using pg_basebackup, turning on xlog streaming:
>
> pg_basebackup --pgdata data/postgres1 --xlog-method=stream
> --dbname="host=localhost port=5432 user=replicator”
>
> I checked the same command against the git master without the patches
> applied and could not reproduce this problem there.
>

That's a bug. In testing whether we need to return a lower LSN for minimum
WAL for BASE_BACKUP, the code failed to properly test for InvalidXLogRecPtr.
Good catch.
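
The missing guard is essentially of this shape (an illustrative sketch
only; failover_slots_min_lsn is a placeholder name rather than the patch's
actual variable):

	/*
	 * Only clamp the backup's starting WAL position if some failover slot
	 * actually needs older WAL; InvalidXLogRecPtr means no slot needs any.
	 */
	if (!XLogRecPtrIsInvalid(failover_slots_min_lsn) &&
		failover_slots_min_lsn < startptr)
		startptr = failover_slots_min_lsn;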

> On the code level, I have no comments on 0001, it’s well documented and I
> have no questions about the approach, although I might be not too
> knowledgable to judge the specifics of the implementation.
>

The first patch is the most important IMO, and the one I think needs the
most thought since it's ... well, timelines aren't simple.

> slots.c:294
> elog(LOG, "persistency is %i", (int)slot->data.persistency);
>
> Should be changed to DEBUG?
>

That's an escapee log statement I thought I'd already rebased out. Well
spotted, fixed.

>
> slot.c:468
> Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?
>

That's an editing error on my part that I'll reverse. Since the prototype
declares (void) it doesn't matter, but it's a pointless change. Fixed.

> walsender.c: 1509 at PhysicalConfirmReceivedLocation
>
> I’ve noticed a comment stating that we don’t need to call
> ReplicationSlotSave(), but that pre-dated the WAL-logging of replication
> slot changes. Don’t we need to call it now, the same way it’s done for
> the logical slots in logical.c:at LogicalConfirmReceivedLocation?
>

No, it's safe here. All we must ensure is that a slot is advanced on the
replica when it's advanced on the master. For physical slots even that's a
weak requirement, we just have to stop them from falling *too* far behind
and causing too much xlog retention. For logical slots we should ensure we
advance the slot on the replica before any vacuum activity that might
remove catalog tuples still needed by that slot gets replayed. Basically
the difference is that logical slots keep track of the catalog xmin too, so
they have (slightly) stricter requirements.

This patch doesn't touch either of those functions except for
renaming ReplicationSlotsComputeRequiredLSN
to ReplicationSlotsUpdateRequiredLSN . Which, by the way, I really don't
like doing, but I couldn't figure out a name to give the function that
computes-and-returns the required LSN that wouldn't be even more confusing
in the face of having a ReplicationSlotsComputeRequiredLSN function as
well. Ideas welcome.

Updated patch

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-23 10:30:01
Message-ID: CAMsr+YFGQVMUUKkZO7QMntLuRBCR7ao7_2e1gJKfynorQN+3Xg@mail.gmail.com
Lists: pgsql-hackers

> Updated patch
>

... attached

I've split it up a bit more too, so it's easier to tell what change is for
what and fixed the issues mentioned by Oleksii. I've also removed some
unrelated documentation changes.

Patch 0001, timeline switches for logical decoding, is unchanged since the
last post.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Allow-logical-slots-to-follow-timeline-switches.patch text/x-patch 19.1 KB
0002-Allow-replication-slots-to-follow-failover.patch text/x-patch 45.6 KB
0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch text/x-patch 3.6 KB
0004-Add-the-UI-and-for-failover-slots.patch text/x-patch 11.7 KB
0005-Document-failover-slots.patch text/x-patch 8.4 KB
0006-Add-failover-to-pg_replication_slots.patch text/x-patch 9.6 KB
0007-not-for-inclusion-Test-script-for-failover-slots.patch text/x-patch 11.1 KB

From: Oleksii Kliukin <alexk(at)hintbits(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-23 19:53:08
Message-ID: 9B503EB5-676A-4258-9F78-27FC583713FE@hintbits.com
Lists: pgsql-hackers


> On 23 Feb 2016, at 11:30, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
>
> Updated patch
>
> ... attached
>
> I've split it up a bit more too, so it's easier to tell what change is for what and fixed the issues mentioned by Oleksii. I've also removed some unrelated documentation changes.
>
> Patch 0001, timeline switches for logical decoding, is unchanged since the last post.

Thank you, I read the user-interface part now, looks good to me.

I found the following issue when shutting down a master with a connected replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 CET,,0,DEBUG,00000,"archived transaction log file ""000000010000000000000003""",,,,,,,,,""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 CET,,0,PANIC,XX000,"concurrent transaction log activity while database system is shutting down",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal 6: Abort trap",,,,,,,,,""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,

Basically, the issue is that CreateCheckPoint calls CheckPointReplicationSlots, which currently produces WAL, and this violates the assumption at line xlog.c:8492

	if (shutdown && checkPoint.redo != ProcLastRecPtr)
		ereport(PANIC,
				(errmsg("concurrent transaction log activity while database system is shutting down")));

There are a couple of incorrect comments

logical.c: 90
There's some things missing to allow this: I think it should be “There are some things missing to allow this:”

logical.c:93
"we need we would need”

slot.c:889
"and there's no latch to set, so poll” - clearly there is a latch used in the code below.

Also, slot.c:301 emits an error message for an attempt to create a failover slot on the replica after acquiring and releasing the locks and getting the shared memory slot, even though all the data to check for this condition is available right at the beginning of the function. Shouldn’t it avoid the extra work if it’s not needed?

>
>
> --
> Craig Ringer http://www.2ndQuadrant.com/ <http://www.2ndquadrant.com/>
> PostgreSQL Development, 24x7 Support, Training & Services

Kind regards,
--
Oleksii


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-02-24 10:02:09
Message-ID: CAMsr+YEacvQiHb1HG1KVwGb2QUBSmku5FeaFT4RFgttoG4a1Rw@mail.gmail.com
Lists: pgsql-hackers

On 24 February 2016 at 03:53, Oleksii Kliukin <alexk(at)hintbits(dot)com> wrote:

>
> I found the following issue when shutting down a master with a connected
> replica that uses a physical failover slot:
>
> 2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07
> CET,,0,DEBUG,00000,"performing replication slot checkpoint",,,,,,,,,""
> 2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07
> CET,,0,DEBUG,00000,"archived transaction log file
> ""000000010000000000000003""",,,,,,,,,""
> 2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07
> CET,,0,PANIC,XX000,"concurrent transaction log activity while database
> system is shutting down",,,,,,,,,""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07
> CET,,0,LOG,00000,"checkpointer process (PID 54998) was terminated by signal
> 6: Abort trap",,,,,,,,,""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07
> CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,
>
>
Odd that I didn't see that in my testing. Thanks very much for this. I
concur with your explanation.

Basically, the issue is that CreateCheckPoint calls
> CheckpointReplicationSlots, which currently produces WAL, and this violates
> the assumption at line xlog.c:8492
>
> if (shutdown && checkPoint.redo != ProcLastRecPtr)
> ereport(PANIC,
> (errmsg("concurrent transaction log activity while database system is
> shutting down")));
>

Interesting problem.

It might be reasonably harmless to omit writing WAL for failover slots
during a shutdown checkpoint. We're using WAL to move data to the replicas,
but we don't really need it for local redo and correctness on the master.
The trouble is that we do of course redo failover slot updates on the
master, and we don't really want a slot to go backwards relative to its
on-disk state before a crash. That's not too harmful - but it could lead
to us losing a catalog_xmin increase for a slot, so the slot thinks catalog
tuples are still readable when they could actually have been vacuumed away.

CheckPointReplicationSlots notes that:

* This needn't actually be part of a checkpoint, but it's a convenient
* location.

... and I suspect the answer there is simply to move the slot checkpoint to
occur prior to the WAL checkpoint rather than during it. I'll investigate.

I really want to focus on the first patch, timeline following for logical
slots. That part is much less invasive and is useful stand-alone. I'll move
it to a separate CF entry and post it to a separate thread as I think it
needs consideration independently of failover slots.

(BTW, the slot docs promise that slots will replay a change exactly once,
but this is not correct and the client must keep track of replay position.
I'll post a patch to correct it separately).

> There are a couple of incorrect comments
>

Thanks, will amend.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-03-04 16:14:09
Message-ID: CAMsr+YFK40C2KqZFEkY4spOC8XLo9+8HZQTvGQH6WPzWxax85Q@mail.gmail.com
Lists: pgsql-hackers

On 24 February 2016 at 18:02, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> I really want to focus on the first patch, timeline following for logical
> slots. That part is much less invasive and is useful stand-alone. I'll move
> it to a separate CF entry and post it to a separate thread as I think it
> needs consideration independently of failover slots.
>

Just an update on the failover slots status: I've moved timeline following
for logical slots into its own patch set and CF entry and added a bunch of
tests.

https://commitfest.postgresql.org/9/488/

Some perl TAP test framework enhancements were needed for that; they're
mostly committed now with a few pending.

https://commitfest.postgresql.org/9/569/

Once some final changes are made to the tests for timeline following I'll
address the checkpoint issue in failover slots by doing the checkpoint of
slots at the start of a checkpoint/restartpoint, while we can still write
WAL. Per the comments in CheckPointReplicationSlots it's mostly done in a
checkpoint currently for convenience.

Then I'll write some TAP tests for failover slots and submit an updated
patch for them, by which time hopefully timeline following for logical
slots will be committed.

In other words this patch isn't dead, the foundations are just being
rebased out from under it.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-03-15 13:40:59
Message-ID: CAMsr+YHcAHoWKSnWQdYSWDsLZNpmP3BFHr4n7_hEwv4g1T7j8g@mail.gmail.com
Lists: pgsql-hackers

Here's a new failover slots rev, addressing the issues Oleksii Kliukin
raised and adding a bunch of TAP tests.

In particular, for the checkpoint issue I landed up moving
CheckPointReplicationSlots to occur at the start of a checkpoint, before
writing WAL is prohibited. As the comments note it's just a convenient
place and time to do it anyway. That means it has to be called separately
at a restartpoint, but I don't think that's a biggie.

The tests for this took me quite a while, much (much) longer than the code
changes.

I split the patch up a bit more too so individual changes are more
logically grouped and clearer. I expect it'd be mostly or entirely squashed
for commit.

Attachment Content-Type Size
0001-Allow-replication-slots-to-follow-failover.patch text/x-patch 47.0 KB
0002-Update-decoding_failover-tests-for-failover-slots.patch text/x-patch 3.2 KB
0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch text/x-patch 3.6 KB
0004-Add-the-UI-and-for-failover-slots.patch text/x-patch 11.7 KB
0005-Document-failover-slots.patch text/x-patch 8.4 KB
0006-Add-failover-to-pg_replication_slots.patch text/x-patch 9.6 KB
0007-Introduce-TAP-recovery-tests-for-failover-slots.patch text/x-patch 14.2 KB

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-03-16 10:04:44
Message-ID: CAMsr+YEWXGXx=GjcW_4O-O5+YRMNm0c=G-MZnofwxU-=cdQ-8w@mail.gmail.com
Lists: pgsql-hackers

On 15 March 2016 at 21:40, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> Here's a new failover slots rev, addressing the issues Oleksii Kliukin
> raised and adding a bunch of TAP tests.
>

Ahem, just found an issue here. I'll need to send another revision.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-03-17 08:34:27
Message-ID: CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com
Lists: pgsql-hackers

OK, here's the latest failover slots patch, rebased on top of today's
master plus, in order:

- Dirty replication slots when confirm_lsn is changed
(
http://www.postgresql.org/message-id/CAMsr+YHJ0OyCUG2zbyQpRHxMcjnkt9D57mSxDZgWBKcvx3+r-w@mail.gmail.com
)

- logical decoding timeline following
(
http://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com
)

The full tree is at
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots if you want
to avoid the fiddling around required to apply the patch series.

Attachment Content-Type Size
0001-Allow-replication-slots-to-follow-failover.patch text/x-patch 48.0 KB
0002-Update-decoding_failover-tests-for-failover-slots.patch text/x-patch 3.2 KB
0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch text/x-patch 3.6 KB
0004-Add-the-UI-and-for-failover-slots.patch text/x-patch 11.7 KB
0005-Document-failover-slots.patch text/x-patch 8.4 KB
0006-Add-failover-to-pg_replication_slots.patch text/x-patch 9.6 KB
0007-Introduce-TAP-recovery-tests-for-failover-slots.patch text/x-patch 25.4 KB

From: Oleksii Kliukin <alexk(at)hintbits(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-04-04 20:19:32
Message-ID: 6458A405-0C44-4A6E-9976-1FC696642666@hintbits.com
Lists: pgsql-hackers

Hi,

> On 17 Mar 2016, at 09:34, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
> OK, here's the latest failover slots patch, rebased on top of today's master plus, in order:
>
> - Dirty replication slots when confirm_lsn is changed
> (http://www.postgresql.org/message-id/CAMsr+YHJ0OyCUG2zbyQpRHxMcjnkt9D57mSxDZgWBKcvx3+r-w@mail.gmail.com)
>
> - logical decoding timeline following
> (http://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com)
>
> The full tree is at https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots if you want to avoid the fiddling around required to apply the patch series.
>
>

Thank you for the update. I’ve got some rejects when applying the 0001-Allow-replication-slots-to-follow-failover.patch after the “Dirty replication slots when confirm_lsn is changed” changes. I think it should be rebased against master (this might be a consequence of the “logical slots follow the timeline” patch being committed).

patch -p1 <~/git/pg/patches/failover-slots/v6/0001-Allow-replication-slots-to-follow-failover.patch
patching file src/backend/access/rmgrdesc/Makefile
Hunk #1 FAILED at 10.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/rmgrdesc/Makefile.rej
patching file src/backend/access/rmgrdesc/replslotdesc.c
patching file src/backend/access/transam/rmgr.c
Hunk #1 succeeded at 25 (offset 1 line).
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 6351 (offset 3 lines).
Hunk #2 succeeded at 8199 (offset 14 lines).
Hunk #3 succeeded at 8645 (offset 14 lines).
Hunk #4 succeeded at 8718 (offset 14 lines).
patching file src/backend/commands/dbcommands.c
patching file src/backend/replication/basebackup.c
patching file src/backend/replication/logical/decode.c
Hunk #1 FAILED at 143.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/replication/logical/decode.c.rej
patching file src/backend/replication/logical/logical.c
patching file src/backend/replication/slot.c
patching file src/backend/replication/slotfuncs.c
patching file src/backend/replication/walsender.c
patching file src/bin/pg_xlogdump/replslotdesc.c
patching file src/bin/pg_xlogdump/rmgrdesc.c
Hunk #1 succeeded at 27 (offset 1 line).
patching file src/include/access/rmgrlist.h
Hunk #1 FAILED at 45.
1 out of 1 hunk FAILED -- saving rejects to file src/include/access/rmgrlist.h.rej
patching file src/include/replication/slot.h
patching file src/include/replication/slot_xlog.h
can't find file to patch at input line 1469
Perhaps you used the wrong -p or --strip option?

--
Oleksii


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Oleksii Kliukin <alexk(at)hintbits(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 02:41:54
Message-ID: CAMsr+YFt4iDZQ8pkKQ+Rd37howXE57rF3O1m+a9S=WZrXvF3qg@mail.gmail.com
Lists: pgsql-hackers

On 5 April 2016 at 04:19, Oleksii Kliukin <alexk(at)hintbits(dot)com> wrote:

> Thank you for the update. I’ve got some rejects when applying the
> 0001-Allow-replication-slots-to-follow-failover.patch after the "Dirty
> replication slots when confirm_lsn is changed” changes. I think it should
> be rebased against the master, (might be the consequence of the "logical
> slots follow the timeline” patch committed).
>

I'll rebase it on top of the new master now that timeline following for
logical slots has been committed, and follow up shortly.

That said, I've marked this patch 'returned with feedback' in the CF. It
should possibly actually be 'rejected' given the discussion on the logical
decoding timeline following thread, which points heavily at a different
approach to solving this problem in 9.7.

That doesn't mean nobody can pick it up if they think it's valuable and
want to run with it, but we're very close to feature freeze now.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 09:43:23
Message-ID: CANP8+j+tAHaAtVTxwHtpjEV9gjk+M8CAQFyOPo_=Y5pQ+fEzxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25 January 2016 at 14:25, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

> I'd like to get failover slots in place for 9.6 since they're fairly
> self-contained and meet an immediate need: allowing replication using slots
> (physical or logical) to follow a failover event.
>

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now
follow a failover event?

There are many and varied users of logical decoding now and a fix is
critically important for 9.6.

Do all decoding plugins need to write their own support code??

Please explain how we cope without this, so if a problem remains we can fix
it by the freeze.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 11:49:08
Message-ID: CAMsr+YHYV78q_8gDKOgTNZQD9Lrfwa=5E0kOfFbrjjTDHhX+4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6 April 2016 at 17:43, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 25 January 2016 at 14:25, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>
>
>> I'd like to get failover slots in place for 9.6 since they're fairly
>> self-contained and meet an immediate need: allowing replication using slots
>> (physical or logical) to follow a failover event.
>>
>
> I'm a bit confused about this now.
>
> We seem to have timeline following, yet no failover slot. How do we now
> follow a failover event?
>

>

> There are many and varied users of logical decoding now and a fix is
> critically important for 9.6.
>

I agree with you, but I haven't been able to convince enough people of that.

> Do all decoding plugins need to write their own support code??
>

We'll be able to write a bgworker based extension that handles it by
running in the standby. So no, I don't think so.

> Please explain how we cope without this, so if a problem remains we can
> fix by the freeze.
>

The TL;DR: Create a slot on the master to hold catalog_xmin where the
replica needs it. Advance it using a client or bgworker on the replica based on
the catalog_xmin of the oldest slot on the replica. Copy slot state from
the master using an extension that keeps the slots on the replica
reasonably up to date.

All of this is an ugly workaround for not having true slot failover support.
I'm not going to pretend it's nice, or anything that should go anywhere
near core. Petr outlined the approach we want to take for core in 9.7 on
the logical timeline following thread.

Details:

Logical decoding on a slot can follow timeline switches now - or rather,
the xlogreader knows how to follow timeline switches, and the read page
callback used by logical decoding uses that functionality now.

This doesn't help by itself because slots aren't synced to replicas, so
they're lost on failover promotion.

Nor can a client just create a backup slot for itself on the replica to
be ready for failover:

- it has no way to create a new slot at a consistent point on the replica
since logical decoding isn't supported on replicas yet;
- it can't advance a logical slot on the replica once created since
decoding isn't permitted on a replica, so it can't just decode from the
replica in lockstep with the master;
- it has no way to stop the master from removing catalog tuples still
needed by the slot's catalog_xmin since catalog_xmin isn't propagated from
standby to master.

So we have to help the client out. To do so, we have a
function/worker/whatever on the replica that grabs the slot state from the
master and copies it to the replica, and we have to hold the master's
catalog_xmin down to the catalog_xmin required by the slots on the replica.

Holding the catalog_xmin down is the easier bit. We create a dummy logical
slot on the master, maintained by a function/bgworker/whatever on the
replica. It gets advanced so that its restart_lsn and catalog_xmin are
those of the oldest slot on the replica. We can do that by requesting
replay on it up to the lowest confirmed_lsn of any slot on the
replica. Ugly, but workable. Or we can abuse the infrastructure more deeply
by simply setting the catalog_xmin and restart_lsn on the slot directly,
but I'd rather not.
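
A rough SQL sketch of that dance, purely for illustration (the slot name,
the plugin choice and the literal LSN are placeholders, and it assumes the
confirmed_flush_lsn column that pg_replication_slots exposes):

-- on the master, once: create the dummy hold-down slot
SELECT pg_create_logical_replication_slot('replica_xmin_hold', 'test_decoding');

-- on the replica: find the lowest confirmed LSN among its logical slots
SELECT confirmed_flush_lsn FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY confirmed_flush_lsn LIMIT 1;

-- back on the master, run periodically by the client/bgworker: consume and
-- discard changes up to that LSN so the dummy slot's restart_lsn and
-- catalog_xmin advance and stop pinning resources the replica no longer needs
SELECT pg_logical_slot_get_changes('replica_xmin_hold', '0/16B3748', NULL);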

Just copying slot state is pretty simple too, as at the C level you can
create a physical or logical slot with whatever state you want.

However, that lets you copy/create any number of bogus ones, many of which
will appear to work fine but will be subtly broken. Since the replica is an
identical copy of the master we know that a slot state that was valid on
the master at a given xlog insert lsn is also valid on the replica at the
same replay lsn, but we've got no reliable way to ensure that when the
master updates a slot at LSN A/B the replica also updates the slot at
replay of LSN A/B. That's what failover slots did. Without that we need to
use some external channel - but there's no way to capture knowledge of "at
exactly LSN A/B, master saved a new copy of slot X" since we can't hook
ReplicationSlotSave(). At least we *can* now inject slot state updates as
generic WAL messages though, so we can ensure they happen at exactly the
desired point in replay.

As Andres explained on the timeline following thread it's not safe for the
slot on the replica to be behind the state the slot on the master was in at
the same LSN. At least unless we can protect catalog_xmin via some other
mechanism so we can make sure no catalogs still needed by the slots on the
replica are vacuumed away. It's vital that the catalog_xmin of any slots on
the replica be >= the catalog_xmin the master had for the lowest
catalog_xmin of any of its slots at the same LSN.

So what I figure we'll do is poll slot shmem on the master. When we notice
that a slot has changed we'll dump it into xlog via the generic xlog
mechanism to be applied on the replica, much like failover slots. The slot
update might arrive a bit late on the replica, but that's OK because we're
holding catalog_xmin pinned on the master using the dummy slot.

I don't like it, but I don't have anything better for 9.6.

I'd really like to be able to build a more solid proof of concept that
tests this with a lagging replica, but -ENOTIME before FF.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 13:15:38
Message-ID: CAMsr+YEVW-_wivY8yJQCTiJ20u7V4P_xNWNx55MaJ70mLS7O6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

A few thoughts on failover slots vs the alternative of pushing catalog_xmin
up to the master via a replica's slot and creating independent slots on
replicas.

Failover slots:
---

+ Failover slots are very easy for applications. They "just work" and are
transparent for failover. This is especially great for things that aren't
complex replication schemes and just want to use logical decoding.

+ Applications don't have to know what replicas exist or be able to reach
them; transparent failover is easier.

- Failover slots can't be used from a cascading standby (where we can fail
down to the standby's own replicas) because they have to write WAL to
advance the slot position. They'd have to send the slot position update
"up" to the master then wait to replay it. Not a disaster, though they'd do
extra work on reconnect until a restart_lsn update replayed. Would require
a whole new feedback-like message on the rep protocol, and couldn't work at
all with archive replication. Ugly as hell.

+ Failover slots exist now, and could be added to 9.6.

- The UI for failover slots can't be re-used for the catalog_xmin push-up
approach to allow replay from failover slots on cascading standbys in 9.7+.
There'd be no way to propagate the creation of failover slots "down" the
replication hierarchy that way, especially to archive standbys the way
failover slots do. So it'd be semantically different and couldn't
re-use the FS UI. We'd be stuck with failover slots even if we also did the
other way later.

+ Will work for recovery of a master PITR-restored up to the latest
recovery point

Independent slots on replicas + catalog_xmin push-up
---

With this approach we allow creation of replication slots on a replica
independently of the master. The replica is required to connect to the
master via a slot. We send feedback to the master to advance the replica's
slot on the master to the confirmed_lsn of the most-behind slot on the
replica, thereby pinning the master's catalog_xmin where needed. Or we just
send a new feedback message type that directly sets a catalog_xmin on the
replica's physical slot in the master. Slots are _not_ cloned from master
to replica automatically.

- More complicated for applications to use. They have to create a slot on
each replica that might be failed over to, as well as on the master, and have to
advance all those slots to stop the master from suffering severe catalog
bloat. (But see note below).

- Applications must be able to connect to failover-candidate standbys and
know where they are; it's not automagically handled via WAL. (But see note
below).

- Applications need reconfiguration whenever a standby is rebuilt, moved,
etc. (But see note below).

- Cannot work at all for archive-based replication; it requires a slot from
replica to master.

+ Works with replay from cascading standbys

+ Actually solves one of the problems making logical slots on standbys
unsupported at the moment by giving us a way to pin the master's
catalog_xmin to that needed by a replica.

- Won't work for a standby PITR-restored up to latest.

- Vapourware with zero hope for 9.6

Note: I think the application complexity issues can be solved - to a degree
- by having the replicas run a bgworker-based helper that connects to the
master, clones the master's slots, and then advances them automatically.

Do nothing
---

Drop the idea of being able to follow physical failover on logical slots.

I've already expressed why I think this is a terrible idea. It's hostile to
application developers who'd like to use logical decoding. It makes
integration of logical replication with existing HA systems much harder. It
means we need really solid, performant, well-tested and mature logical rep
based HA before we can take logical rep seriously, which is a long way out
given that we can't do decoding of in-progress xacts, ddl, sequences, ....
etc etc.

Some kind of physical HA for logical slots is needed and will be needed for
some time. Logical rep will be great for selective replication, replication
over WAN, filtered/transformed replication etc. Physical rep is great for
knowing you'll get exactly the same thing on the replica that you have on
the master and it'll Just Work.

In any case, "Do nothing" is the same for 9.6 as pursusing the catalog_xmin
push-up idea; in both cases we don't commit anything in 9.6.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 13:30:21
Message-ID: CANP8+j+rqJCE=wnkRpubWUDom=tJTtdcnF7brjjmHGtzEvpxpA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6 April 2016 at 14:15, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
...

Nice summary

Failover slots are optional. And they work on master.

While the other approach could also work, it will work later and still
require a slot on the master.

=> I don't see why having Failover Slots in 9.6 would prevent us from
having something else later, if someone else writes it.

We don't need to add this to core. Each plugin can independently write its
own failover code. Works, but doesn't seem like the right approach for open
source.

=> I think we should add Failover Slots to 9.6.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 14:17:51
Message-ID: 20160406141751.ijqcere2a4jnlaak@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> On 6 April 2016 at 14:15, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> ...
>
> Nice summary
>
> Failover slots are optional. And they work on master.
>
> While the other approach could also work, it will work later and still
> require a slot on the master.
>
>
> => I don't see why having Failover Slots in 9.6 would prevent us from
> having something else later, if someone else writes it.
>
>
> We don't need to add this to core. Each plugin can independently write is
> own failover code. Works, but doesn't seem like the right approach for open
> source.
>
>
> => I think we should add Failover Slots to 9.6.

Simon, please don't take this personally, given the other ongoing
thread.

I don't think this is commit-ready. For one I think this is
architecturally the wrong choice. But even leaving that fact aside, and
considering this a temporary solution (we can't easily remove), there
appears to have been very little code-level review (one early review from Petr
in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
whole patch was submitted late to the 9.6 cycle.

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
idea
* I doubt that the archive based switches around StartupReplicationSlots
do what they intend. Afaics that'll not work correctly for basebackups
taken with -X, without recovery.conf

That's from a ~5 minute skim of one patch in the series.

[1] http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2] http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3] http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4] http://archives.postgresql.org/message-id/CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com

Greetings,

Andres Freund


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-04-06 16:06:31
Message-ID: CANP8+jJ+YFHO3Ow3HOhH8SAumGDhaRtt=bDtAvchcbdWySv+rw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6 April 2016 at 15:17, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> > On 6 April 2016 at 14:15, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> > ...
> >
> > Nice summary
> >
> > Failover slots are optional. And they work on master.
> >
> > While the other approach could also work, it will work later and still
> > require a slot on the master.
> >
> >
> > => I don't see why having Failover Slots in 9.6 would prevent us from
> > having something else later, if someone else writes it.
> >
> >
> > We don't need to add this to core. Each plugin can independently write is
> > own failover code. Works, but doesn't seem like the right approach for
> open
> > source.
> >
> >
> > => I think we should add Failover Slots to 9.6.
>
> Simon, please don't take this personal; because of the other ongoing
> thread.
>

Thanks for the review. Rational technical comments are exactly why we are
here and they are always welcome.

> For one I think this is architecturally the wrong choice.

> But even leaving that fact aside, and
> considering this a temporary solution (we can't easily remove),

As I observed above, the alternate solution doesn't sound particularly good
either, but the main point is that we wouldn't need to remove it; it can
coexist happily. I would add that I did think of the alternate solution
previously as well; this one seemed simpler, which is always key for me in
code aimed at robustness.

> there appears to have been very little code level review

That is potentially fixable. At this point I don't claim it is committable;
I only say it is important and the alternate solution is not significantly
better, so if the patch can be beaten into shape we should commit it.

I will spend some time on this and see if we have something viable, which
will be posted here for discussion, as would have happened even before our
other discussions.

Thanks for the points below

> (one early from Petr
> in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
> whole patch was submitted late to the 9.6 cycle.
>
> Quickly skimming 0001 in [4] there appear to be a number of issues:
> * LWLockHeldByMe() is only for debugging, not functional differences
> * ReplicationSlotPersistentData is now in an xlog related header
> * The code and behaviour around name conflicts of slots seems pretty
> raw, and not discussed
> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
> idea
> * I doubt that the archive based switches around StartupReplicationSlots
> do what they intend. Afaics that'll not work correctly for basebackups
> taken with -X, without recovery.conf
>
> That's from a ~5 minute skim, of one patch in the series.
>
>
> [1]
> http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
> [2]
> http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
> [3]
> http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
> [4]
> http://archives.postgresql.org/message-id/CAMsr+YE6LNy2e0tBuAQB+NTVb6W-dHJAfLq0-zbAL7G7hjhXBA@mail.gmail.com
>

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2016-04-08 06:13:03
Message-ID: CAMsr+YGrOMCzWhOj9z0UySB_EdRb2R+qUKyFUFJ31hbBh43qBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6 April 2016 at 22:17, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Quickly skimming 0001 in [4] there appear to be a number of issues:
> * LWLockHeldByMe() is only for debugging, not functional differences
> * ReplicationSlotPersistentData is now in an xlog related header
> * The code and behaviour around name conflicts of slots seems pretty
> raw, and not discussed
> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
> idea
> * I doubt that the archive based switches around StartupReplicationSlots
> do what they intend. Afaics that'll not work correctly for basebackups
> taken with -X, without recovery.conf
>
>
Thanks for looking at it. Most of those are my errors. I think this is
pretty dead at least for 9.6, so I'm mostly following up in the hopes of
learning about a couple of those mistakes.

Good catch with -X without a recovery.conf. Since it wouldn't be recognised
as a promotion and wouldn't increment the timeline, copied non-failover
slots wouldn't get removed. I've never liked that logic at all anyway; I
just couldn't think of anything better...

LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
support only." So that's just a dumb mistake on my part, and I should've
added "alreadyLocked" parameters. (Ugly, but works).

But why would it be a bad idea to conditionally take a code path that
acquires a spinlock based on whether RecoveryInProgress()? It's not testing
RecoveryInProgress() more than once and doing the acquire and release based
on separate tests, which would be a problem. I don't really get the problem
with:

if (!RecoveryInProgress())
{
    /* first check whether there's something to write out */
    SpinLockAcquire(&slot->mutex);
    was_dirty = slot->dirty;
    slot->just_dirtied = false;
    SpinLockRelease(&slot->mutex);

    /* and don't do anything if there's nothing to write */
    if (!was_dirty)
        return;
}

... though I think what I really should've done there is just always
dirty the slot in the redo functions.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Thom Brown <thom(at)linux(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-07-25 16:16:34
Message-ID: CAA-aLv4S0wVoCOAezZqy6o7LU7MNS-a6rmHwhDgyj-PYoaJ_1Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8 April 2016 at 07:13, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 6 April 2016 at 22:17, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>>
>> Quickly skimming 0001 in [4] there appear to be a number of issues:
>> * LWLockHeldByMe() is only for debugging, not functional differences
>> * ReplicationSlotPersistentData is now in an xlog related header
>> * The code and behaviour around name conflicts of slots seems pretty
>> raw, and not discussed
>> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>> idea
>> * I doubt that the archive based switches around StartupReplicationSlots
>> do what they intend. Afaics that'll not work correctly for basebackups
>> taken with -X, without recovery.conf
>>
>
> Thanks for looking at it. Most of those are my errors. I think this is
> pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> learning about a couple of those mistakes.
>
> Good catch with -X without a recovery.conf. Since it wouldn't be recognised
> as a promotion and wouldn't increment the timeline, copied non-failover
> slots wouldn't get removed. I've never liked that logic at all anyway, I
> just couldn't think of anything better...
>
> LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> support only." So that's just a dumb mistake on my part, and I should've
> added "alreadyLocked" parameters. (Ugly, but works).
>
> But why would it be a bad idea to conditionally take a code path that
> acquires a spinlock based on whether RecoveryInProgress()? It's not testing
> RecoveryInProgress() more than once and doing the acquire and release based
> on separate tests, which would be a problem. I don't really get the problem
> with:
>
> if (!RecoveryInProgress())
> {
> /* first check whether there's something to write out */
> SpinLockAcquire(&slot->mutex);
> was_dirty = slot->dirty;
> slot->just_dirtied = false;
> SpinLockRelease(&slot->mutex);
>
> /* and don't do anything if there's nothing to write */
> if (!was_dirty)
> return;
> }
>
> ... though I think what I really should've done there is just always dirty
> the slot in the redo functions.

Are there any plans to submit a new design/version for v11?

Thanks

Thom


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-07-26 00:44:05
Message-ID: CAMsr+YFqhhNUR+7ZYaWpAyHyEPDQ6CoBPtrD3H0gEkWpQDvCmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 26 July 2017 at 00:16, Thom Brown <thom(at)linux(dot)com> wrote:

> On 8 April 2016 at 07:13, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> > On 6 April 2016 at 22:17, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> >>
> >> Quickly skimming 0001 in [4] there appear to be a number of issues:
> >> * LWLockHeldByMe() is only for debugging, not functional differences
> >> * ReplicationSlotPersistentData is now in an xlog related header
> >> * The code and behaviour around name conflicts of slots seems pretty
> >> raw, and not discussed
> >> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
> >> idea
> >> * I doubt that the archive based switches around StartupReplicationSlots
> >> do what they intend. Afaics that'll not work correctly for basebackups
> >> taken with -X, without recovery.conf
> >>
> >
> > Thanks for looking at it. Most of those are my errors. I think this is
> > pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> > learning about a couple of those mistakes.
> >
> > Good catch with -X without a recovery.conf. Since it wouldn't be
> recognised
> > as a promotion and wouldn't increment the timeline, copied non-failover
> > slots wouldn't get removed. I've never liked that logic at all anyway, I
> > just couldn't think of anything better...
> >
> > LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> > support only." So that's just a dumb mistake on my part, and I should've
> > added "alreadyLocked" parameters. (Ugly, but works).
> >
> > But why would it be a bad idea to conditionally take a code path that
> > acquires a spinlock based on whether RecoveryInProgress()? It's not
> testing
> > RecoveryInProgress() more than once and doing the acquire and release
> based
> > on separate tests, which would be a problem. I don't really get the
> problem
> > with:
> >
> > if (!RecoveryInProgress())
> > {
> > /* first check whether there's something to write out */
> > SpinLockAcquire(&slot->mutex);
> > was_dirty = slot->dirty;
> > slot->just_dirtied = false;
> > SpinLockRelease(&slot->mutex);
> >
> > /* and don't do anything if there's nothing to write */
> > if (!was_dirty)
> > return;
> > }
> >
> > ... though I think what I really should've done there is just always
> dirty
> > the slot in the redo functions.
>
> Are there any plans to submit a new design/version for v11?

No. The whole approach seems to have been bounced from core. I don't agree
and continue to think this functionality is desirable but I don't get to
make that call.

If time permits I will attempt to update the logical decoding on standby
patchset instead, and if possible add support for fast-forward logical
decoding that does the minimum required to correctly maintain a slot's
catalog_xmin and restart_lsn when advanced. But this won't be usable
directly for failover like failover slots are; it'll require each
application to keep track of standbys and maintain slots on them too. Or
we'll need some kind of extension/helper to sync slot state.

In the mean time, 2ndQuadrant maintains an on-disk-compatible version of
failover slots that's available for support customers.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-02 20:35:17
Message-ID: CA+TgmoZsrQm+ALJDjMhfsvPJ3pHAAdSV363oiV6oOkBFc=ihEg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> No. The whole approach seems to have been bounced from core. I don't agree
> and continue to think this functionality is desirable but I don't get to
> make that call.

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core. In a review of this
thread I don't see anyone saying otherwise. The debate has really
been about the right way of implementing that. Suppose we did
something like this:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.
- The master responds by periodically notifying the standby of changes
to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.

So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above. It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-08 08:00:43
Message-ID: CAMsr+YH0y2V9615s7Aeedzs3DvgrWBt28LP2K3FQi6uKKfJjMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3 August 2017 at 04:35, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer <craig(at)2ndquadrant(dot)com>
> wrote:
> > No. The whole approach seems to have been bounced from core. I don't
> agree
> > and continue to think this functionality is desirable but I don't get to
> > make that call.
>
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core. In a review of this
> thread I don't see anyone saying otherwise. The debate has really
> been about the right way of implementing that. Suppose we did
> something like this:
>
> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
>

Wouldn't that immediately exclude use for PITR and snapshot recovery? I
have people right now who want the ability to promote a PITR-recovered
snapshot into place of a logical replication master and have downstream
peers replay from it. It's more complex than that, as there's a resync
process required to recover changes the failed node had sent to other peers
but that aren't available in the WAL archive, but that's the gist.

If you have a 5TB database do you want to run an extra replica or two
because PostgreSQL can't preserve slots without a running, live replica?
Your SAN snapshots + WAL archiving have been fine for everything else so
far.

Requiring live replication connections could also be an issue for service
interruptions, surely? Unless you persist needed knowledge in the physical
replication slot used by the standby-to-master connection, so the master
can tell the difference between "downstream went away for a while but will
come back" and "downstream is gone forever, toss out its resources."

That's exactly what the catalog_xmin hot_standby_feedback patches in Pg10
do, but they can only tell the master about the oldest resources needed by
any existing slot on the replica. Not which slots. And they have the same
issues with needing a live, running replica.

Also, what about cascading? Lots of "pull" model designs I've looked at
tend to fall down in cascaded environments. For that matter so do failover
slots, but only for the narrower restriction of not being able to actually
decode from a failover-enabled slot on a standby; they still work fine in
terms of cascading down to leaf nodes.

- The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.
>

That's pretty much what I expect to have to do for clients to work on
unpatched Pg10, probably using a separate bgworker and normal libpq
connections to the upstream since we don't have hooks to extend the
walsender/walreceiver.

It can work now that the catalog_xmin hot_standby_feedback patches are in,
but it'd require some low-level slot state setting that I know Andres is
not a fan of. So I expect to carry on relying on an out-of-tree failover
slots patch for Pg 10.

> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above. It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.
>
> Thoughts?

Right. So the standby must be running and in active communication. It needs
some way to know the master has confirmed slot creation and it can rely on
the slot's resources really being reserved by the master. That turns out to
be quite hard, per the decoding on standby patches. There needs to be some
way to tell the master a standby has gone away forever and to drop its
dependent slots, so you're not stuck wondering "is slot xxyz from standby
abc that we lost in that crash?". Standbys need to cope with having created
a slot, only to find out there's a name collision with master.

For all those reasons, I just extended hot_standby_feedback to report
catalog_xmin separately to upstreams instead, so the existing physical slot
serves all these needs. And it's part of the picture, but there's no way to
get slot position change info from the master back down onto the replicas
so the replicas can advance any of their own slots and, via feedback, free
up master resources. That's where the bgworker hack to query
pg_replication_slots comes in. Seems complex, full of restrictions, and
fragile to me compared to just expecting the master to do it.
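
For illustration, the sort of thing that helper might poll on the master over
an ordinary libpq connection (just a sketch; the column list is whatever the
replica-side code needs, and the hard part of applying that state to local
slots isn't shown):

SELECT slot_name, plugin, slot_type, database,
       restart_lsn, confirmed_flush_lsn, catalog_xmin
FROM pg_replication_slots
WHERE slot_type = 'logical';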

The only objection I personally understood and accepted re failover slots
was that it'd be impossible to create a failover slot on a standby and have
that standby "sub-tree" support failover to leaf nodes. Which is true, but
instead we have nothing, and no viable-looking roadmap toward anything users
can benefit from. So I don't think that's the worst restriction in the
world.

I do not understand why logical replication slots are exempt from our usual
policy that anything that works on the master should be expected to work on
failover to a standby. Is there anything persistent across crash for which
that's not the case, except grandfathered-in hash indexes? We're hardly
going to say "hey, it's ok to forget about prepared xacts when you fail
over to a standby" yet this problem with failover and slots in logical
decoding and replication is the same sort of showstopper issue for users
who use the functionality.

In the medium term I've given up making progress with getting something
simple and usable into user hands on this. A tweaked version of failover
slots is being carried as an out-of-tree on-disk-format-compatible patch
instead, and it's meeting customer needs very well. I've done my dash here
and moved on to other things where I can make more progress.

I'd like to continue working on logical decoding on standby support for
pg11 too, but even if we can get that in place it'll only work for
reachable, online standbys. Every application that uses logical decoding
will have to maintain a directory of standbys (which it has no way to ask
the master for) and advance their slots via extra walsender connections.
They'll do a bunch of unnecessary work decoding WAL they don't need to just
to throw the data away. It won't help for PITR and snapshot use cases at
all. So for now I'm not able to allocate much priority to that.

I'd love to get failover slots in, I still think it's the simplest and best
way to do what users need. It doesn't stop us progressing with decoding on
standby or paint us into any corners.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-09 15:42:00
Message-ID: CA+Tgmob=sT4am9k71Rz465vBj9-hWoMdrFLrP4aE5Nr3pkmYBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
>> - When a standby connects to a master, it can optionally supply a list
>> of slot names that it cares about.
>
> Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
> people right now who want the ability to promote a PITR-recovered snapshot
> into place of a logical replication master and have downstream peers replay
> from it. It's more complex than that, as there's a resync process required
> to recover changes the failed node had sent to other peers but isn't
> available in the WAL archive, but that's the gist.
>
> If you have a 5TB database do you want to run an extra replica or two
> because PostgreSQL can't preserve slots without a running, live replica?
> Your SAN snapshots + WAL archiving have been fine for everything else so
> far.

OK, so what you're basically saying here is that you want to encode
the failover information in the write-ahead log rather than passing it
at the protocol level, so that if you replay the write-ahead log on a
time delay you get the same final state that you would have gotten if
you had replayed it immediately. I hadn't thought about that
potential advantage, and I can see that it might be an advantage for
some reason, but I don't yet understand what the reason is. How would
you imagine using any version of this feature in a PITR scenario? If
you PITR the master back to an earlier point in time, I don't see how
you're going to manage without resyncing the replicas, at which point
you may as well just drop the old slot and create a new one anyway.
Maybe you're thinking of a scenario where we PITR the master and also
use PITR to rewind the replica to a slightly earlier point? But I
can't quite follow what you're thinking about. Can you explain
further?

> Requiring live replication connections could also be an issue for service
> interruptions, surely? Unless you persist needed knowledge in the physical
> replication slot used by the standby to master connection, so the master can
> tell the difference between "downstream went away for while but will come
> back" and "downstream is gone forever, toss out its resources."

I don't think the master needs to retain any resources on behalf of
the failover slot. If the slot has been updated by feedback from the
associated standby, then the master can toss those resources
immediately. When the standby comes back on line, it will find out
via a protocol message that it can fast-forward the slot to whatever
the new LSN is, and any WAL files before that point are irrelevant on
both the master and the standby.

> Also, what about cascading? Lots of "pull" model designs I've looked at tend
> to fall down in cascaded environments. For that matter so do failover slots,
> but only for the narrower restriction of not being able to actually decode
> from a failover-enabled slot on a standby, they still work fine in terms of
> cascading down to leaf nodes.

I don't see the problem. The cascaded standby tells the standby "I'm
interested in the slot called 'craig'" and the standby says "sure,
I'll tell you whenever 'craig' gets updated" but it turns out that
'craig' is actually a failover slot on that standby, so that standby
has said to the master "I'm interested in the slot called 'craig'" and
the master is therefore sending updates to that standby. Every time
the slot is updated, the master tells the standby and the standby
tells the cascaded standby and, well, that all seems fine.

Also, as Andres pointed out upthread, if the state is passed through
the protocol, you can have a slot on a standby that cascades to a
cascaded standby; if the state is passed through the WAL, all slots
have to cascade from the master. Generally, with protocol-mediated
failover slots, you can have a different set of slots on every replica
in the cluster and create, drop, and reconfigure them any time you
like. With WAL-mediated slots, all failover slots must come from the
master and cascade to every standby you've got, which is less
flexible.

I don't want to come on too strong here. I'm very willing to admit
that you may know a lot more about this than me and I am really
extremely happy to benefit from that accumulated knowledge. If you're
saying that WAL-mediated slots are a lot better than protocol-mediated
slots, you may well be right, but I don't yet understand the reasons,
and I want to understand the reasons. I think this stuff is too
important to just have one person saying "here's a patch that does it
this way" and everybody else just says "uh, ok". Once we adopt some
proposal here we're going to have to continue supporting it forever,
so it seems like we'd better do our best to get it right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-10 06:38:46
Message-ID: CAMsr+YGfaT1N_0cofrDh8ePu604GRnziweJLYQjTk=O3zH=uog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 9 August 2017 at 23:42, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig(at)2ndquadrant(dot)com>
> wrote:
> >> - When a standby connects to a master, it can optionally supply a list
> >> of slot names that it cares about.
> >
> > Wouldn't that immediately exclude use for PITR and snapshot recovery? I
> have
> > people right now who want the ability to promote a PITR-recovered
> snapshot
> > into place of a logical replication master and have downstream peers
> replay
> > from it. It's more complex than that, as there's a resync process
> required
> > to recover changes the failed node had sent to other peers but isn't
> > available in the WAL archive, but that's the gist.
> >
> > If you have a 5TB database do you want to run an extra replica or two
> > because PostgreSQL can't preserve slots without a running, live replica?
> > Your SAN snapshots + WAL archiving have been fine for everything else so
> > far.
>
> OK, so what you're basically saying here is that you want to encode
> the failover information in the write-ahead log rather than passing it
> at the protocol level, so that if you replay the write-ahead log on a
> time delay you get the same final state that you would have gotten if
> you had replayed it immediately. I hadn't thought about that
> potential advantage, and I can see that it might be an advantage for
> some reason, but I don't yet understand what the reason is. How would
> you imagine using any version of this feature in a PITR scenario? If
> you PITR the master back to an earlier point in time, I don't see how
> you're going to manage without resyncing the replicas, at which point
> you may as well just drop the old slot and create a new one anyway.
>

I've realised that it's possible to work around it in app-space anyway. You
create a new slot on a node before you snapshot it, and you don't drop this
slot until you discard the snapshot. The existence of this slot ensures
that any WAL generated by the node (and replayed by PITR after restore)
cannot clobber needed catalog_xmin. If we xlog catalog_xmin advances or
have some other safeguard in place, which we need for logical decoding on
standby to be safe anyway, then we can fail gracefully if the user does
something dumb.
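
For example, a minimal sketch of that guard (hypothetical slot name; any
logical output plugin will do, test_decoding shown):

-- before taking the base backup / snapshot of the node
SELECT pg_create_logical_replication_slot('snapshot_guard', 'test_decoding');

-- ... take and retain the snapshot ...

-- only once the snapshot is discarded (or the restored node has re-created
-- the slots it needs) is it safe to drop the guard
SELECT pg_drop_replication_slot('snapshot_guard');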

So no need to care about this.

(What I wrote previously on this was):

You definitely can't just PITR restore and pick up where you left off.

You need a higher level protocol between replicas to recover. For example,
in a multi-master configuration, this can be something like (simplified):

* Use the timeline history file to find the lsn at which we diverged from
our "future self", the failed node
* Connect to the peer and do logical decoding, with a replication origin
filter for "originating from me", for xacts from the divergence lsn up to
the peer's current end-of-wal.
* Reset peer's replication origin for us to our new end-of-wal, and resume
replication

To enable that to be possible, since we can't rewind slots once confirmed
advanced, maintain a backup slot on the peer corresponding to the
point-in-time at which a snapshot was taken.

For most other situations there is little benefit vs just re-creating the
slot before you permit user-initiated write xacts to begin on the
restored node.

I can accept an argument that "we" as pgsql-hackers do not consider this
something worth caring about, should that be the case. It's niche enough
that you could argue it doesn't have to be supportable in stock postgres.

Maybe you're thinking of a scenario where we PITR the master and also
> use PITR to rewind the replica to a slightly earlier point?

That can work, but must be done in lock-step. You have to pause apply on
both ends for long enough to snapshot both; otherwise the replication
origins on one end get out of sync with the slots on the other.

Interesting, but I really hope nobody's going to need to do it.

> But I
> can't quite follow what you're thinking about. Can you explain
> further?
>

Gladly.

I've been up to my eyeballs in this for years now, and sometimes it becomes
quite hard to see the outside perspective, so thanks for your patience.

>
> > Requiring live replication connections could also be an issue for service
> > interruptions, surely? Unless you persist needed knowledge in the
> physical
> > replication slot used by the standby to master connection, so the master
> can
> > tell the difference between "downstream went away for while but will come
> > back" and "downstream is gone forever, toss out its resources."
>
> I don't think the master needs to retain any resources on behalf of
> the failover slot. If the slot has been updated by feedback from the
> associated standby, then the master can toss those resources
> immediately. When the standby comes back on line, it will find out
> via a protocol message that it can fast-forward the slot to whatever
> the new LSN is, and any WAL files before that point are irrelevant on
> both the master and the standby.
>

OK, so you're envisioning that every slot on a downstream has a mirror slot
on the upstream, and that is how the master retains the needed resources.

> > Also, what about cascading? Lots of "pull" model designs I've looked at
> tend
> > to fall down in cascaded environments. For that matter so do failover
> slots,
> > but only for the narrower restriction of not being able to actually
> decode
> > from a failover-enabled slot on a standby, they still work fine in terms
> of
> > cascading down to leaf nodes.
>
> I don't see the problem. The cascaded standby tells the standby "I'm
> interested in the slot called 'craig'" and the standby says "sure,
> I'll tell you whenever 'craig' gets updated" but it turns out that
> 'craig' is actually a failover slot on that standby, so that standby
> has said to the master "I'm interested in the slot called 'craig'" and
> the master is therefore sending updates to that standby. Every time
> the slot is updated, the master tells the standby and the standby
> tells the cascaded standby and, well, that all seems fine.
>

Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
correspondence, and using globally unique slot names to manage state.

If slot names collide, you presumably fail with "er, don't do that then",
or scramble data horribly. Both of these we certainly have precedent for
in Pg (see, e.g., what happens if two snapshots of the same node are in
archive recovery and promote to the same timeline, then start archiving to
the same destination...). So not a showstopper.

I'm pretty OK with that.

Also, as Andres pointed out upthread, if the state is passed through
> the protocol, you can have a slot on a standby that cascades to a
> cascaded standby; if the state is passed through the WAL, all slots
> have to cascade from the master.

Yes, that's my main hesitation with the current failover slots, as
mentioned in the prior message.

> Generally, with protocol-mediated
> failover slots, you can have a different set of slots on every replica
> in the cluster and create, drop, and reconfigure them any time you
> like. With WAL-mediated slots, all failover slots must come from the
> master and cascade to every standby you've got, which is less
> flexible.
>

Definitely agreed.

Different standbys don't know about each other, so it's the user's job to
ensure uniqueness, using the slot name as a key.

I don't want to come on too strong here. I'm very willing to admit
> that you may know a lot more about this than me and I am really
> extremely happy to benefit from that accumulated knowledge.

The flip side is that I've also been staring at the problem, on and off,
for WAY too long. So other perspectives can be really valuable.

> If you're
> saying that WAL-mediated slots are a lot better than protocol-mediated
> slots, you may well be right, but I don't yet understand the reasons,
> and I want to understand the reasons. I think this stuff is too
> important to just have one person saying "here's a patch that does it
> this way" and everybody else just says "uh, ok". Once we adopt some
> proposal here we're going to have to continue supporting it forever,
> so it seems like we'd better do our best to get it right.

I mostly agree there. We could have relatively easily converted WAL-based
failover slots to something else in a major version bump, and that's why I
wanted to get them in place for 9.6 and then later for pg10. Because people
were (and are) constantly asking me and others who work on logical
replication tools why it doesn't work, and a 90% solution that doesn't
paint us into a corner seemed just fine.

I'm quite happy to find a better one. But I cannot spend a lot of time
writing something to have it completely knocked back because the scope just
got increased again and now it has to do more, so it needs another rewrite.

So, how should this look if we're using the streaming rep protocol?

How about:

A "failover slot" is identified by a field in the slot struct and exposed
in pg_replication_slots. It can be null (not a failover slot). It can
indicate that the slot was created locally and is "owned" by this node; all
downstreams should mirror it. It can also indicate that it is a mirror of
an upstream, in which case clients may not replay from it until it's
promoted to an owned slot and ceases to be mirrored. Attempts to replay
from a mirrored slot just ERROR and will do so even once decoding on
standby is supported.

This promotion happens automatically if a standby is promoted to a master,
and can also be done manually via sql function call or walsender command to
allow for an internal promotion within a cascading replica chain.

When a replica connects to an upstream it asks via a new walsender msg
"send me the state of all your failover slots". Any local mirror slots are
updated. If they are not listed by the upstream they are known deleted, and
the mirror slots are deleted on the downstream.

The upstream walsender then sends periodic slot state updates while
connected, so replicas can advance their mirror slots, and in turn send
hot_standby_feedback that gets applied to the physical replication slot
used by the standby, freeing resources held for the slots on the master.

There's one big hole left here. When we create a slot on a cascading leaf
or inner node, it takes time for hot_standby_feedback to propagate the
needed catalog_xmin "up" the chain. Until the master has set the needed
catalog_xmin on the physical slot for the closest branch, the inner node's
slot's catalog_xmin can only be tentative pending confirmation. That's what
a whole bunch of gruesomeness in the decoding on standby patch was about.

One possible solution to this is to also mirror slots "up", as you alluded
to: when you create an "owned" slot on a replica, it tells the master at
connect time / slot creation time "I have this slot X, please copy it up
the tree". The slot gets copied "up" to the master via cascading layers
with a different failover slot type indicating it's an up-mirror. Decoding
clients aren't allowed to replay from an up-mirror slot and it cannot be
promoted like a down-mirror slot can; it's only there for resource
retention. A node knows its owned slot is safe to actually use, and is
fully created, when it sees the walsender report it in the list of failover
slots from the master during a slot state update.

This imposes some restrictions:

* failover slot names must be globally unique or things go "kaboom"
* if a replica goes away, its up-mirror slots stay dangling until the admin
manually cleans them up

Tolerable, IMO. But we could fix the latter by requiring that failover
slots only be enabled when the replica uses a physical slot to talk to the
upstream. The up-mirror failover slots then get coupled to the physical
slot by an extra field in the slot struct holding the name of the owning
physical slot. Dropping that physical slot cascade-drops all up-mirror
slots automatically. Admins are prevented from dropping up-mirror slots
manually, which protects against screwups.

We could even fix the naming, maybe, with some kind of qualified naming
based on the physical slot, but it's not worth the complexity.

It sounds a bit more complex than your sketch, but I think the four
failover kinds are necessary to support this. We'll have:

* not a failover slot, purely local

* a failover slot owned by this node (will be usable for decoding on
standby once supported)

* an up-mirror slot, not promoteable, resource retention only, linked to a
physical slot for a given replica

* a down-mirror slot, promoteable, not linked to a physical slot; this is
the true "failover slot"'s representation on a replica.
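
In struct terms, and purely as a naming sketch (none of this exists), that
could be an enum plus one extra name field in the slot's persistent data:

#include "postgres.h"

typedef enum ReplicationSlotFailoverKind
{
    RS_FAILOVER_NONE,           /* purely local slot */
    RS_FAILOVER_OWNED,          /* created here; downstreams mirror it */
    RS_FAILOVER_UP_MIRROR,      /* resource retention only, tied to a
                                 * physical slot, never promoteable */
    RS_FAILOVER_DOWN_MIRROR     /* copy of an upstream slot; no replay
                                 * until promoted */
} ReplicationSlotFailoverKind;

/* Imagined additions to ReplicationSlotPersistentData: */
typedef struct FailoverSlotExtraData
{
    ReplicationSlotFailoverKind failover_kind;
    NameData    owning_physical_slot;   /* set only for up-mirror slots */
} FailoverSlotExtraData;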

Thoughts? Feels pretty viable to me.

Thanks for the new perspective.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-10 17:02:17
Message-ID: CA+Tgmob+GdtfKgkTCRDTVnJQ-j-G_TYE+-OVq-t9qgLaG+BPYA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 10, 2017 at 2:38 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
> correspondence, and using globally unique slot names to manage state.

Yes, that's what I'm imagining. (Whether I should instead be
imagining something else is the important question.)

> I'm quite happy to find a better one. But I cannot spend a lot of time
> writing something to have it completely knocked back because the scope just
> got increased again and now it has to do more, so it needs another rewrite.

Well, I can't guarantee anything about that. I don't tend to argue
against designs to which I myself previously agreed, but other people
may, and there's not a lot I can do about that (although sometimes I
try to persuade them that they're wrong, if I think they are). Of
course, sometimes you implement something and it doesn't look as good
as you thought it would; that's a risk of software development
generally. I'd like to push back a bit on the underlying assumption,
though: I don't think that there was ever an agreed-upon design on
this list for failover slots before the first patch showed up. Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

> A "failover slot" is identified by a field in the slot struct and exposed
> in pg_replication_slots. It can be null (not a failover slots). It can
> indicate that the slot was created locally and is "owned" by this node; all
> downstreams should mirror it. It can also indicate that it is a mirror of an
> upstream, in which case clients may not replay from it until it's promoted
> to an owned slot and ceases to be mirrored. Attempts to replay from a
> mirrored slot just ERROR and will do so even once decoding on standby is
> supported.

+1

> This promotion happens automatically if a standby is promoted to a master,
> and can also be done manually via sql function call or walsender command to
> allow for an internal promotion within a cascading replica chain.

+1.

> When a replica connects to an upstream it asks via a new walsender msg "send
> me the state of all your failover slots". Any local mirror slots are
> updated. If they are not listed by the upstream they are known deleted, and
> the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?

> The upstream walsender then sends periodic slot state updates while
> connected, so replicas can advance their mirror slots, and in turn send
> hot_standby_feedback that gets applied to the physical replication slot used
> by the standby, freeing resources held for the slots on the master.

+1.

> There's one big hole left here. When we create a slot on a cascading leaf or
> inner node, it takes time for hot_standby_feedback to propagate the needed
> catalog_xmin "up" the chain. Until the master has set the needed
> catalog_xmin on the physical slot for the closest branch, the inner node's
> slot's catalog_xmin can only be tentative pending confirmation. That's what
> a whole bunch of gruesomeness in the decoding on standby patch was about.
>
> One possible solution to this is to also mirror slots "up", as you alluded
> to: when you create an "owned" slot on a replica, it tells the master at
> connect time / slot creation time "I have this slot X, please copy it up the
> tree". The slot gets copied "up" to the master via cascading layers with a
> different failover slot type indicating it's an up-mirror. Decoding clients
> aren't allowed to replay from an up-mirror slot and it cannot be promoted
> like a down-mirror slot can, it's only there for resource retention. A node
> knows its owned slot is safe to actually use, and is fully created, when it
> sees the walsender report it in the list of failover slots from the master
> during a slot state update.

I'm not sure that this actually prevents the problem you describe. It
also seems really complicated. Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-11 03:03:40
Message-ID: CAMsr+YGbc7vnf+BAEYxbEGG6=4hp=yJ5YXTgZAFm2fqyGa1hew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11 August 2017 at 01:02, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Well,
> anybody's welcome to write code without discussion and drop it to the
> list, but if people don't like it, that's the risk you took by not
> discussing it first.
>

Agreed, patches materializing doesn't mean they should be committed, and
there wasn't prior design discussion on this.

It can be hard to elicit that without a patch, but clearly not always;
we're doing a good job of it here.

> > When a replica connects to an upstream it asks via a new walsender msg
> "send
> > me the state of all your failover slots". Any local mirror slots are
> > updated. If they are not listed by the upstream they are known deleted,
> and
> > the mirror slots are deleted on the downstream.
>
> What about slots not listed by the upstream that are currently in use?
>

Yes, it'll also need to send a list of its local owned and up-mirrored
failover slots to the upstream so the upstream can create them or update
their state.

> There's one big hole left here. When we create a slot on a cascading leaf
> or
> > inner node, it takes time for hot_standby_feedback to propagate the
> needed
> > catalog_xmin "up" the chain. Until the master has set the needed
> > catalog_xmin on the physical slot for the closest branch, the inner
> node's
> > slot's catalog_xmin can only be tentative pending confirmation. That's
> what
> > a whole bunch of gruesomeness in the decoding on standby patch was about.
> >
> > One possible solution to this is to also mirror slots "up", as you
> alluded
> > to: when you create an "owned" slot on a replica, it tells the master at
> > connect time / slot creation time "I have this slot X, please copy it up
> the
> > tree". The slot gets copied "up" to the master via cascading layers with
> a
> > different failover slot type indicating it's an up-mirror. Decoding
> clients
> > aren't allowed to replay from an up-mirror slot and it cannot be promoted
> > like a down-mirror slot can, it's only there for resource retention. A
> node
> > knows its owned slot is safe to actually use, and is fully created, when
> it
> > sees the walsender report it in the list of failover slots from the
> master
> > during a slot state update.
>
> I'm not sure that this actually prevents the problem you describe. It
> also seems really complicated. Maybe you can explain further; perhaps
> there is a simpler solution (or perhaps this isn't as complicated as I
> currently think it is).

It probably sounds more complex than it is. A slot is created tentatively
and marked not ready to actually use yet when created on a standby. It
flows "up" to the master where it's created as permanent/ready. The
permanent/ready state flows back down to the creator.

When we see a temp slot become permanent we copy the
restart_lsn/catalog_xmin/confirmed_flush_lsn from the upstream slot in case
the master had to advance them from our tentative values when it created
the slot. After that, slot state updates only flow "out" from the owner: up
the tree for up-mirror slots, down the tree for down-mirror slots.
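
A sketch of that step on the standby, with the usual caveat that the
struct, field and function names here are assumptions of this design, not
existing code (persisting the change and recomputing the global xmin/LSN
minima is elided):

#include "postgres.h"
#include "access/xlogdefs.h"
#include "replication/slot.h"
#include "storage/spin.h"

/* Hypothetical upstream-reported state for one failover slot. */
typedef struct FailoverSlotState
{
    NameData        name;
    TransactionId   catalog_xmin;
    XLogRecPtr      restart_lsn;
    XLogRecPtr      confirmed_flush;
} FailoverSlotState;

static void
MirrorSlotBecamePermanent(ReplicationSlot *slot, FailoverSlotState *upstream)
{
    SpinLockAcquire(&slot->mutex);

    /*
     * The master may have advanced these past our tentative values when it
     * created its copy, so take the upstream's positions.
     */
    slot->data.restart_lsn = upstream->restart_lsn;
    slot->data.catalog_xmin = upstream->catalog_xmin;
    slot->data.confirmed_flush = upstream->confirmed_flush;
    slot->data.persistency = RS_PERSISTENT;     /* no longer tentative */

    SpinLockRelease(&slot->mutex);
}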

Spelling out the sequence may help. I focused only on the logical slot
created on standby case, since I think we're happy with the rest already
and I don't want to complicate it.

Setup: "A" is the master; "B" is a physical replica of "A" using physical
slot "B"; "C" is a physical replica of "B" using physical slot "C".

1. "C" CREATEs logical slot "X" ("owned"), initially as a temp slot.
2. "B" learns of it and creates up-mirror slot "X", linked to physical
   slot "C" and marked temp.
3. "A" learns of it and creates up-mirror slot "X", linked to physical
   slot "B" and marked permanent.
4. Meanwhile, a client attempting to decode from "X" on "C" gets:
   ERROR: slot X still being created on master, not ready
5. "B" sees up-mirror slot "X" in the slot list from "A", marks its copy
   permanent and copies the state.
6. "C" sees that up-mirror slot "X" on "B" got marked permanent (because
   it appears in B's slot listings), marks "X" permanent on "C" and copies
   the state.
7. Slot "X" is now persistent and (when decoding on standby is supported)
   can be used for decoding on standby.

(also avail as
https://gist.github.com/ringerc/d4a8fe97f5fd332d8b883d596d61e257 )

To actually use the slot once decoding on standby is supported: a decoding
client on "C" can consume xacts and cause slot "X" to advance catalog_xmin,
confirmed_flush_lsn, etc. walreceiver on "C" will tell walsender on "B"
about the new slot state, and it'll get synced up-tree, then B will tell A.

Since the slot is already marked permanent, state won't get copied back
down-tree; that only happens once, when the slot is first fully created on
the master.

Some node "D" can exist as a phys rep of "C". If C fails and is replace
with D, admin can promote the down-mirror slot on "D" to an owned slot.

Make sense?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-12 00:03:19
Message-ID: 20170812000319.4higzrilqyxstbko@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core. In a review of this
> thread I don't see anyone saying otherwise. The debate has really
> been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that. Besides disagreeing with the proposed
implementation, our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to be
submitted/merged to core.

> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
> - The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.

> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above. It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.

I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axes I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
available everywhere, or not?
- What kind of PITR integration do we want? Note that simple WAL based
slots do *NOT* provide proper PITR support, there's not enough
interlock easily available (you'd have to save slots at the end, then
increment minRecoveryLSN to a point later than the slot saving)
- How much divergence are we going to accept between logical decoding on
standbys, and failover slots. I'm probably a lot closer to zero than
Craig is.
- How much divergence are we going to accept between infrastructure for
logical failover, and logical failover via failover slots (or however
we're naming this)? Again, I'm probably a lot closer to zero than
Craig is.

Greetings,

Andres Freund


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-14 03:56:28
Message-ID: CAMsr+YEupkmQR2zogMqySeskbNv24AhymWMQHbguJzQxCvZrow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12 August 2017 at 08:03, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> > I actually think failover slots are quite desirable, especially now
> > that we've got logical replication in core. In a review of this
> > thread I don't see anyone saying otherwise. The debate has really
> > been about the right way of implementing that.
>
> Given that I presumably was one of the people pushing back more
> strongly: I agree with that. Besides disagreeing with the proposed
> implementation our disagreements solely seem to have been about
> prioritization.
>
> I still think we should have a halfway agreed upon *design* for logical
> failover, before we introduce a concept that's quite possibly going to
> be incompatible with that, however. But that doesn't mean it has to
> submitted/merged to core.
>

How could it be incompatible? The idea here is to make physical failover
transparent to logical decoding clients. That's not meant to sound
confrontational; I mean that I can't personally see any way it would be,
and I could use your ideas.

I understand that it might be *different* and that you'd like to see more
closely aligned approaches that work more similarly, for which we first
need to know more clearly how logical failover will look. But it's hard
not to also see this as delaying and blocking until your preferred
approach via pure logical rep and logical failover gets in, at which point
physical failover can be dismissed with "we don't need that anymore". I'm
sure that's not your intent; I just struggle not to see it that way when
there's always another reason not to proceed on this problem because of a
loosely related development effort on another one.

> I think there's a couple design goals we need to agree upon, before
> going into the weeds of how exactly we want this to work. Some of the
> axis I can think of are:
>
> - How do we want to deal with cascaded setups, do slots have to be
> available everywhere, or not?
>

Personally, I don't care either way.

> - What kind of PITR integration do we want? Note that simple WAL based
> slots do *NOT* provide proper PITR support, there's not enough
> interlock easily available (you'd have to save slots at the end, then
> increment minRecoveryLSN to a point later than the slot saving)
>

Interesting. I haven't fully understood this, but think I see what you're
getting at.

As outlined in the prior mail, I'd like to have working PITR with logical
slots but think it's pretty niche as it can't work usefully without plenty
of co-operation from the rest of the logical replication software in use.
You can't just restore and resume normal operations. So I don't think it's
worth making it a priority.

It's possible to make PITR safe with slots by blocking further advance of
catalog_xmin on the running master for the life of the PITR base backup
using a slot for retention. There's plenty of room for operator error
until/unless we add something like catalog_xmin advance xlog'ing, but it
can be done now with external tools if you're careful. Details in the prior
mail.

I don't think PITR for logical slots is important given there's a
workaround and it's not simple to actually do anything with it if you have
it.

> - How much divergence are we going to accept between logical decoding on
> standbys, and failover slots. I'm probably a lot closer to closer than
> than Craig is.
>

They're different things to me, but I think you're asking "to what extent
should failover slots functionality be implemented strictly on top of
decoding on standby?"

"Failover slots" provides a mechanism by which a logical decoding client
can expect a slot it creates on a master (or physical streaming replica
doing decoding on standby) to continue to exist. The client can ignore
physical HA and promotions of the master, which can continue to be managed
using normal postgres tools. It's the same as, say, an XA transaction
manager expecting that if your master dies and you fail over to a standby,
the TM shouldn't have to have been doing special housekeeping on the
promotion candidate before promotion in order for 2PC to continue to work.
It Just Works.

Logical decoding on standby is useful with or without failover slots, as
you can use it to extract data from a replica, and now that decoding
timeline following is in, a decoding connection on a replica will survive
promotion to master.

But in addition to its main purpose of allowing logical decoding from a
standby server to offload work, it can be used to implement client-managed
support for failover to physical replicas. For this, the client must have
an inventory of promotion-candidates of the master and their connstrings so
it can maintain slots on them too. The client must be able to connect to
all promotion-candidates and advance their slots via decoding along with
the master slots it's actually replaying from. If a client isn't "told"
about a promotion candidate, decoding will break when we fail over. If a
client cannot connect to a promotion candidate, catalog_xmin will fall
behind on master until the replica is discarded (and its physical slot
dropped) or the client regains access. Every different logical decoding
client application must implement all this logic and management separately.

It may be possible to implement failover-slots-like functionality based on
decoding on standby in an app-transparent way, by having the replica
monitor slot states on the master and self-advance its own slots via a
loopback decoding connection. Or the master could maintain an inventory of
replicas and make decoding connections to them where it advances their
slots after the masters' slots are advanced by an app. But either way, why
would we want to do this? Why actually decode WAL and use the logical
decoding machinery when we *know* the state of the system because only the
master is writeable?

The way I see it, to provide failover slots functionality we'd land up with
something quite similar to what Robert and I just discussed, but the slot
advance would be implemented using decoding (on standby) instead of
directly setting slot state. What benefit does that offer?

I don't want to block failover slots on decoding on standby just because
decoding on standby would be nice to have.

> - How much divergence are we going to accept between infrastructure for
> logical failover, and logical failover via failover slots (or however
> we're naming this)? Again, I'm probably a lot closer to zero than
> craig is.

We don't have logical failover, let alone mature, tested logical failover
that covers most of Pg's available functionality. Nor much of a design for
it AFAIK. There is no logical failover to diverge from, and I don't want to
block physical failover support on that.

But, putting that aside to look at the details of how logical failover
might work, what sort of commonality do you expect to see? Physical
failover is by WAL replication using archive recovery/streaming, managed
via recovery.conf, with unilateral promotion by trigger file/command. The
admin is expected to ensure that any clients and cascading replicas get
redirected to the promoted node and the old one is fenced - and we don't
care if that's done by IP redirection or connstring updates or what. Per
the proposal Robert and I discussed, logical slots will be managed by
having the walsender/walreceiver exchange slot state information that
cascades up/down the replication tree via mirror slot creations.

How's logical replica promotion going to work? Here's one possible way, of
many: the promotion-candidate logical replica consumes an unfiltered xact
stream that contains changes from all nodes, not just its immediate
upstream. Downstreams of the master can maintain direct connections to the
promotion candidate and manage their own slots directly, sending flush
confirmations for slots on the promotion candidate as they see their
decoding sessions on the replica decode commits for LSNs the clients sent
flush confirmations to the master for. On promotion, the master's
downstreams would be reconfigured to connect to the node-id of the newly
promoted master and would begin decoding from it in catchup mode, where
they receive the commits from the old master via the new master, until they
reach the new master's end-of-wal at time of promotion. With some tweaks
like a logical WAL message recording the moment of promotion, it's not that
different to the client-managed physical failover model.

It can also be converted to a more transparent failover-slots-like model by
having the promotion candidate physical replica clone slots from its
upstream, but advance them by loopback decoding - not necessarily actual
network loopback. It'd use a filter that discards data and only sees the
commit XIDs + LSNs. It'd send confirmations on the slots when the local
slot processed a commit for which the upstream's copy of the slot had a
confirmation for that lsn. On promotion, replicas would connect with new
replorigins (0) and let decoding start at the slot positions on the
replica. The master->replica slot state reporting can be done via the
walsender too, just as proposed for the physical case, though no
replica->master reporting would be needed for logical failover.

So despite my initial expectations they can be moderately similar in broad
structure. But I don't think there's going to be much actual code overlap
beyond minor things like both wanting a way to query slot state on the
upstream. Both *could* use decoding on standby to advance slot positions,
but for the physical case that's just a slower (and unfinished) way to do
what we already have, whereas it's necessary for logical failover.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-09-07 04:10:05
Message-ID: CAMsr+YGX_p9M9mj8X1ExWAWVjis6bQHjqPmUVhZA2_KkYYJ0EQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14 August 2017 at 11:56, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:

>
> I don't want to block failover slots on decoding on standby just because
> decoding on standby would be nice to have.
>

However, during discussion with Thomas Munro a point has come up that does
block failover slots as currently envisioned - silent timeline divergence.
It's a solid reason why the current design and implementation is
insufficient to solve the problem. This issue exists both with the original
failover slots and with the model Robert and I were discussing.

Say a decoding client has replayed from master up to commit of xid 42 at
1/1000 and confirmed flush, then a failover slots standby of the master is
promoted. The standby has only received WAL from the failed master up to
1/500 with most recent xid 20. Now the standby does some other new xacts,
pushing xid up to 30 at 1/1000 then continuing to insert until xid 50 at
lsn 1/2000.

Then the logical client reconnects. The logical client will connect to the
failover slot fine, and start replay. But it'll ask for replay to start at
1/1000. The standby will happily fast-forward the slot (as it should), and
start replay after 1/1000.

But now we have silent divergence in timelines. The logical replica has
received and committed xacts 20...42 at lsn 1/500 through 1/1000, but these
are not present on the promoted master. And the replica has skipped over
the new-master's xids 20...30 with lsns 1/500 through 1/1000, so they're
present on the new master but not the replica.

IMO, this shows that not including the timeline in replication origins was
a bit of a mistake, since we'd trivially detect this if they were included
- but it's a bit late now. And anyway, detection would just mean logical
rep would break, which doesn't help much.

The simplest fix, but rather limited, is to require that failover
candidates be in synchronous_standby_names, and delay ReorderBufferCommit
sending the actual commit message until all peers in s_s_n confirm flush of
the commit lsn. But that's not much good if you want sync rep for your
logical connections too, and is generally a hack.
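
To make the "simplest fix" concrete, the kind of wait I mean would sit in
front of sending a decoded commit, roughly like the sketch below.
GetMinFailoverCandidateFlushLSN() is hypothetical, and whether a walsender
can safely block like this is exactly the open question:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"

static void
WaitForFailoverCandidateFlush(XLogRecPtr commit_lsn)
{
    for (;;)
    {
        /* Minimum flush LSN across the failover candidates (hypothetical). */
        XLogRecPtr  min_flush = GetMinFailoverCandidateFlushLSN();

        if (min_flush >= commit_lsn)
            break;

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                         100L, PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}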

A more general solution requires that masters be told which peers are
failover candidates, so they can ensure ordering between logical decoding
and physical failover candidates. Which effectively adds another kind of
sync rep, where we do "wait for physical failover candidates to flush, and
only then allow logical decoding". This actually seems pretty practical
with the design Robert and I discussed, but it's definitely an expansion in
scope.

Alternately, we could require the decoding clients to keep an eye on the
flush/replay positions of all failover candidates and delay commit+confirm
of decoded xacts until the upstream's failover candidates have received and
flushed up to that lsn. That starts to look a lot like a decoding on
standby based model for logical failover, where the downstream maintains
slots on each failover candidate upstream.

So yeah. More work needed here. Even if we suddenly decided the original
failover slots model was OK, it's not sufficient to fully solve the problem.

(It's something I'd thought about for BDR failover, but never applied to failover
slots: the problem of detecting or preventing divergence when the logical
client is ahead of physical receive at the time the physical standby is
promoted.)

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services