Re: WIP: Failover Slots

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thom Brown <thom(at)linux(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Failover Slots
Date: 2017-08-11 03:03:40
Message-ID: CAMsr+YGbc7vnf+BAEYxbEGG6=4hp=yJ5YXTgZAFm2fqyGa1hew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11 August 2017 at 01:02, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Well,
> anybody's welcome to write code without discussion and drop it to the
> list, but if people don't like it, that's the risk you took by not
> discussing it first.
>

Agreed, patches materializing doesn't mean they should be committed, and
there wasn't prior design discussion on this.

It can be hard to elicit it without a patch, but clearly not always, we're
doing a good job of it here.

> > When a replica connects to an upstream it asks via a new walsender msg
> "send
> > me the state of all your failover slots". Any local mirror slots are
> > updated. If they are not listed by the upstream they are known deleted,
> and
> > the mirror slots are deleted on the downstream.
>
> What about slots not listed by the upstream that are currently in use?
>

Yes, it'll also need to send a list of its local owned and up-mirrored
failover slots to the upstream so the upstream can create them or update
their state.

> There's one big hole left here. When we create a slot on a cascading leaf
> or
> > inner node, it takes time for hot_standby_feedback to propagate the
> needed
> > catalog_xmin "up" the chain. Until the master has set the needed
> > catalog_xmin on the physical slot for the closest branch, the inner
> node's
> > slot's catalog_xmin can only be tentative pending confirmation. That's
> what
> > a whole bunch of gruesomeness in the decoding on standby patch was about.
> >
> > One possible solution to this is to also mirror slots "up", as you
> alluded
> > to: when you create an "owned" slot on a replica, it tells the master at
> > connect time / slot creation time "I have this slot X, please copy it up
> the
> > tree". The slot gets copied "up" to the master via cascading layers with
> a
> > different failover slot type indicating it's an up-mirror. Decoding
> clients
> > aren't allowed to replay from an up-mirror slot and it cannot be promoted
> > like a down-mirror slot can, it's only there for resource retention. A
> node
> > knows its owned slot is safe to actually use, and is fully created, when
> it
> > sees the walsender report it in the list of failover slots from the
> master
> > during a slot state update.
>
> I'm not sure that this actually prevents the problem you describe. It
> also seems really complicated. Maybe you can explain further; perhaps
> there is a simpler solution (or perhaps this isn't as complicated as I
> currently think it is).

It probably sounds more complex than it is. A slot is created tentatively
and marked not ready to actually use yet when created on a standby. It
flows "up" to the master where it's created as permanent/ready. The
permanent/ready state flows back down to the creator.

When we see a temp slot become permanent we copy the
restart_lsn/catalog_xmin/confirmed_flush_lsn from the upstream slot in case
the master had to advance them from our tentative values when it created
the slot. After that, slot state updates only flow "out" from the owner: up
the tree for up-mirror slots, down the tree for down-mirror slots.

Diagram may help. I focused only on the logical slot created on standby
case, since I think we're happy with the rest already and I don't want to
complicate it.

GMail will probably HTMLize this, sorry:

Phys rep Phys rep
using phys using
slot "B" phys slot "C"
+-------+ +--------+ +-------+
T | A <^--------+ B <---------+ C |
I | | | | | |
M +-------+ +--------+ +-------+
E | | |
| | | |CREATEs
| | | |logical slot X
v | | |("owned")
| | |as temp slot
| +<-----------------+
| |Creates upmirror |
| |slot "X" linked |
| |to phys slot "C" |
| |marked temp |
| <----------------+ |
|Creates upmirror | |
<--------------------------+ +-----------------+
|slot "X" linked | | Attempt to
decode from "X" | |
|to phys slot "B" | |
| CLIENT |
|marked permanent | |
+-------------------------> | |
+----------------> | | ERROR: slot X
still being +-----------------+
| |Sees upmirror | created on
master, not ready
| |slot "X" in |
| |list from "A", |
| |marks it |
| |permanent and |
| |copies state |
| +----------------> |
| | |Sees upmirror slot
| | |"X" on "B" got
marked
| | |permanent
(because it
| | |appears in B's
slot
| | |listings),
| | |marks permanent
on C.
| | |Copies state.
| | |
| | |Slot "X" now
persistent
| | |and (when
decoding on standby
| | |supported) can be
used for decoding
| | |on standby.
+ + +

(also avail as
https://gist.github.com/ringerc/d4a8fe97f5fd332d8b883d596d61e257 )

To actually use the slot once decoding on standby is supported: a decoding
client on "C" can consume xacts and cause slot "X" to advance catalog_xmin,
confirmed_flush_lsn, etc. walreceiver on "C" will tell walsender on "B"
about the new slot state, and it'll get synced up-tree, then B will tell A.

Since slot is already marked permanent, state won't get copied back
down-tree, that only happens once when slot is first fully created on
master.

Some node "D" can exist as a phys rep of "C". If C fails and is replace
with D, admin can promote the down-mirror slot on "D" to an owned slot.

Make sense?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-08-11 03:22:07 Re: dubious error message from partition.c
Previous Message Craig Ringer 2017-08-11 02:36:14 Re: Thoughts on unit testing?