sync rep and smart shutdown

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: sync rep and smart shutdown
Date: 2011-04-08 18:08:17
Message-ID: BANLkTi=W8OrvqLHS+suU8R2b_rhFaqeEaw@mail.gmail.com

There is an open item for synchronous replication and smart shutdown,
with a link to here:

http://archives.postgresql.org/pgsql-hackers/2011-03/msg01391.php

The issue is not straightforward, however, so I want to get some
broader input before proceeding. In short, the problem is that if
synchronous replication is in use, no standbys are connected, and a
smart shutdown is requested, any future commits will wait for a
wake-up that will never come, because by that point postmaster is no
longer accepting connections - thus no standby can reconnect to
release waiters. Or, if there is a standby connected when the smart
shutdown is requested, but it subsequently gets disconnected, it won't
be able to reconnect, and again all waiters will get stuck.
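A minimal sketch of the stuck-waiter scenario described above, written as a toy Python model rather than anything taken from the PostgreSQL source. The class `Primary` and all of its method names are hypothetical, and the model assumes, as a simplification, that a standby releases every waiter the moment it connects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Primary:
    """Toy model of a primary with synchronous replication enabled (hypothetical)."""
    accepting_connections: bool = True
    standby_connected: bool = False
    waiters: List[str] = field(default_factory=list)  # commits waiting for a standby

    def request_smart_shutdown(self) -> None:
        # Per the description above: no new connections are accepted from here on.
        self.accepting_connections = False

    def standby_connect(self) -> bool:
        # A standby can only attach while connections are still accepted.
        if not self.accepting_connections:
            return False
        self.standby_connected = True
        # Simplifying assumption: a connected standby releases all waiters at once.
        self.waiters.clear()
        return True

    def standby_disconnect(self) -> None:
        self.standby_connected = False

    def commit(self, xact: str) -> str:
        # With no standby connected, the commit waits for a wake-up.
        if self.standby_connected:
            return "committed"
        self.waiters.append(xact)
        return "waiting"


# Case 1: no standby connected when smart shutdown is requested.
p1 = Primary()
p1.request_smart_shutdown()
first = p1.commit("t1")
reconnected1 = p1.standby_connect()

# Case 2: standby connected at shutdown time, then lost.
p2 = Primary()
p2.standby_connect()
p2.request_smart_shutdown()
p2.standby_disconnect()
second = p2.commit("t2")
reconnected2 = p2.standby_connect()
```

In both cases the commit ends up in `waiters` and the reconnect attempt is refused, so nothing in the model can ever release it.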

There are a couple of plausible ways to proceed here:

1. Do nothing. If this happens to you, you will need to request fast
or immediate shutdown to get the system unstuck. Since it's pretty
easy for this to happen already anyway (all you need is one connection
to sit open doing nothing), most people probably already have
provision for this and likely wouldn't be terribly inconvenienced by
one more corner case. On the flip side, I would rather that we were
moving in the direction of making it more likely for smart shutdown to
actually shut down the system, rather than less likely.

2. When a smart shutdown is initiated, shut off synchronous
replication. This definitely makes sure you won't get stuck waiting
for sync rep, but on the other hand you probably configured sync rep
because you wanted, uh, sync rep. Or alternatively, continue to allow
sync rep for as long as there is a sync standby connected, but if the
last sync standby drops off then shut it off.

3. Accept new replication connections even when the system is
undergoing a smart shutdown. This is the approach that the
above-linked patch tries to take, and it seems superficially sensible,
but it doesn't really work. Currently, once a shutdown has been
initiated and any on-line backup has been stopped, we stop creating
regular backends; we instead only create dead-end backends that just
return an error message and exit. Once no regular backends remain, we
then stop accepting connections AT ALL and wait for the dead end
backends to drain out. What this patch proposes to do (though it
isn't real clear from the way it's written) is continue creating
regular backends but boot out all but superuser and replication
connections as soon as possible. However, that misses the reason why
the current code works the way that it does: to make sure that even in
the face of a continuing stream of connection requests, we actually
eventually manage to stop talking and shut down. Basically, this
patch would fix the smart-shutdown-sync-rep interaction at the expense
of making smart shutdown considerably more fragile in other cases,
which does not seem like a good trade-off. AFAICT, this whole
approach is doomed to failure.
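The current connection handling described under option 3 can be sketched as a small decision function. This is a hedged Python illustration only: `Action` and `on_connection_request` are hypothetical names, not postmaster code, and the behaviour while an on-line backup is still running is an assumption, since the text only says what happens once the backup has been stopped.

```python
from enum import Enum


class Action(Enum):
    REGULAR_BACKEND = "start a regular backend"
    DEAD_END_BACKEND = "start a dead-end backend that reports an error and exits"
    REFUSE = "do not accept the connection at all"


def on_connection_request(shutdown_requested: bool,
                          online_backup_running: bool,
                          regular_backends: int) -> Action:
    """Decide what to do with an incoming connection (illustrative only)."""
    # Assumption: until shutdown is requested AND any on-line backup has
    # stopped, connections are handled as in normal operation. The real
    # postmaster may be more restrictive during that window.
    if not shutdown_requested or online_backup_running:
        return Action.REGULAR_BACKEND
    # Shutdown under way: only dead-end backends while regular ones remain.
    if regular_backends > 0:
        return Action.DEAD_END_BACKEND
    # No regular backends left: stop accepting connections entirely, so a
    # continuing stream of requests cannot keep the shutdown from finishing.
    return Action.REFUSE
```

The last branch is what guarantees termination: once the regular backends are gone, no incoming request can create new work, which is the property the linked patch would give up.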

Anyone else have an idea or opinion?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync rep and smart shutdown
Date: 2011-04-08 18:38:52
Message-ID: 15982.1302287932@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> There is an open item for synchronous replication and smart shutdown,
> with a link to here:
> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01391.php

> There are a couple of plausible ways to proceed here:

> 1. Do nothing.

> 2. When a smart shutdown is initiated, shut off synchronous
> replication.

> 3. Accept new replication connections even when the system is
> undergoing a smart shutdown.

I agree that #3 is impractical and #2 is a bad idea, which seems to
leave us with #1 (unless anyone has a #4)? This is probably just
something we should figure is going to be one of the rough edges
in the first release of sync rep.

A #4 idea did just come to mind: once we realize that there are no
working replication connections, automatically do a fast shutdown
instead, ie, forcibly roll back those transactions that are never
gonna complete. Or at least have the postmaster bleat about it.
But I'm not sure what it'd take to code that, and am also unsure
that it's something to undertake at this stage of the cycle.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync rep and smart shutdown
Date: 2011-04-08 18:53:17
Message-ID: BANLkTi=MaDDfE_Vmi4t5PpJ1bue3an6sig@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 8, 2011 at 2:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> There is an open item for synchronous replication and smart shutdown,
>> with a link to here:
>> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01391.php
>
>> There are a couple of plausible ways to proceed here:
>
>> 1. Do nothing.
>
>> 2. When a smart shutdown is initiated, shut off synchronous
>> replication.
>
>> 3. Accept new replication connections even when the system is
>> undergoing a smart shutdown.
>
> I agree that #3 is impractical and #2 is a bad idea, which seems to
> leave us with #1 (unless anyone has a #4)?  This is probably just
> something we should figure is going to be one of the rough edges
> in the first release of sync rep.

That's kind of where my mind was headed too, although I was (probably
vainly) hoping for a better option.

> A #4 idea did just come to mind: once we realize that there are no
> working replication connections, automatically do a fast shutdown
> instead, ie, forcibly roll back those transactions that are never
> gonna complete.  Or at least have the postmaster bleat about it.
> But I'm not sure what it'd take to code that, and am also unsure
> that it's something to undertake at this stage of the cycle.

Well, you certainly can't do that. By the time a transaction is
waiting for sync rep, it's too late to roll back; the commit record is
already, and necessarily, on disk. But in theory we could notice that
all of the remaining backends are waiting for sync rep, and switch to
a fast shutdown.
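As a rough sketch of that idea, in Python and with hypothetical names throughout (`next_shutdown_mode` and the state string `"waiting_for_sync_rep"` are illustrative, not PostgreSQL identifiers), the check might look like this:

```python
from typing import Iterable

WAITING = "waiting_for_sync_rep"  # hypothetical label for a backend stuck in sync rep wait


def next_shutdown_mode(current_mode: str, backend_states: Iterable[str]) -> str:
    """Return the shutdown mode to use after inspecting the remaining backends."""
    states = list(backend_states)
    # Only a smart shutdown is a candidate for escalation.
    if current_mode != "smart":
        return current_mode
    # With no backends left, smart shutdown completes on its own.
    if not states:
        return current_mode
    # Escalate only when every remaining backend is waiting for sync rep;
    # their commit records are already on disk, so nothing is rolled back.
    if all(s == WAITING for s in states):
        return "fast"
    return current_mode
```

A single backend still doing real work keeps the mode at "smart", so the escalation only fires when no remaining session could make progress anyway.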

Several people have suggested refinements for smart shutdown in
general, such as switching to fast shutdown after a certain number of
seconds, or having backends exit at the end of the current transaction
(or immediately if idle). Such things would both make this problem
less irksome and increase the overall utility of smart shutdown
tremendously. So maybe it's not worth expending too much effort on it
right now.
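The two refinements mentioned above could be sketched as follows. This is an illustration in Python under stated assumptions, not a proposal for the actual implementation: the function names and the `timeout_s` parameter are hypothetical, and no such setting exists in the discussion above.

```python
def escalate_on_timeout(mode: str, seconds_since_request: float, timeout_s: float) -> str:
    """Switch a smart shutdown to fast once it has run for timeout_s seconds."""
    if mode == "smart" and seconds_since_request >= timeout_s:
        return "fast"
    return mode


def backend_should_exit(shutdown_requested: bool, in_transaction: bool) -> bool:
    """Exit immediately if idle, otherwise only once the current transaction ends.

    Meant to be evaluated whenever a backend is idle or has just finished
    a transaction.
    """
    return shutdown_requested and not in_transaction
```

Either rule bounds how long a smart shutdown can be held up by an idle or stuck session, which is why it would also make the sync rep case less irksome.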

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sync rep and smart shutdown
Date: 2011-04-11 02:56:03
Message-ID: BANLkTimCqyEH3YQtVgFFEbQbA5UTzKEUow@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 9, 2011 at 3:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> There are a couple of plausible ways to proceed here:
>>
>>> 1. Do nothing.
>>
>>> 2. When a smart shutdown is initiated, shut off synchronous
>>> replication.
>>
>>> 3. Accept new replication connections even when the system is
>>> undergoing a smart shutdown.
>>
>> I agree that #3 is impractical and #2 is a bad idea, which seems to
>> leave us with #1 (unless anyone has a #4)?  This is probably just
>> something we should figure is going to be one of the rough edges
>> in the first release of sync rep.
>
> That's kind of where my mind was headed too, although I was (probably
> vainly) hoping for a better option.

Though I proposed #3, I can live with #1 for now. Even if smart shutdown
gets stuck, we can resolve that by requesting fast shutdown or emptying
synchronous_standby_names.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
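The two ways out mentioned in the last message (requesting a fast shutdown, or emptying synchronous_standby_names) can be sketched as the commands an operator might run. The Python helpers below only build the argument lists and the configuration line; the function names and the data directory path are hypothetical, and the sketch assumes the setting is changed by editing postgresql.conf and then reloading the server.

```python
from typing import List


def fast_shutdown_cmd(datadir: str) -> List[str]:
    """pg_ctl invocation that requests a fast shutdown."""
    return ["pg_ctl", "stop", "-D", datadir, "-m", "fast"]


def reload_cmd(datadir: str) -> List[str]:
    """pg_ctl invocation that makes the server re-read its configuration."""
    return ["pg_ctl", "reload", "-D", datadir]


# Line to place in postgresql.conf before reloading, so that no standby
# is treated as synchronous any more.
EMPTY_SYNC_STANDBYS = "synchronous_standby_names = ''"

# Hypothetical data directory, for illustration only.
DATADIR = "/path/to/data"
stop_cmd = fast_shutdown_cmd(DATADIR)
hup_cmd = reload_cmd(DATADIR)
```

The lists can be passed to `subprocess.run` as they are. Which of the two is preferable depends on whether the waiting sessions should be disconnected (fast shutdown) or allowed to finish without waiting for a standby.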