Re: unite recovery.conf and postgresql.conf

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: unite recovery.conf and postgresql.conf
Date: 2011-12-14 21:16:56
Message-ID: 4EE91248.8010505@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've submitted two patches for adding new include features to the
postgresql.conf file. While not quite commit quality yet, I hope
everyone will agree their reviews this week suggest both are close
enough that any number of people could finish them off. Before
re-opening this can of worms, I wanted people to be comfortable that we
can expect them to be available as building blocks before 9.2
development is finished. Both of those came out of requests from this
unification thread, and they're a helpful part of what I'd like to propose.

I don't see any path forward here that still expects the recovery.conf
file to function as it used to. The "make replication easy" crowd will
eventually be larger than the pre-9.0 user base, if it isn't already.
And they clearly want no parts of that thing. There's been over a year
of arguing around how to cope with it that will satisfy everyone, so
many messages I couldn't even read them all usefully in our archives and
had to go here:

http://postgresql.1045698.n5.nabble.com/recovery-conf-location-td2854644.html
http://postgresql.1045698.n5.nabble.com/unite-recovery-conf-and-postgresql-conf-td4785717.html

I don't think it's possible. What I would propose is a specification
based on forced failure if there's any sign of recovery.conf, combined
with the simplest migration path we can design to ease upgrades from
older versions. I think we can make the transition easy enough. Users
and tool vendors can make relatively simple changes to support 9.2
without changing everything they're used to just yet--while still being
very clear deprecation has arrived and they should reconsider their
approach. Only bug-compatible levels of backwards compatibility would
make this transparent to them, and there's way too many issues to allow
moving forward that way--a forward path that as far as I can see is
desired by the majority of users, and just as importantly for all of the
potential new users we're impacting with the current mess.

There's another, perhaps under considered, concern I want to highlight
as well. Tom has repeatedly noted that one of the worst problems here
would go away if the "existence means do recovery" nature of
recovery.conf went elsewhere. And we know some packagers want to
separate the necessary to manipulate configuration files from the
database directory, for permissions and management reasons. As Heikki
nicely put it (over a year ago), "You don't want to give monitoring
tools that decide on failover write access to the data directory". Any
information that the user is supplying for the purpose of specifying
things needs to be easy to relocate to a separate config file area,
instead of treating it more like a control file in $PGDATA. Some
chatting this morning with Simon pointed out a second related concern
there, which makes ideas like "specify the path to the recovery.conf
file" infeasible. The data_directory is itself a parameter, so anything
tied to that or a new GUC means that config files specified there we
would need two passes. First identify the data directory, then go back
again to read recovery.conf from somewhere else. And nobody wants to
wander that way. If it doesn't fit cleanly into the existing
postgresql.conf parsing, it's gotta go.

Here's the rough outline of what I think would work here:

-All settings move into the postgresql.conf.

-As of 9.2, relocating the postgresql.conf means there are no user
writable files needed in the data directory.

-Creating a standby.enabled file in the directory that houses the
postgresql.conf (same logic as "include" uses to locate things) puts the
system into recovery mode. That feature needs to save some state, and
making those decisions based on existence of a file is already a thing
we do. Rather than emulating the rename to recovery.done that happens
now, the server can just delete it, to keep from incorrectly returning
to a state it's exited. A UI along the lines of the promote one,
allowing "pg_ctl standby", should fall out of here. I think this is
enough that people who just want to use replication features need never
hear about this file at all, at least during the important to simplify
first run through.

-If startup finds a recovery.conf file where it used to live at,
abort--someone is expecting the old behavior. Hint to RTFM or include a
short migration guide right on the spot. That can have a nice section
about how you might use the various postgresql.conf include* features if
they want to continue managing those files separately. Example: rename
it as replication.conf and use include_if_exists if you want to be able
to rename it to recovery.done like before. Or drop it into a conf.d
directory where the rename will make it then skipped.

-Tools such as pgpool that want to write a simple configuration file,
only touching the things that used to go into recovery.conf, can tell
people to do the same trick. End their postgresql.conf with a call to
\include_if_exists replication.conf as part of setup. While I don't
like pushing problems toward tool vendors, as one I think validating if
this has been done doesn't require the sort of fully GUC compatible
parser people (rightly) want to avoid. A simple scan of the
postgresql.conf looking for the recommended text at the front of a line
could confirm whether that bit is there. And adding a single
"include_if_exists" line to the end of the postgresql.conf is not a
terrible edit job to consider pushing toward tools. None of this is any
more complicated than the little search and replace job that initdb does
right now.

-[Optional: for 9.2 but not forever, report incompatibilities, or
potentially stop working altogether, if the things that used to go into
recovery.conf show up when they aren't going to do anything. That would
match the 'stop if the old way is used' theme, but may introduce its own
new operational issues.]

That's basically it for new ideas. The mechanics of making everything
in recovery.conf turn into a GUC seemed pretty uncontroversial when I
scanned the archives here. Another idea not to forget is getting the
read permissions on the new GUCs correct, like making primary_conninfo
superuser visible only. Seems like some of the restart error handling
might still have a tricky part or two left in it. But if you've
accepted that a clear break from existing behavior is happening, some of
those simplify I think.

If we're moving forward this way for 9.2, we do need to be careful that
a spec is nailed down and people take responsibility for their
respective parts of it. There's a few interlocking parts that won't
work well if partially completed. I marked myself as reviewer for this
patch hoping I could crack the deadlock around the specification. One
open item I'm not going to do is a more serious review of the already
submitted code.

To step back toward the thinking that brought me to here (which has
helped pull Simon a bit closer to the middle of the opinion pack too),
an application facing change to PostgreSQL has serious backward
compatibility requirements. It's really bad if the server runs but will
quietly malfunction, such that you may not notice the change until much
later when a rare code path is hit. But when we're talking about
something that impacts whether the server will start or not, that can be
different. You have a new option. Make sure the break is obvious and
total when it happens: do not function at all if someone tries to
operate the old way. The migration path toward the new one should be as
painless as possible. If people only have to tweak one or two simple
things to get an approximation of the old behavior, provide that, give
migration instructions. But this is a new major version; if people
expected all their old configurations to just move over with zero
changes, that's full of fail in all directions. The best we can do is
try to improve the odds their test migrations will fail early, and make
the path to upgrade as simple and well defined as possible.

(It can be argued that people will not actually run into the proposed
"server blocks if recovery.conf exists" fail-safe during migration
testing. It might hit them only when the first fail-over happens in the
field. Then they've got a production server down and are staring at the
new error message, at best; if they don't see the logs even that may not
be facing them. I would say that someone who migrates a
high-availability system over to a new major version, and doesn't read
the release notes nor do the simplest of fail-over testing, is doomed
regardless of what we do to try and help them.)

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-12-14 21:22:47 Re: Command Triggers
Previous Message Pavel Stehule 2011-12-14 20:52:08 Re: review: CHECK FUNCTION statement