Re: Dynamic Shared Memory stuff

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dynamic Shared Memory stuff
Date: 2014-01-21 19:58:21
Message-ID: 20140121195821.GB1929456@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Dec 18, 2013 at 12:21:08PM -0500, Robert Haas wrote:
> On Tue, Dec 10, 2013 at 6:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > The larger point is that such a shutdown process has never in the history
> > of Postgres been successful at removing shared-memory (or semaphore)
> > resources. I do not feel a need to put a higher recoverability standard
> > onto the DSM code than what we've had for fifteen years with SysV shmem.
> >
> > But, by the same token, it shouldn't be any *less* recoverable. In this
> > context that means that starting a new postmaster should recover the
> > resources, at least 90% of the time (100% if we still have the old
> > postmaster lockfile). I think the idea of keeping enough info in the
> > SysV segment to permit recovery of DSM resources is a good one. Then,
> > any case where the existing logic is able to free the old SysV segment
> > is guaranteed to also work for DSM.
>
> So I'm taking a look at this. There doesn't appear to be anything
> intrinsically intractable here, but it seems important to sequence the
> operations so as to minimize the chances that things fail in
> awkward places. The point where we remove the old shared memory
> segment from the previous postmaster invocation is here:
>
> /*
>  * The segment appears to be from a dead Postgres process, or from a
>  * previous cycle of life in this same process.  Zap it, if possible.
>  * This probably shouldn't fail, but if it does, assume the segment
>  * belongs to someone else after all, and continue quietly.
>  */
> shmdt(memAddress);
> if (shmctl(shmid, IPC_RMID, NULL) < 0)
>     continue;
>
> My first thought was to remember the control segment ID from the
> header just before the shmdt() there, and then later when the DSM
> module is starting, do cleanup. But that doesn't seem like a good
> idea, because then if startup fails after we remove the old shared
> memory segment and before we get to the DSM initialization code, we'll
> have lost the information on what control segment needs to be cleaned
> up. A subsequent startup attempt won't see the old shm again, because
> it's already gone. I'm fairly sure that this would be a net reduction
> in reliability vs. the status quo.
>
> So now what I'm thinking is that we ought to actually perform the DSM
> cleanup before detaching the old segment and trying to remove it.
> That shouldn't fail, but if it does, or if we get killed before
> completing it, the next run will hopefully find the same old shm and
> finish the cleanup. But that kind of flies in the face of the comment
> above: if we perform DSM cleanup and then discover that the segment
> wasn't ours after all, that means we just stomped all over somebody
> else's stuff. That's not too good. But trying to remove the segment
> first and then perform the cleanup creates a window where, if we get
> killed, the next restart will have lost information about how to
> finish cleaning up. So it seems that some kind of logic tweak is
> needed here, but I'm not sure exactly what. As I see it, the options
> are:
>
> 1. Make failure to remove the shared memory segment we thought was
> ours an error. This will definitely show up any problems, but only
> after we've burned down some other process's dynamic shared memory
> segments. The most likely outcome is creating failure-to-start
> problems that don't exist today.
>
> 2. Make it a warning. We'll still burn down somebody else's DSMs, but
> at least we'll still start ourselves. Sadly, "WARNING: You have just
> done a bad thing. It's too late to fix it. Sorry!" is not very
> appealing.

It has long been the responsibility of PGSharedMemoryCreate() to determine
that a segment is unimportant before calling IPC_RMID. The success or failure
of IPC_RMID is an unreliable guide to the correctness of that determination.
IPC_RMID will succeed on an important segment owned by the same UID, and it
will fail if some other process removed the segment after our shmat(). Such a
failure does not impugn our having requested DSM cleanup on the basis of the
PGShmemHeader we did read, so an apologetic WARNING would be wrong.
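For illustration, here is a standalone toy (not from the tree; it shows only
the System V semantics at issue):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int
    main(void)
    {
        /* IPC_RMID checks ownership/permissions, nothing else. */
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

        if (shmid < 0)
            return 1;

        /*
         * Succeeds because we own the segment; the kernel never asks
         * whether anything still depends on it.
         */
        if (shmctl(shmid, IPC_RMID, NULL) == 0)
            printf("removed %d with no notion of importance\n", shmid);

        /*
         * Fails with EINVAL, just as it would if some other process had
         * removed the id after our shmat() -- telling us nothing about
         * the validity of a header we had already read.
         */
        if (shmctl(shmid, IPC_RMID, NULL) < 0)
            perror("second IPC_RMID");
        return 0;
    }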

> 3. Remove the old shared memory segment first, then perform the
> cleanup immediately afterwards. If we get killed before completing
> the cleanup, we'll leak the un-cleaned-up stuff. Decide that's OK and
> move on.

The period in which process exit would leak segments is narrow enough that
this seems fine. I see no net advantage over performing the cleanup before
shmdt(), though.

> 4. Adopt some sort of belt-and-suspenders approach, keeping the state
> file we have now and backstopping it with this mechanism, so that we
> really only need this to work when $PGDATA has been blown away and
> recreated. This seems pretty inelegant, and I'm not sure who it'll
> benefit other than those (few?) people who kill -9 the postmaster (or
> it segfaults or otherwise dies without running the code to remove
> shared memory segments) and then remove $PGDATA and then re-initdb and
> then expect cleanup to find the old DSMs.

Independent of the choice we make here, I would keep the state file. POSIX
permits shm_open() segments to survive reboots, so the state file would come
into play after crashes on such systems. Also, folks who use "ipcrm" after a
"kill -9" of the postmaster would benefit from the state file. Nobody should
do that, but hey, catering to things nobody should do is what brought us here.

An option I had envisioned was to make PGSharedMemoryCreate() create a state
file based on the old sysv segment. Then the later dsm_postmaster_startup()
would proceed normally, and a restart after a failure between shmdt() and
dsm_postmaster_startup() would also do the cleanup. A wrinkle comes from the
fact that an existing state file could reference a different control segment;
this would happen if we picked a different sysv shm key than the previous
postmaster start did. In that case, we should presumably clean both control
segments. Performing the cleanup per #1/#2 achieves that.
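
In code, the idea was roughly this, at the point in PGSharedMemoryCreate()
quoted above. (dsm_write_state_file() is a name I'm inventing here, and I'm
assuming the old segment's header carries the control segment handle as
hdr->dsm_control; only the ordering matters.)

    /*
     * Sketch only: persist the old control segment's handle before we
     * zap the sysv segment, so that a failure between here and
     * dsm_postmaster_startup() still leaves enough state behind to
     * finish the cleanup on the next startup attempt.
     */
    if (hdr->dsm_control != 0)
        dsm_write_state_file(hdr->dsm_control); /* hypothetical */

    shmdt(memAddress);
    if (shmctl(shmid, IPC_RMID, NULL) < 0)
        continue;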

> 5. Give up on this approach. We could keep what we have now, or make
> the DSM control segment land at a well-known address as we do for the
> main segment.

How would having the DSM control segment at a well-known address affect the
problem at hand? Did you mean a well-known dsm_handle?

> What do people prefer?

I recommend performing cleanup on the control segment named in PGShmemHeader
just before shmdt() in PGSharedMemoryCreate(). No new ERROR or WARNING sites
are necessary. Have dsm_postmaster_startup() continue to perform a cleanup on
the control segment named in the state file.
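
Sketched against the code quoted above (dsm_cleanup_using_control_segment()
is an illustrative name, and I'm again assuming the handle lives in
PGShmemHeader):

    /*
     * Sketch: while still attached, destroy whatever DSM segments the
     * old control segment references, then zap the sysv segment as
     * before.  A later IPC_RMID failure doesn't call the cleanup into
     * question, so no new ERROR or WARNING is needed.
     */
    if (hdr->dsm_control != 0)
        dsm_cleanup_using_control_segment(hdr->dsm_control);

    shmdt(memAddress);
    if (shmctl(shmid, IPC_RMID, NULL) < 0)
        continue;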

> The other thing I'm a bit squidgy about is injecting more code that
> runs before we get the new shared memory segment up and running.
> During that time, we don't have effective mutual exclusion, so it's
> possible that someone else could be trying to do cleanup at the same
> time we are. That should be OK as far as the DSM code itself is
> concerned, but there may be other downsides to lengthening the period
> of time that's required to get mutual exclusion in place that I should
> be worrying about.

I see no general need to avoid small increases to that time period.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
