Re: FATAL: could not reattach to shared memory (Win32)

From: "Trevor Talbot" <quension(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Magnus Hagander" <magnus(at)hagander(dot)net>, "Shelby Cain" <alyandon(at)yahoo(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Terry Yapt" <yapt(at)technovell(dot)com>
Subject: Re: FATAL: could not reattach to shared memory (Win32)
Date: 2007-08-26 15:06:52
Message-ID: 90bce5730708260806o2b8afa60q3dd5a33567e2848a@mail.gmail.com
Lists: pgsql-general

On 8/24/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Trevor Talbot" <quension(at)gmail(dot)com> writes:
> > On 8/23/07, Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> >> Not that wild a guess, really :-) I'd say it's a very good possibility -
> >> but I have no idea why it'd do that, since all backends load the same
> >> DLLs at that stage.
>
> > Not a valid assumption; you can't rely on consistent VM space among
> > multiple [non-cloned] processes without a serious amount of effort.
>
> I'm not sure if you have a specific technical meaning of "clone" in mind
> here, but these processes are all executing the identical executable,
> and taking care to map the shmem early in execution *before* they load
> any DLLs. So it should work. Apparently, it *does* work for awhile for
> the OP, and then stops working, which is even odder.

"Clone" in the same sense as fork(): duplicating a process instead of
regenerating it. Even ignoring things like DLL replacement and
LD_PRELOAD-style options, there's still a lot of opportunity for
dynamic behavior. All DLLs have an initialization routine called by
the loader (and on thread creation), which tends to be used to set up
things you don't want the caller to have to explicitly initialize.
DLLs that maintain global state they share with copies of themselves
in other processes can set up shared memory, etc., to do that. They can
easily change their behavior based on the environment at the time of
process start.

There are also all the hooks for extension points, such as Winsock
LSPs. Most such things happen only after an explicit initialization
(e.g. WSAStartup() or socket creation in the Winsock case), but
between the C runtime and third-party libraries, it may be happening
when you don't expect it.

All that said, I don't actually have a real-world example of process
VM layout changing like this, especially since you map the shared
memory early to avoid this very problem. I'd love to find out exactly
what's going
on in Terry's case, but I haven't come up with a good way to do it
that doesn't disturb his production environment.

> If you've got a specific suggestion for making it more reliable,
> we're all ears.

To elaborate on what I said earlier, internal_forkexec() creates the
process suspended; while it has an execution environment set up, the
loader hasn't done all the DLL linking and initialization yet, so the
address space is relatively untouched. At that point you could use
VirtualAllocEx() to reserve VM space for the shared memory at the
right address, and proceed with the rest of the setup. When the new
backend starts up, it would then VirtualFree() that space immediately
before calling MapViewOfFileEx() on it.
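
A minimal sketch of that sequence in C, under stated assumptions rather
than as a tested patch. The helper names (reserve_in_child, reattach_at,
reserve_placeholder) are hypothetical and do not come from the
PostgreSQL tree. The Win32 branch uses only the calls named above. The
#else branch is a POSIX stand-in using mmap()/munmap(), included only so
the same reserve, release, map-at-same-address pattern can be compiled
and tried off Windows.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Parent side, run while the child is still suspended: reserve (do not
 * commit) the range so nothing else can be placed there.
 * Returns 1 if the reservation landed exactly at addr. */
static int reserve_in_child(HANDLE child, void *addr, size_t size)
{
    return VirtualAllocEx(child, addr, size,
                          MEM_RESERVE, PAGE_NOACCESS) == addr;
}

/* Child side: drop the placeholder, then immediately map the segment
 * at the same address. Returns the mapped address, or NULL on failure. */
static void *reattach_at(HANDLE mapping, void *addr)
{
    if (!VirtualFree(addr, 0, MEM_RELEASE))
        return NULL;
    return MapViewOfFileEx(mapping, FILE_MAP_READ | FILE_MAP_WRITE,
                           0, 0, 0, addr);
}

#else
#include <sys/mman.h>

/* POSIX stand-in (assumption, not part of the proposal): reserve an
 * inaccessible range and let the kernel choose the address. */
static void *reserve_placeholder(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* POSIX stand-in: release the placeholder, then map usable shared
 * memory at exactly the same address. */
static void *reattach_at(void *addr, size_t size)
{
    void *p;

    if (munmap(addr, size) != 0)
        return NULL;
    p = mmap(addr, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
#endif
```

On Win32 the parent would call reserve_in_child() on the suspended
process before resuming it, and the new backend would call
reattach_at() as early as it can. There is still a small window between
the release and the map, so this narrows the problem rather than
provably closing it.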

I can probably set up with the 8.3 tree and MSVC to create an
artificial failure, and play with the above as a fix, but I'm not
quite sure when that will be. There's still the issue of verifying it
is the problem on Terry's machine, and figuring out a fix for him.

On 8/24/07, Terry Yapt <yapt(at)technovell(dot)com> wrote:

> Yes, the windows system log (application log section) doesn't show any
> error in several days. Suddenly errors bring back to life and syslog
> errors repeats every few time. But again errors disappears and return
> in a few hours. After few hours the system goes out.
>
> Curiosity:
> ======
> On the log lines I have and I sent to the list: * FATAL: could not
> reattach to shared memory (key=5432001, addr=01D80000): Invalid argument
> , this one: "addr=01D80000" is always the same in spite of the system
> have been shutting down and restarted or the error was out for a days.

The environment is consistent then. Whatever is going on, things are
normal when postgres first starts; something changes later, and the
change is temporary. As vague guides, I would look at some kind of
global resource usage/tracking, and scheduled tasks. Do you see any
patterns in WHEN this happens? During high load periods? Any
antivirus or other security-type tasks running on the machine? Any
third-party VPN-type software? Fast User Switching or Remote Desktop
use?
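
To check across a longer stretch of log that the address really never
changes, something like the following C helper could pull the key and
address out of each failure line. It is only a sketch: the function
name parse_reattach_failure is hypothetical, and it assumes the message
appears on a single line in exactly the form quoted above.

```c
#include <stdio.h>
#include <string.h>

/* Extract key and addr from a log line containing
 * "could not reattach to shared memory (key=..., addr=...)".
 * addr is read as hexadecimal. Returns 1 on success, 0 otherwise. */
static int parse_reattach_failure(const char *line,
                                  unsigned long *key,
                                  unsigned long *addr)
{
    const char *p = strstr(line,
                           "could not reattach to shared memory (key=");

    if (p == NULL)
        return 0;
    p = strchr(p, '(');   /* cannot be NULL: the match above contains '(' */
    return sscanf(p, "(key=%lu, addr=%lx)", key, addr) == 2;
}
```

Running it over every FATAL line and comparing the addr values would
show whether any failure ever reports a different address.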
