Re: "could not reattach to shared memory" captured in buildfarm

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org, Dave Page <dpage(at)postgreSQL(dot)org>
Subject: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-02 15:21:21
Message-ID: 15386.1241277681@sss.pgh.pa.us

vaquita has an interesting report today:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=vaquita&dt=2009-05-01%2020:00:06

Partway through the contrib tests, for absolutely no visible reason
whatsoever, connections start to fail with
FATAL: could not reattach to shared memory (key=364, addr=02920000): 487

We've certainly heard more than a couple of field reports of this from
Windows users, but I don't think we've ever seen it in the buildfarm
before. (I don't see any similar instances in vaquita's history, anyway.)

I assume vaquita's configuration hasn't changed recently (Dave?)
so this seems to put the lie to the theory we've taken refuge in
that it's caused by bad antivirus software. I don't see that it
gets us any closer to a solution though.

regards, tom lane


From: Dave Page <dpage(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-02 17:01:07
Message-ID: 937d27e10905021001k43c524c1wfc476934626aca5f@mail.gmail.com

On Sat, May 2, 2009 at 4:21 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I assume vaquita's configuration hasn't changed recently (Dave?)
> so this seems to put the lie to the theory we've taken refuge in
> that it's caused by bad antivirus software.  I don't see that it
> gets us any closer to a solution though.

Well, there's a bit of a story there. Vaquita and Baiji are both the
same Vista machine running on VMware Server. About a month back, for
what seemed like no reason, the guest VM started running at much
higher speed than it should - animated cursors started running at
double speed, double-clicking became impossible and the clock started
gaining significant amounts of time - to the extent that buildfarm
runs were rejected by the server because the finish time was in the
future.

I believe I finally fixed this on Friday - from what I can tell, it
looks like the Java self-update applet was causing the clock rate on
the host to be raised to 1000/1024Hz (this can be done using the
multimedia API). This in turn was apparently upsetting VMware. Anyway,
long story short, I removed the JVM from the host and everything appears
to have returned to normal. Nothing has changed in the config of the
VM itself, though a couple of minor tweaks were made to the VMware
configuration - but they were clock-related.

--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Dave Page <dpage(at)postgreSQL(dot)org>
Subject: Re: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-04 08:29:12
Message-ID: 49FEA758.80605@hagander.net

Tom Lane wrote:
> vaquita has an interesting report today:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=vaquita&dt=2009-05-01%2020:00:06
>
> Partway through the contrib tests, for absolutely no visible reason
> whatsoever, connections start to fail with
> FATAL: could not reattach to shared memory (key=364, addr=02920000): 487

Note that 487 is "invalid address", and should not have anything to do
with the issues Andrew mentioned (which were about the already-exists
error).
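
For reference, here is a minimal sketch of how the two error codes relate
to the failing step. It assumes the reattach is a fixed-address
MapViewOfFileEx() call, which is what the error message suggests. The helper
names win32_error_name() and try_reattach() are hypothetical, and this is
not the actual PostgreSQL source.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Symbolic names for the two Win32 error codes discussed in this thread. */
static const char *
win32_error_name(unsigned long code)
{
    switch (code)
    {
        case 183:
            return "ERROR_ALREADY_EXISTS";
        case 487:
            return "ERROR_INVALID_ADDRESS";
        default:
            return "(other)";
    }
}

#ifdef _WIN32
/*
 * Hypothetical helper: map an existing file-mapping object at a fixed
 * address.  Returns 0 on success, otherwise the GetLastError() code.
 * If anything already occupies the requested range in this process,
 * the call fails; 487 is the code reported in the buildfarm log.
 */
static unsigned long
try_reattach(HANDLE hmap, void *addr)
{
    void       *view;

    assert(addr != NULL);
    view = MapViewOfFileEx(hmap, FILE_MAP_READ | FILE_MAP_WRITE,
                           0, 0, 0, addr);
    if (view == NULL)
        return GetLastError();
    return 0;
}
#endif
```

With this, a failure line could be logged as win32_error_name(487), which
makes the distinction from the already-exists case (183) visible at a glance.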

Somebody else mentioned, and IIRC I talked to Dave about this before,
that this could be because the address is no longer available. The
reason for this could be some kind of race condition in the backends
starting - the address is available when the postmaster starts and thus
it's used, but when a regular backend starts, the memory is used for
something else.

One proposed fix is to allocate a fairly large block of memory in the
postmaster just before we get the shared memory, and then free it right
away. The effect should be to push down the shared memory segment
further in the address space.
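
A rough sketch of what that proposal amounts to, for discussion only. The
function name reserve_and_release() is hypothetical, the size passed in is
entirely up to the caller (nobody in this thread knows what value, if any,
would help), and the non-Windows branch exists only so the sketch compiles
elsewhere. Whether this changes where the shared memory segment lands is
untested.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on some libcs */
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

/*
 * Reserve `size` bytes of address space (nothing is committed) and
 * release them again right away.  Returns 1 on success, 0 on failure.
 */
static int
reserve_and_release(size_t size)
{
    void       *p;

    assert(size > 0);
#ifdef _WIN32
    p = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    if (p == NULL)
        return 0;
    /* MEM_RELEASE requires a size of 0 and the original base address. */
    return VirtualFree(p, 0, MEM_RELEASE) ? 1 : 0;
#else
    p = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 0;
    return munmap(p, size) == 0 ? 1 : 0;
#endif
}
```

The postmaster would call something like this immediately before creating
the shared memory segment.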

Comments?

//Magnus


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Dave Page <dpage(at)postgresql(dot)org>
Subject: Re: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-04 12:57:35
Message-ID: 27656.1241441855@sss.pgh.pa.us

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> Somebody else mentioned, and IIRC I talked to Dave about this before,
> that this could be because the address is no longer available. The
> reason for this could be some kind of race condition in the backends
> starting - the address is available when the postmaster starts and thus
> it's used, but when a regular backend starts, the memory is used for
> something else.

How is it no longer available, when the new backend is a brand new
process? The "race condition" bit seems even sillier --- if there
are multiple backends starting, they're each an independent process.

regards, tom lane


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Dave Page <dpage(at)postgresql(dot)org>
Subject: Re: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-04 14:17:27
Message-ID: 49FEF8F7.9040809@hagander.net

Tom Lane wrote:
> Magnus Hagander <magnus(at)hagander(dot)net> writes:
>> Somebody else mentioned, and IIRC I talked to Dave about this before,
>> that this could be because the address is no longer available. The
>> reason for this could be some kind of race condition in the backends
>> starting - the address is available when the postmaster starts and thus
>> it's used, but when a regular backend starts, the memory is used for
>> something else.
>
> How is it no longer available, when the new backend is a brand new
> process? The "race condition" bit seems even sillier --- if there
> are multiple backends starting, they're each an independent process.

Because some other DLL that was loaded on process startup allocated
memory differently - in a different order, with a different size, or
something like that.

I didn't mean race condition between backends. I meant against a
potential other thread started by a loaded DLL for initialization.
(Again, things like antivirus are known to do this, and we do see these
issues more often if AV is present for example)

//Magnus


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Dave Page <dpage(at)postgresql(dot)org>
Subject: Re: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-05 01:38:10
Message-ID: 20090505013810.GD3476@alvh.no-ip.org

Magnus Hagander wrote:

> I didn't mean race condition between backends. I meant against a
> potential other thread started by a loaded DLL for initialization.
> (Again, things like antivirus are known to do this, and we do see these
> issues more often if AV is present for example)

I don't understand this. How can memory allocated by a completely separate
process affect what happens to a backend? I mean, if an antivirus is running,
surely it does not run on the backend's process? Or does it?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Dave Page <dpage(at)postgresql(dot)org>
Subject: Re: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-05 09:53:40
Message-ID: 4A000CA4.9@hagander.net

Alvaro Herrera wrote:
> Magnus Hagander wrote:
>
>> I didn't mean race condition between backends. I meant against a
>> potential other thread started by a loaded DLL for initialization.
>> (Again, things like antivirus are known to do this, and we do see these
>> issues more often if AV is present for example)
>
> I don't understand this. How can memory allocated by a completely separate
> process affect what happens to a backend? I mean, if an antivirus is running,
> surely it does not run on the backend's process? Or does it?

Anti[something] software regularly injects code into other processes,
yes. Either by creating a thread in the process using
CreateRemoteThread() or by using techniques similar to LD_PRELOAD.

//Magnus


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Dave Page <dpage(at)postgresql(dot)org>
Subject: Re: "could not reattach to shared memory" captured in buildfarm
Date: 2009-05-05 22:15:21
Message-ID: 13271.1241561721@sss.pgh.pa.us

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> One proposed fix is to allocate a fairly large block of memory in the
> postmaster just before we get the shared memory, and then free it right
> away. The effect should be to push down the shared memory segment
> further in the address space.

I have no enthusiasm for doing something like this when we have so
little knowledge of what's actually happening. We have *no* idea
whether the above could help, or what size of allocation to request.
It's not very hard to imagine that the wrong size choice could make
things worse rather than better.

It seems to me that what we ought to do now is make a serious effort
to gather more data. I came across a suggestion that one could use
VirtualQuery() to generate a map of the process address space
under Windows. I suggest that we add some code that is executed
if the reattach attempt fails and dumps the process address space
details to the postmaster log. Dumping the postmaster's address
space at the time it successfully creates the shmem segment might
be useful for comparison, too.

(A quick look at the VirtualQuery spec indicates that you can't tell
very much beyond free/allocated status, though. Maybe there's some
other call that would tell more? It'd be really good if we could get
the names of DLLs occupying memory ranges, for example.)
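
A minimal sketch of the kind of dump meant here, assuming a plain
VirtualQuery() walk from address zero upward. The function names
region_state_name() and dump_address_space() are hypothetical. The
GetModuleFileNameA() lookup for image-backed regions is one possible way
to get DLL names and is an assumption, not something verified against the
failing machines. The non-Windows branch is only a stub so the code
compiles elsewhere.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Translate the State field of MEMORY_BASIC_INFORMATION. */
static const char *
region_state_name(unsigned long state)
{
    switch (state)
    {
        case 0x1000:            /* MEM_COMMIT */
            return "commit";
        case 0x2000:            /* MEM_RESERVE */
            return "reserve";
        case 0x10000:           /* MEM_FREE */
            return "free";
        default:
            return "unknown";
    }
}

/*
 * Write one line per region of the current process's address space.
 * Returns the number of regions written, or -1 when not built for Windows.
 */
static int
dump_address_space(FILE *out)
{
    assert(out != NULL);
#ifdef _WIN32
    {
        MEMORY_BASIC_INFORMATION mbi;
        unsigned char *addr = NULL;
        int         regions = 0;

        while (VirtualQuery(addr, &mbi, sizeof(mbi)) != 0)
        {
            unsigned char *end = (unsigned char *) mbi.BaseAddress + mbi.RegionSize;
            char        module[MAX_PATH] = "";

            /* Assumption: for image-backed regions the allocation base
             * doubles as the module handle of the owning EXE or DLL. */
            if (mbi.Type == MEM_IMAGE)
                GetModuleFileNameA((HMODULE) mbi.AllocationBase,
                                   module, sizeof(module) - 1);

            fprintf(out, "%p-%p %s %s\n", mbi.BaseAddress, (void *) end,
                    region_state_name(mbi.State), module);
            regions++;
            if (end <= addr)
                break;          /* address wrapped around */
            addr = end;
        }
        return regions;
    }
#else
    return -1;
#endif
}
```

Calling dump_address_space(stderr) once in the postmaster after the segment
is created, and again in a backend whose reattach fails, would give the two
maps to compare.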

regards, tom lane