From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: dynamic shared memory
Date: 2013-08-14 01:09:06
Message-ID: CA+TgmoaDqDUgt=4Zs_QPOnBt=EstEaVNP+5t+m=FPNWshiPR3A@mail.gmail.com
Lists: pgsql-hackers

Please find attached a first version of a patch to allow additional
"dynamic" shared memory segments; that is, shared memory segments that
are created after server startup, live for a period of time, and are
then destroyed when no longer needed. The main purpose of this patch
is to facilitate parallel query: if we've got multiple backends
working on the same query, they're going to need a way to communicate.
Doing that through the main shared memory segment seems infeasible
because we could, for some applications, need to share very large
amounts of data. For example, for internal sort, we basically load
the data to be sorted into memory and then rearrange an array of
pointers to the items being sorted. For parallel internal sort, we
might want to do much the same thing, but with different backend
processes manipulating different parts of the array. I'm not exactly
sure how that's going to work out yet in detail, but it seems fair to
say that the amount of data we want to share between processes there
could be quite a bit larger than anything we'd feel comfortable
nailing down in the permanent shared memory segment. Other cases,
like parallel sequential scan, might require much smaller buffers,
since there might not be much point in letting the scan get too far
ahead if nothing's consuming the tuples it produces. With this
infrastructure, we can choose at run-time exactly how much memory to
allocate for a particular purpose and return it to the operating
system as soon as we're done with it.

Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type. Since we
currently require System V shared memory to be supported on all
platforms other than Windows, I have included a System V
implementation (shmget, shmctl, shmat, shmdt). However, as we know,
on many systems, System V shared memory limits are often low out of
the box and raising them is an annoyance for users. Therefore, I've
included an implementation based on POSIX shared memory facilities
(shm_open, shm_unlink), which is the default on systems where those
facilities are supported (some of the BSDs do not, I believe). We
will also need a Windows implementation, which I have not attempted,
but one of my colleagues at EnterpriseDB will be filling in that gap.

In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging. On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments. Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.
A third possible reason to use this implementation is
compartmentalization. For example, you can put the directory that
stores the dynamic shared memory segments on a RAM disk - which
removes the performance concern - and then do whatever you like with
that directory: secure it, put filesystem quotas on it, or sprinkle
magic pixie dust on it. It doesn't even seem out of the question that
there might be cases where there are multiple RAM disks present with
different performance characteristics (e.g. on NUMA machines) and this
would provide fine-grained control over where your shared memory
segments get placed. To make a long story short, I won't be crushed
if the consensus is against including this, but I think it's useful.

Other implementations are imaginable but not implemented here. For
example, you can imagine using the mmap() of an anonymous file.
However, since the point is that these segments are created on the fly
by individual backends and then shared with other backends, that gets
a little tricky. In order for the second backend to map the same
anonymous shared memory segment that the first one mapped, you'd have
to pass the file descriptor from one process to the other. There are
ways, on most if not all platforms, to pass file descriptors through
sockets, but there's not automatically a socket connection between the
two processes either, so it gets hairy to think about making this
work. I did, however, include a "none" implementation which has the
effect of shutting the facility off altogether.

The actual implementation is split up into two layers. dsm_impl.c/h
encapsulate the implementation-dependent functionality at a very raw
level, while dsm.c/h wrap that functionality in a more palatable API.
Most of that wrapper layer is concerned with just one problem:
avoiding leaks. This turned out to require multiple levels of
safeguards, which I duly implemented. First, dynamic shared memory
segments need to be reference-counted, so that when the last mapping
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that). If a
backend is terminated uncleanly, the postmaster needs to remove all
leftover segments during the crash-and-restart process, just as it
needs to reinitialize the main shared memory segment. And if all
processes are terminated uncleanly, the next postmaster startup needs
to clean up any segments that still exist, again just as we already do
for the main shared memory segment. Neither POSIX shared memory nor
System V shared memory provides an API for enumerating all existing
shared memory segments, so we must keep track ourselves of what we
have outstanding. Second, we need to ensure, within the scope of an
individual process, that we only retain a mapping for as long as
necessary. Just as memory contexts, locks, buffer pins, and other
resources automatically go away at the end of a query or
(sub)transaction, dynamic shared memory mappings created for a purpose
such as parallel sort need to go away if we abort mid-way through. Of
course, if you have a user backend coordinating with workers, it seems
pretty likely that the workers are just going to exit if they hit an
error, so having the mapping be process-lifetime wouldn't necessarily
be a big deal; but the user backend may stick around for a long time
and execute other queries, and we can't afford to have it accumulate
mappings, not least because that's equivalent to a session-lifespan
memory leak.

To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly. If the whole operating system is rebooted, the
old control segment won't exist any more, but that's OK, because none
of the other segments will either - except under the
mmap-a-regular-file implementation, which handles cleanup by scanning
the relevant directory rather than relying on the control segment.
These precautions seem sufficient to ensure that dynamic shared memory
segments can't survive the postmaster itself short of a hard kill, and
that even after a hard kill we'll clean things up on a subsequent
postmaster startup. The other problem, of making sure that segments
get unmapped at the proper time, is solved using the resource owner
mechanism. There is an API to create a mapping which is
session-lifespan rather than resource-owner lifespan, but the default
is resource-owner lifespan, which I suspect will be right for common
uses. Thus, there are four separate occasions on which we remove
shared memory segments: (1) resource owner cleanup, (2) backend exit
(for any session-lifespan mappings and anything else that slips
through the cracks), (3) postmaster exit (in case a child dies without
cleaning itself up), and (4) postmaster startup (in case the
postmaster dies without cleaning up).

There are quite a few problems that this patch does not solve. First,
while it does give you a shared memory segment, it doesn't provide you
with any help at all in figuring out what to put in that segment. The
task of figuring out how to communicate usefully through shared memory
is thus, for the moment, left entirely to the application programmer.
While there may be cases where that's just right, I suspect there will
be a wider range of cases where it isn't, and I plan to work on some
additional facilities, sitting on top of this basic structure, next,
though probably as a separate patch. Second, it doesn't make any
policy decisions about what is sensible either in terms of number of
shared memory segments or the sizes of those segments, even though
there are serious practical limits in both cases. Actually, the total
number of segments system-wide is limited by the size of the control
segment, which is sized based on MaxBackends. But there's nothing to
keep a single backend from eating up all the slots, even though that's
both pretty unfriendly and unportable, and there's no real limit to
the amount of memory it can gobble up per slot, either. In other
words, it would be a bad idea to write a contrib module that exposes a
relatively uncooked version of this layer to the user.

But, just for testing purposes, I did just that. The attached patch
includes contrib/dsm_demo, which lets you say
dsm_demo_create('something') in one session, and if you pass the return
value to dsm_demo_read() in the same or another session during the
lifetime of the first session, you'll read back the same value you
saved. This is not, by any stretch of the imagination, a
demonstration of the right way to use this facility - but as a crude
unit test, it suffices. Although I'm including it in the patch file,
I would anticipate removing it before commit. Hopefully, with a
little more functionality on top of what's included here, we'll soon
be in a position to build something that might actually be useful to
someone, but this layer itself is a bit too impoverished to build
something really cool, at least not without more work than I wanted to
put in as part of the development of this patch.

Using that crappy contrib module, I verified that the POSIX, System V,
and mmap implementations all work on my MacBook Pro (OS X 10.8.4) and
on Linux (Fedora 16). I wouldn't like to have to wager on having
gotten all of the details right to be absolutely portable everywhere,
so I wouldn't be surprised to see this break on other systems.
Hopefully that will be a matter of adjusting the configure tests a bit
rather than coping with substantive changes in available
functionality, but we'll see.

Finally, I'd like to thank Noah Misch for a lot of discussion and
thought that enabled me to make this patch much better than it
otherwise would have been. Although I didn't adopt Noah's preferred
solutions to all of the problems, and although there are probably
still some problems buried here, there would have been more if not for
his advice. I'd also like to thank the entire database server team at
EnterpriseDB for allowing me to dump large piles of work on them so
that I could work on this, and my boss, Tom Kincaid, for not allowing
other people to dump large piles of work on me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
dynshmem-v1.patch application/octet-stream 78.3 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-08-27 14:07:33
Message-ID: 20130827140733.GD24807@alap2.anarazel.de
Lists: pgsql-hackers

Hi Robert,

[just sending an email which sat in my outbox for two weeks]

On 2013-08-13 21:09:06 -0400, Robert Haas wrote:
> ...

Nice to see this coming. I think it will actually be interesting for
quite some things outside parallel query, but we'll see.

I've not yet looked at the code, so I just have some high-level comments
so far.

> To help solve these problems, I invented something called the "dynamic
> shared memory control segment". This is a dynamic shared memory
> segment created at startup (or reinitialization) time by the
> postmaster before any user processes are created. It is used to store a
> list of the identities of all the other dynamic shared memory segments
> we have outstanding and the reference count of each. If the
> postmaster goes through a crash-and-reset cycle, it scans the control
> segment and removes all the other segments mentioned there, and then
> recreates the control segment itself. If the postmaster is killed off
> (e.g. kill -9) and restarted, it locates the old control segment and
> proceeds similarly.

That way any corruption in that area will prevent restarts without
reboot unless you use ipcrm, or such, right?

> Creating a shared memory segment is a somewhat operating-system
> dependent task. I decided that it would be smart to support several
> different implementations and to let the user choose which one they'd
> like to use via a new GUC, dynamic_shared_memory_type.

I think we want that during development, but I'd rather not go there
when releasing. After all, we don't support a manual choice between
anonymous mmap/sysv shmem either.

> In addition, I've included an implementation based on mmap of a plain
> file. As compared with a true shared memory implementation, this
> obviously has the disadvantage that the OS may be more likely to
> decide to write back dirty pages to disk, which could hurt
> performance. However, I believe it's worthy of inclusion all the
> same, because there are a variety of situations in which it might be
> more convenient than one of the other implementations. One is
> debugging.

Hm. Not sure what the advantage over a corefile is here.

> On MacOS X, for example, there seems to be no way to list
> POSIX shared memory segments, and no easy way to inspect the contents
> of either POSIX or System V shared memory segments.

Shouldn't we ourselves know which segments are around?

> Another use case
> is working around an administrator-imposed or OS-imposed shared memory
> limit. If you're not allowed to allocate shared memory, but you are
> allowed to create files, then this implementation will let you use
> whatever facilities we build on top of dynamic shared memory anyway.

I don't think we should try to work around limits like that.

> A third possible reason to use this implementation is
> compartmentalization. For example, you can put the directory that
> stores the dynamic shared memory segments on a RAM disk - which
> removes the performance concern - and then do whatever you like with
> that directory: secure it, put filesystem quotas on it, or sprinkle
> magic pixie dust on it. It doesn't even seem out of the question that
> there might be cases where there are multiple RAM disks present with
> different performance characteristics (e.g. on NUMA machines) and this
> would provide fine-grained control over where your shared memory
> segments get placed. To make a long story short, I won't be crushed
> if the consensus is against including this, but I think it's useful.

-1 so far. Seems a bit handwavy to me.

> Other implementations are imaginable but not implemented here. For
> example, you can imagine using the mmap() of an anonymous file.
> However, since the point is that these segments are created on the fly
> by individual backends and then shared with other backends, that gets
> a little tricky. In order for the second backend to map the same
> anonymous shared memory segment that the first one mapped, you'd have
> to pass the file descriptor from one process to the other.

It wouldn't even work. Several mappings of /dev/zero et al. do *not*
result in the same virtual memory being mapped. Not even when using the
same (passed around) fd.
Believe me, I tried ;)

> There are quite a few problems that this patch does not solve. First,
> while it does give you a shared memory segment, it doesn't provide you
> with any help at all in figuring out what to put in that segment. The
> task of figuring out how to communicate usefully through shared memory
> is thus, for the moment, left entirely to the application programmer.
> While there may be cases where that's just right, I suspect there will
> be a wider range of cases where it isn't, and I plan to work on some
> additional facilities, sitting on top of this basic structure, next,
> though probably as a separate patch.

Agreed.

> Second, it doesn't make any
> policy decisions about what is sensible either in terms of number of
> shared memory segments or the sizes of those segments, even though
> there are serious practical limits in both cases. Actually, the total
> number of segments system-wide is limited by the size of the control
> segment, which is sized based on MaxBackends. But there's nothing to
> keep a single backend from eating up all the slots, even though that's
> both pretty unfriendly and unportable, and there's no real limit to
> the amount of memory it can gobble up per slot, either. In other
> words, it would be a bad idea to write a contrib module that exposes a
> relatively uncooked version of this layer to the user.

At this point I am rather unconcerned with this point to be
honest.

> --- /dev/null
> +++ b/src/include/storage/dsm.h
> @@ -0,0 +1,40 @@
> +/*-------------------------------------------------------------------------
> + *
> + * dsm.h
> + * manage dynamic shared memory segments
> + *
> + * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + * src/include/storage/dsm.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef DSM_H
> +#define DSM_H
> +
> +#include "storage/dsm_impl.h"
> +
> +typedef struct dsm_segment dsm_segment;
> +
> +/* Initialization function. */
> +extern void dsm_postmaster_startup(void);
> +
> +/* Functions that create, update, or remove mappings. */
> +extern dsm_segment *dsm_create(uint64 size, char *preferred_address);
> +extern dsm_segment *dsm_attach(dsm_handle h, char *preferred_address);
> +extern void *dsm_resize(dsm_segment *seg, uint64 size,
> + char *preferred_address);
> +extern void *dsm_remap(dsm_segment *seg, char *preferred_address);
> +extern void dsm_detach(dsm_segment *seg);

Why do we want to expose something unreliable as preferred_address to
the external interface? I haven't read the code yet, so I might be
missing something here.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-08-28 19:20:57
Message-ID: CA+Tgmob8vX+zCoxnif-SXzpZUVfQpcMBec6oV4Pg7p+VUHK+tw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 27, 2013 at 10:07 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> [just sending an email which sat in my outbox for two weeks]

Thanks for taking a look.

> Nice to see this coming. I think it will actually be interesting for
> quite some things outside parallel query, but we'll see.

Yeah, I hope so. The applications may be somewhat limited by the fact
that there are apparently fairly small limits to how many shared
memory segments you can map at the same time. I believe on one system
I looked at (some version of HP-UX?) the limit was 11. So we won't be
able to go nuts with this: using it definitely introduces all kinds of
failure modes that we don't have today. But it will also let us do
some pretty cool things that we CAN'T do today.

>> To help solve these problems, I invented something called the "dynamic
>> shared memory control segment". This is a dynamic shared memory
>> segment created at startup (or reinitialization) time by the
>> postmaster before any user processes are created. It is used to store a
>> list of the identities of all the other dynamic shared memory segments
>> we have outstanding and the reference count of each. If the
>> postmaster goes through a crash-and-reset cycle, it scans the control
>> segment and removes all the other segments mentioned there, and then
>> recreates the control segment itself. If the postmaster is killed off
>> (e.g. kill -9) and restarted, it locates the old control segment and
>> proceeds similarly.
>
> That way any corruption in that area will prevent restarts without
> reboot unless you use ipcrm, or such, right?

The way I've designed it, no. If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name. This is similar to what we do for the main
shared memory segment.

>> Creating a shared memory segment is a somewhat operating-system
>> dependent task. I decided that it would be smart to support several
>> different implementations and to let the user choose which one they'd
>> like to use via a new GUC, dynamic_shared_memory_type.
>
> I think we want that during development, but I'd rather not go there
> when releasing. After all, we don't support a manual choice between
> anonymous mmap/sysv shmem either.

That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory. We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.

>> In addition, I've included an implementation based on mmap of a plain
>> file. As compared with a true shared memory implementation, this
>> obviously has the disadvantage that the OS may be more likely to
>> decide to write back dirty pages to disk, which could hurt
>> performance. However, I believe it's worthy of inclusion all the
>> same, because there are a variety of situations in which it might be
>> more convenient than one of the other implementations. One is
>> debugging.
>
> Hm. Not sure what the advantage over a corefile is here.

You can look at it while the server's running.

>> On MacOS X, for example, there seems to be no way to list
>> POSIX shared memory segments, and no easy way to inspect the contents
>> of either POSIX or System V shared memory segments.
>
> Shouldn't we ourselves know which segments are around?

Sure, that's the point of the control segment. But listing a
directory is a lot easier than figuring out what the current control
segment contents are.

>> Another use case
>> is working around an administrator-imposed or OS-imposed shared memory
>> limit. If you're not allowed to allocate shared memory, but you are
>> allowed to create files, then this implementation will let you use
>> whatever facilities we build on top of dynamic shared memory anyway.
>
> I don't think we should try to work around limits like that.

I do. There's probably someone, somewhere in the world who thinks
that operating system shared memory limits are a good idea, but I have
not met any such person. There are multiple ways to create shared
memory, and they all have different limits. Normally, System V limits
are small, POSIX limits are large, and the inherited-anonymous-mapping
trick we're now using for the main shared memory segment has no limits
at all. It's very common to run into a system where you can allocate
many gigabytes of backend-private memory, but if you try to
allocate 64MB of *shared* memory, you get the axe - or maybe not,
depending on which API you use to create it.

I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here. I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work. I think this case is roughly similar
to wal_sync_method: there really shouldn't be a performance or
reliability difference between the ~6 ways of flushing a file to disk,
but as it turns out, there is, so we have an option. If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.

> It wouldn't even work. Several mappings of /dev/zero et al. do *not*
> result in the same virtual memory being mapped. Not even when using the
> same (passed around) fd.
> Believe me, I tried ;)

OK, well that's another reason I didn't do it that way, then. :-)

> At this point I am rather unconcerned with this point to be
> honest.

I think that's appropriate; mostly, I wanted to emphasize that the
wisdom of allocating any given amount of shared memory is outside the
scope of this patch, which only aims to provide mechanism, not policy.

> Why do we want to expose something unreliable as preferred_address to
> the external interface? I haven't read the code yet, so I might be
> missing something here.

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer. But even if it isn't reliable, there doesn't
seem to be all that much value in forbidding access to that part of
the OS-provided API. In the world where it's not reliable, it may
still be convenient to map things at the same address when you can, so
that pointers can be used. Of course you'd have to have some
fallback strategy for when you don't get the same mapping, and maybe
that's painful enough that there's no point after all. Or maybe it's
worth having one code path for relativized pointers and another for
non-relativized pointers.

To be honest, I'm not real sure. I think it's clear enough that this
will meet the minimal requirements for parallel query - ONE dynamic
shared memory segment that's not guaranteed to be at the same address
in every backend, and can't be resized after creation. And we could
pare the API down to only support that. But I'd rather get some
experience with this first before we start taking away options.
Otherwise, we may never really find out the limits of what is possible
in this area, and I think that would be a shame.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-08-30 00:12:07
Message-ID: 521FE357.8080208@nasby.net
Lists: pgsql-hackers

On 8/13/13 8:09 PM, Robert Haas wrote:
> is removed, the segment automatically goes away (we could allow for
> server-lifespan segments as well with only trivial changes, but I'm
> not sure whether there are compelling use cases for that).

To clarify... you're talking about something that would intentionally survive postmaster restart? I don't see a use for that either...

> postmaster startup. The other problem, of making sure that segments
> get unmapped at the proper time, is solved using the resource owner
> mechanism. There is an API to create a mapping which is
> session-lifespan rather than resource-owner lifespan, but the default
> is resource-owner lifespan, which I suspect will be right for common
> uses. Thus, there are four separate occasions on which we remove
> shared memory segments: (1) resource owner cleanup, (2) backend exit
> (for any session-lifespan mappings and anything else that slips
> through the cracks), (3) postmaster exit (in case a child dies without
> cleaning itself up), and (4) postmaster startup (in case the
> postmaster dies without cleaning up).

Ignorant question... is ResourceOwner related to memory contexts? If not, would memory contexts be a better way to handle memory segment cleanup?

> There are quite a few problems that this patch does not solve. First,

It also doesn't provide any mechanism for notifying backends of a new segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be useful to have a standard method or three of doing that; perhaps just some convenience functions wrapping the methods mentioned in comments.

> Finally, I'd like to thank Noah Misch for a lot of discussion and
> thought that enabled me to make this patch much better than it
> otherwise would have been. Although I didn't adopt Noah's preferred
> solutions to all of the problems, and although there are probably
> still some problems buried here, there would have been more if not for
> his advice. I'd also like to thank the entire database server team at
> EnterpriseDB for allowing me to dump large piles of work on them so
> that I could work on this, and my boss, Tom Kincaid, for not allowing
> other people to dump large piles of work on me.

Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared memory is something we've needed forever! :)

Other comments...

+ * If the state file is empty or the contents are garbled, it probably means
+ * that the operating system rebooted before the data written by the previous
+ * postmaster made it to disk. In that case, we can just ignore it; any shared
+ * memory from before the reboot should be gone anyway.

I'm a bit concerned about this; I know it was possible in older versions for the global shared memory context to be left behind after a crash and needing to clean it up by hand. Dynamic shared mem potentially multiplies that by 100 or more. I think it'd be worth changing dsm_write_state_file so it always writes a new file and then does an atomic mv (or something similar).

+ * If some other backend exited uncleanly, it might have corrupted the
+ * control segment while it was dying. In that case, we warn and ignore
+ * the contents of the control segment. This may end up leaving behind
+ * stray shared memory segments, but there's not much we can do about
+ * that if the metadata is gone.

Similar concern... in this case, would it be possible to always write updates to an un-used slot and then atomically update a pointer? This would be more work than what I suggested above, so maybe just a TODO for now...

Though... is there anything a dying backend could do that would corrupt the control segment to the point that it would screw up segments allocated by other backends and not related to the dead backend? Like marking a slot as not used when it is still in use and isn't associated with the dead backend? (I'm assuming that if a backend dies unexpectedly then all other backends using memory shared with that backend will need to handle themselves accordingly so that we don't need to worry about that in dsm.c.)

I was able to simplify dsm_create a bit (depending on your definition of simplify...). I'm not sure whether the community is OK with using an ereport to exit a loop (though that could safely be moved outside the loop...). In any case, I traded 5 lines of (mostly) duplicate code for an if{} and a break:

+ nitems = dsm_control->nitems;
+ for (i = 0; i <= nitems; ++i) /* Intentionally go one slot past what's currently been allocated */
+ {
+ if (dsm_control->item[i].refcnt == 0)
+ {
+ dsm_control->item[i].handle = seg->handle;
+ /* refcnt of 1 triggers destruction, so start at 2 */
+ dsm_control->item[i].refcnt = 2;
+ seg->control_slot = i;
+ if (i == nitems) /* We hit the end of the list */
+ {
+ /* Verify that we can support an additional mapping. */
+ if (nitems >= dsm_control->maxitems)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ errmsg("too many dynamic shared memory segments")));
+
+ dsm_control->nitems++;
+ }
+ break;
+ }
+ }
+
+ LWLockRelease(DynamicSharedMemoryControlLock);
+ return seg;

Should this (in dsm_attach)

+ * If you're hitting this error, you probably want to use attempt to

be

+ * If you're hitting this error, you probably want to attempt to

?

Should dsm_impl_op sanity check the arguments after op? I didn't notice checks in the type-specific code but I also didn't read all of it... are we just depending on the OS to sanity-check?

Also, does the GUC code enforce that the GUC must always be something that's supported? If not then the error in dsm_impl_op should be more user-friendly.

I basically stopped reading after dsm_impl_op... the rest of the stuff was rather over my head.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-08-30 15:45:39
Message-ID: 20130830154539.GK5019@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
> > That way any corruption in that area will prevent restarts without
> > reboot unless you use ipcrm, or such, right?
>
> The way I've designed it, no. If what we expect to be the control
> segment doesn't exist or doesn't conform to our expectations, we just
> assume that it's not really the control segment after all - e.g.
> someone rebooted, clearing all the segments, and then an unrelated
> process (malicious, perhaps, or just a completely different cluster)
> reused the same name. This is similar to what we do for the main
> shared memory segment.

The case I am mostly wondering about is some process crashing and
overwriting random memory. We need to be pretty sure that we'll never
fail partially through cleaning up old segments because they are
corrupted or because we died halfway through our last cleanup attempt.

> > I think we want that during development, but I'd rather not go there
> > when releasing. After all, we don't support a manual choice between
> > anonymous mmap/sysv shmem either.

> That's true, but that decision has not been uncontroversial - e.g. the
> NetBSD guys don't like it, because they have a big performance
> difference between those two types of memory. We have to balance the
> possible harm of one more setting against the benefit of letting
> people do what they want without needing to recompile or modify code.

But then, it made them fix the issue afaik :P

> >> In addition, I've included an implementation based on mmap of a plain
> >> file. As compared with a true shared memory implementation, this
> >> obviously has the disadvantage that the OS may be more likely to
> >> decide to write back dirty pages to disk, which could hurt
> >> performance. However, I believe it's worthy of inclusion all the
> >> same, because there are a variety of situations in which it might be
> >> more convenient than one of the other implementations. One is
> >> debugging.
> >
> > Hm. Not sure what's the advantage over a corefile here.

> You can look at it while the server's running.

That's what debuggers are for.

> >> On MacOS X, for example, there seems to be no way to list
> >> POSIX shared memory segments, and no easy way to inspect the contents
> >> of either POSIX or System V shared memory segments.

> > Shouldn't we ourselves know which segments are around?

> Sure, that's the point of the control segment. But listing a
> directory is a lot easier than figuring out what the current control
> segment contents are.

But without a good amount of tooling - like in a debugger... - it's not
very interesting to look at those files either way? The mere presence of
a segment doesn't tell you much and the contents won't be easily
readable.

> >> Another use case is working around an administrator-imposed or
> >> OS-imposed shared memory limit. If you're not allowed to allocate
> >> shared memory, but you are allowed to create files, then this
> >> implementation will let you use whatever facilities we build on top
> >> of dynamic shared memory anyway.
> >
> > I don't think we should try to work around limits like that.

> I do. There's probably someone, somewhere in the world who thinks
> that operating system shared memory limits are a good idea, but I have
> not met any such person.

"Let's drive users away from sysv shmem" is the only one I heard so far ;)

> I would never advocate deliberately trying to circumvent a
> carefully-considered OS-level policy decision about resource
> utilization, but I don't think that's the dynamic here. I think if we
> insist on predetermining the dynamic shared memory implementation
> based on the OS, we'll just be inconveniencing people needlessly, or
> flat-out making things not work. [...]

But using file-backed memory will *suck* performancewise. Why should we
ever want to offer that to a user? That's what I was arguing about
primarily.

> If we're SURE
> that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
> 100% of cases, and that a NetBSD user will always prefer "sysv" over
> "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
> But I'm not that sure.

I think posix shmem will be preferred to sysv shmem if present, in just
about any relevant case. I don't know of any system with lower limits on
posix shmem than on sysv.

> I think this case is roughly similar
> to wal_sync_method: there really shouldn't be a performance or
> reliability difference between the ~6 ways of flushing a file to disk,
> but as it turns out, there is, so we have an option.

Well, most of them actually give different guarantees, so it makes sense
to have differing performance...

> > Why do we want to expose something unreliable as preferred_address to
> > the external interface? I haven't read the code yet, so I might be
> > missing something here.

> I shared your opinion that preferred_address is never going to be
> reliable, although FWIW Noah thinks it can be made reliable with a
> large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you even remotely could make that work a) on
windows (check the troubles we have to go through to get s_b
consistently placed, and that's directly after startup) b) 32bit systems.

> But even if it isn't reliable, there doesn't seem to be all that much
> value in forbidding access to that part of the OS-provided API. In
> the world where it's not reliable, it may still be convenient to map
> things at the same address when you can, so that pointers can't be
> used. Of course you'd have to have some fallback strategy for when
> you don't get the same mapping, and maybe that's painful enough that
> there's no point after all. Or maybe it's worth having one code path
> for relativized pointers and another for non-relativized pointers.

It seems likely to me that we will end up with untested code in that
case. Or even unsupported platforms.

> To be honest, I'm not real sure. I think it's clear enough that this
> will meet the minimal requirements for parallel query - ONE dynamic
> shared memory segment that's not guaranteed to be at the same address
> in every backend, and can't be resized after creation. And we could
> pare the API down to only support that. But I'd rather get some
> experience with this first before we start taking away options.
> Otherwise, we may never really find out the limits of what is possible
> in this area, and I think that would be a shame.

On the other hand, adding capabilities, then deciding in the end that we
can't support them and taking them away, annoys people far more than
never adding them.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-08-31 05:12:32
Message-ID: CAA4eK1LQzophhQyEXt7WFjaPv40oOVMbiA_+7X1=PZvafpGebA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 9:15 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
>> > That way any corruption in that area will prevent restarts without
>> > reboot unless you use ipcrm, or such, right?
>>
>> The way I've designed it, no. If what we expect to be the control
>> segment doesn't exist or doesn't conform to our expectations, we just
>> assume that it's not really the control segment after all - e.g.
>> someone rebooted, clearing all the segments, and then an unrelated
>> process (malicious, perhaps, or just a completely different cluster)
>> reused the same name. This is similar to what we do for the main
>> shared memory segment.
>
> The case I am mostly wondering about is some process crashing and
> overwriting random memory. We need to be pretty sure that we'll never
> fail partially through cleaning up old segments because they are
> corrupted or because we died halfway through our last cleanup attempt.
>
>> > I think we want that during development, but I'd rather not go there
>> > when releasing. After all, we don't support a manual choice between
>> > anonymous mmap/sysv shmem either.
>
>> That's true, but that decision has not been uncontroversial - e.g. the
>> NetBSD guys don't like it, because they have a big performance
>> difference between those two types of memory. We have to balance the
>> possible harm of one more setting against the benefit of letting
>> people do what they want without needing to recompile or modify code.
>
> But then, it made them fix the issue afaik :P
>
>> >> In addition, I've included an implementation based on mmap of a plain
>> >> file. As compared with a true shared memory implementation, this
>> >> obviously has the disadvantage that the OS may be more likely to
>> >> decide to write back dirty pages to disk, which could hurt
>> >> performance. However, I believe it's worthy of inclusion all the
>> >> same, because there are a variety of situations in which it might be
>> >> more convenient than one of the other implementations. One is
>> >> debugging.
>> >
>> > Hm. Not sure what's the advantage over a corefile here.
>
>> You can look at it while the server's running.
>
> That's what debuggers are for.
>
>> >> On MacOS X, for example, there seems to be no way to list
>> >> POSIX shared memory segments, and no easy way to inspect the contents
>> >> of either POSIX or System V shared memory segments.
>
>> > Shouldn't we ourselves know which segments are around?
>
>> Sure, that's the point of the control segment. But listing a
>> directory is a lot easier than figuring out what the current control
>> segment contents are.
>
> But without a good amount of tooling - like in a debugger... - it's not
> very interesting to look at those files either way? The mere presence of
> a segment doesn't tell you much and the contents won't be easily
> readable.
>
>> >> Another use case is working around an administrator-imposed or
>> >> OS-imposed shared memory limit. If you're not allowed to allocate
>> >> shared memory, but you are allowed to create files, then this
>> >> implementation will let you use whatever facilities we build on top
>> >> of dynamic shared memory anyway.
>> >
>> > I don't think we should try to work around limits like that.
>
>> I do. There's probably someone, somewhere in the world who thinks
>> that operating system shared memory limits are a good idea, but I have
>> not met any such person.
>
> "Let's drive users away from sysv shmem" is the only one I heard so far ;)
>
>> I would never advocate deliberately trying to circumvent a
>> carefully-considered OS-level policy decision about resource
>> utilization, but I don't think that's the dynamic here. I think if we
>> insist on predetermining the dynamic shared memory implementation
>> based on the OS, we'll just be inconveniencing people needlessly, or
>> flat-out making things not work. [...]
>
> But using file-backed memory will *suck* performancewise. Why should we
> ever want to offer that to a user? That's what I was arguing about
> primarily.
>
>> If we're SURE
>> that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
>> 100% of cases, and that a NetBSD user will always prefer "sysv" over
>> "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
>> But I'm not that sure.
>
> I think posix shmem will be preferred to sysv shmem if present, in just
> about any relevant case. I don't know of any system with lower limits on
> posix shmem than on sysv.
>
>> I think this case is roughly similar
>> to wal_sync_method: there really shouldn't be a performance or
>> reliability difference between the ~6 ways of flushing a file to disk,
>> but as it turns out, there is, so we have an option.
>
> Well, most of them actually give different guarantees, so it makes sense
> to have differing performance...
>
>> > Why do we want to expose something unreliable as preferred_address to
>> > the external interface? I haven't read the code yet, so I might be
>> > missing something here.
>
>> I shared your opinion that preferred_address is never going to be
>> reliable, although FWIW Noah thinks it can be made reliable with a
>> large-enough hammer.
>
> I think we need to have the arguments for that on list then. Those are
> pretty damn fundamental design decisions.
> I for one cannot see how you even remotely could make that work a) on
> windows (check the troubles we have to go through to get s_b
> consistently placed, and that's directly after startup) b) 32bit systems.

For Windows, I believe we are already doing something similar
(attaching at a predefined address) for the main shared memory
segment. It reserves memory at a particular address using
pgwin32_ReserveSharedMemoryRegion() before actually starting
(resuming a process created in suspended mode) a process, and then
after startup the backend attaches at the same address
(PGSharedMemoryReAttach).

I think one question here is what the use of exposing
preferred_address is, to which I can think of only the following:

a. The base OS APIs provide such a provision, so why don't we?
b. While browsing, I found a few examples on IBM's site that also
show usage with a preferred address:
http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.vacpp7a.doc%2Fproguide%2Fref%2Fcreate_heap.htm
c. A user may wish to attach segments at the same base address, so
that pointers into the memory-mapped file can be used directly,
which otherwise would not be possible.

>> But even if it isn't reliable, there doesn't seem to be all that much
>> value in forbidding access to that part of the OS-provided API. In
>> the world where it's not reliable, it may still be convenient to map
>> things at the same address when you can, so that pointers can't be
>> used. Of course you'd have to have some fallback strategy for when
>> you don't get the same mapping, and maybe that's painful enough that
>> there's no point after all. Or maybe it's worth having one code path
>> for relativized pointers and another for non-relativized pointers.
>
> It seems likely to me that we will end up with untested code in that
> case. Or even unsupported platforms.
>
>> To be honest, I'm not real sure. I think it's clear enough that this
>> will meet the minimal requirements for parallel query - ONE dynamic
>> shared memory segment that's not guaranteed to be at the same address
>> in every backend, and can't be resized after creation. And we could
>> pare the API down to only support that. But I'd rather get some
>> experience with this first before we start taking away options.
>> Otherwise, we may never really find out the limits of what is possible
>> in this area, and I think that would be a shame.
>
> On the other hand, adding capabilities, then deciding in the end that we
> can't support them and taking them away, annoys people far more than
> never adding them.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-08-31 12:17:50
Message-ID: CA+TgmobmkQb3E63FVUhBKhgj3ST5J+EpWQ3e4+eVqo9xh8SmCA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On 8/13/13 8:09 PM, Robert Haas wrote:
>> is removed, the segment automatically goes away (we could allow for
>> server-lifespan segments as well with only trivial changes, but I'm
>> not sure whether there are compelling use cases for that).
>
> To clarify... you're talking something that would intentionally survive
> postmaster restart? I don't see use for that either...

No, I meant something that would live as long as the postmaster and
die when it dies.

> Ignorant question... is ResourceOwner related to memory contexts? If not,
> would memory contexts be a better way to handle memory segment cleanup?

Nope. :-)

>> There are quite a few problems that this patch does not solve. First,
>
> It also doesn't provide any mechanism for notifying backends of a new
> segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be
> useful to have a standard method or three of doing that; perhaps just some
> convenience functions wrapping the methods mentioned in comments.

I don't see that as being generally useful. Backends need to know
more than "there's a new segment", and in fact most backends won't
care about most new segments. A background worker needs to know about
the new segment *to which it should attach*, but we have bgw_main_arg. If
we end up using this facility for any system-wide purposes, I imagine
we'll do that by storing the segment ID in the main shared memory
segment someplace.

> Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared
> memory is something we've needed forever! :)

Thanks.

> Other comments...
>
> + * If the state file is empty or the contents are garbled, it probably
> means
> + * that the operating system rebooted before the data written by the
> previous
> + * postmaster made it to disk. In that case, we can just ignore it; any
> shared
> + * memory from before the reboot should be gone anyway.
>
> I'm a bit concerned about this; I know it was possible in older versions for
> the global shared memory context to be left behind after a crash and needing
> to clean it up by hand. Dynamic shared mem potentially multiplies that by
> 100 or more. I think it'd be worth changing dsm_write_state_file so it
> always writes a new file and then does an atomic mv (or something similar).

I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that. However, I don't agree that writing the state file in a
different way would improve anything.

> + * If some other backend exited uncleanly, it might have corrupted
> the
> + * control segment while it was dying. In that case, we warn and
> ignore
> + * the contents of the control segment. This may end up leaving
> behind
> + * stray shared memory segments, but there's not much we can do
> about
> + * that if the metadata is gone.
>
> Similar concern... in this case, would it be possible to always write
> updates to an un-used slot and then atomically update a pointer? This would
> be more work than what I suggested above, so maybe just a TODO for now...
>
> Though... is there anything a dying backend could do that would corrupt the
> control segment to the point that it would screw up segments allocated by
> other backends and not related to the dead backend? Like marking a slot as
> not used when it is still in use and isn't associated with the dead backend?

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not use a control segment at all, which wouldn't be better.

> Should this (in dsm_attach)
>
> + * If you're hitting this error, you probably want to use attempt to
>
> be
>
> + * If you're hitting this error, you probably want to attempt to
>
> ?

Good point.

> Should dsm_impl_op sanity check the arguments after op? I didn't notice
> checks in the type-specific code but I also didn't read all of it... are we
> just depending on the OS to sanity-check?

Sanity-check for what?

> Also, does the GUC code enforce that the GUC must always be something that's
> supported? If not then the error in dsm_impl_op should be more
> user-friendly.

Yes.

> I basically stopped reading after dsm_impl_op... the rest of the stuff was
> rather over my head.

:-)

Thanks for your interest!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-08-31 12:27:14
Message-ID: CA+Tgmoa-8gett_hkNsQ9r04rVW-hoAzTobQr3SCGKocRo=7tUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> The way I've designed it, no. If what we expect to be the control
>> segment doesn't exist or doesn't conform to our expectations, we just
>> assume that it's not really the control segment after all - e.g.
>> someone rebooted, clearing all the segments, and then an unrelated
>> process (malicious, perhaps, or just a completely different cluster)
>> reused the same name. This is similar to what we do for the main
>> shared memory segment.
>
> The case I am mostly wondering about is some process crashing and
> overwriting random memory. We need to be pretty sure that we'll never
> fail partially through cleaning up old segments because they are
> corrupted or because we died halfway through our last cleanup attempt.

Right. I had those considerations in mind and I believe I have nailed
the hatch shut pretty tight. The cleanup code is designed never to
die with an error. Of course it might, but it would have to be
something like an out of memory failure or similar that isn't really
what we're concerned about here. You are welcome to look for holes,
but these issues are where most of my brainpower went during
development.

>> That's true, but that decision has not been uncontroversial - e.g. the
>> NetBSD guys don't like it, because they have a big performance
>> difference between those two types of memory. We have to balance the
>> possible harm of one more setting against the benefit of letting
>> people do what they want without needing to recompile or modify code.
>
> But then, it made them fix the issue afaik :P

Pah. :-)

>> You can look at it while the server's running.
>
> That's what debuggers are for.

Tough crowd. I like it. YMMV.

>> I would never advocate deliberately trying to circumvent a
>> carefully-considered OS-level policy decision about resource
>> utilization, but I don't think that's the dynamic here. I think if we
>> insist on predetermining the dynamic shared memory implementation
>> based on the OS, we'll just be inconveniencing people needlessly, or
>> flat-out making things not work. [...]
>
> But using file-backed memory will *suck* performancewise. Why should we
> ever want to offer that to a user? That's what I was arguing about
> primarily.

I see. There might be additional writeback traffic, but it might not
be that bad in common cases. After all, the data's pretty hot.

>> If we're SURE
>> that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
>> 100% of cases, and that a NetBSD user will always prefer "sysv" over
>> "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
>> But I'm not that sure.
>
> I think posix shmem will be preferred to sysv shmem if present, in just
> about any relevant case. I don't know of any system with lower limits on
> posix shmem than on sysv.

OK, how about this.... SysV doesn't allow extending segments, but
mmap does. The thing here is that you're saying "remove mmap and keep
sysv" but Noah suggested to me that we remove sysv and keep mmap.
This suggests to me that the picture is not so black and white as you
think it is.

>> I shared your opinion that preferred_address is never going to be
>> reliable, although FWIW Noah thinks it can be made reliable with a
>> large-enough hammer.
>
> I think we need to have the arguments for that on list then. Those are
> pretty damn fundamental design decisions.
> I for one cannot see how you even remotely could make that work a) on
> windows (check the troubles we have to go through to get s_b
> consistently placed, and that's directly after startup) b) 32bit systems.

Noah?

>> But even if it isn't reliable, there doesn't seem to be all that much
>> value in forbidding access to that part of the OS-provided API. In
>> the world where it's not reliable, it may still be convenient to map
>> things at the same address when you can, so that pointers can't be
>> used. Of course you'd have to have some fallback strategy for when
>> you don't get the same mapping, and maybe that's painful enough that
>> there's no point after all. Or maybe it's worth having one code path
>> for relativized pointers and another for non-relativized pointers.
>
> It seems likely to me that we will end up with untested code in that
> case. Or even unsupported platforms.

Maybe. I think for the amount of code we're talking about here, it's
not worth getting excited about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-01 13:24:00
Message-ID: 20130901132400.GA100090@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Aug 31, 2013 at 08:27:14AM -0400, Robert Haas wrote:
> On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> I shared your opinion that preferred_address is never going to be
> >> reliable, although FWIW Noah thinks it can be made reliable with a
> >> large-enough hammer.
> >
> > I think we need to have the arguments for that on list then. Those are
> > pretty damn fundamental design decisions.

I somewhat disfavor having a vague "preferred_address" parameter. mmap()'s
first argument is specified that way, but mmap()'s specification caters to an
open-ended range of implementations and clients. A PostgreSQL backend
interface can be more rigid. If we choose to support fixed-address callers,
let those receive either the requested address or an ereport(ERROR). If the
caller does not care, make no effort to provide a consistent address. (Better
still, under --enable-cassert, try to force the address to differ across
processes.)

[quotations reordered]
> >> But even if it isn't reliable, there doesn't seem to be all that much
> >> value in forbidding access to that part of the OS-provided API. In

That's also valid, though. Even if no core code exploits the flexibility,
3rd-party code might do so.

> >> the world where it's not reliable, it may still be convenient to map
> >> things at the same address when you can, so that pointers can't be
> >> used. Of course you'd have to have some fallback strategy for when
> >> you don't get the same mapping, and maybe that's painful enough that
> >> there's no point after all. Or maybe it's worth having one code path
> >> for relativized pointers and another for non-relativized pointers.
> >
> > It seems likely to me that will end up with untested code in that
> > case. Or even unsupported platforms.

I agree. It would take an exceptional use case to justify such parallel code
paths; I won't expect that to ever happen for core code.

> > I for one cannot see how you even remotely could make that work a) on
> > windows (check the troubles we have to go through to get s_b
> > consistently placed, and that's directly after startup) b) 32bit systems.
>
> Noah?

The difficulty depends on whether processes other than the segment's creator
will attach anytime or only as they start. Attachment at startup is enough
for parallel query, but it's not enough for something like lock table
expansion. I'll focus on the attach-anytime case since it's more general.

On a system supporting MAP_FIXED, implement this by having the postmaster
reserve address space under a PROT_NONE mapping, then carving out from that
mapping for each fixed-address dynamic segment. The size of the reservation
would be controlled by a GUC; one might set it to several times anticipated
peak usage. (The overhead of doing that depends on the kernel.) Windows
permits the same technique with its own primitives.

A system where mmap() accepts only a zero address in practice (HP-UX,
according to Gnulib, although HP docs suggest it has improved over time)
requires a different technique. For those systems, expand the regular shared
memory segment and carve from that to make "dynamic" segments. This amounts
to adding ShmemFree() to supplement ShmemAlloc(). If a core platform had to
use this implementation, its disadvantages would be sufficient to discard the
whole idea of reliable fixed addresses. But I find it acceptable if it's a
crutch for older kernels, rare hardware, etc.

I don't foresee fundamental differences on 32-bit. All the allocation
maximums scale down, but that's the usual story for 32-bit.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-01 15:08:38
Message-ID: 20130901150838.GB21721@awork2.anarazel.de

Hi Noah,

On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
> > >> But even if it isn't reliable, there doesn't seem to be all that much
> > >> value in forbidding access to that part of the OS-provided API. In
>
> That's also valid, though. Even if no core code exploits the flexibility,
> 3rd-party code might do so.

It seems more likely that 3rd party code misunderstands the
limitations. But perhaps that's being too picky.

> > > I for one cannot see how you even remotely could make that work a) on
> > > windows (check the troubles we have to go through to get s_b
> > > consistently placed, and that's directly after startup) b) 32bit systems.
> >
> > Noah?
>
> The difficulty depends on whether processes other than the segment's creator
> will attach anytime or only as they start. Attachment at startup is enough
> for parallel query, but it's not enough for something like lock table
> expansion. I'll focus on the attach-anytime case since it's more general.

Even on startup it might get more complicated than one immediately
imagines on EXEC_BACKEND type platforms, because their memory layout
doesn't need to be the same across processes. The more shared memory you
need, the harder that will be, AFAIR.

> On a system supporting MAP_FIXED, implement this by having the postmaster
> reserve address space under a PROT_NONE mapping, then carving out from that
> mapping for each fixed-address dynamic segment. The size of the reservation
> would be controlled by a GUC; one might set it to several times anticipated
> peak usage. (The overhead of doing that depends on the kernel.) Windows
> permits the same technique with its own primitives.

Note that allocating a large mapping, even without using it, has
noticeable cost, at least under Linux. The kernel has to create & copy
data to track each page's state (without copying the memory contents
themselves, due to COW) for every fork afterwards. If you don't believe me,
check the whole discussion about the Go language's memory
management...

If that's the solution we go for why don't we just always include heaps
more shared memory and use that (remapping part of it as PROT_NONE)
instead of building the infrastructure in this patch?

> A system where mmap() accepts only a zero address in practice (HP-UX,
> according to Gnulib, although HP docs suggest it has improved over time)
> requires a different technique. For those systems, expand the regular shared
> memory segment and carve from that to make "dynamic" segments. This amounts
> to adding ShmemFree() to supplement ShmemAlloc(). If a core platform had to
> use this implementation, its disadvantages would be sufficient to discard the
> whole idea of reliable fixed addresses. But I find it acceptable if it's a
> crutch for older kernels, rare hardware, etc.
>
> I don't foresee fundamental differences on 32-bit. All the allocation
> maximums scale down, but that's the usual story for 32-bit.

If you actually want to allocate memory after starting up, without
carving a section out for that from the beginning, memory
fragmentation will make it very hard to find the same memory
addresses across processes.
If you go for allocating from the start, what you end up with will not
actually be "dynamic shared memory", because you cannot allocate much
(even if not actually backed by memory!) without compromising
performance. So we will end up with a configurable
"dynamic_shared_memory = ..." parameter. And even then, all processes,
even those not actually using the "dynamic" memory, will have to pay the
price of not being able to allocate as much memory as otherwise possible.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-01 16:07:04
Message-ID: 20130901160704.GA104151@tornado.leadboat.com

On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:
> On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
> > The difficulty depends on whether processes other than the segment's creator
> > will attach anytime or only as they start. Attachment at startup is enough
> > for parallel query, but it's not enough for something like lock table
> > expansion. I'll focus on the attach-anytime case since it's more general.
>
> Even on startup it might get more complicated than one immediately
> imagines on EXEC_BACKEND type platforms because their memory layout
> doesn't need to be the same. The more shared memory you need, the harder
> that will be. Afair

Non-Windows EXEC_BACKEND is already facing a dead end that way.

> > On a system supporting MAP_FIXED, implement this by having the postmaster
> > reserve address space under a PROT_NONE mapping, then carving out from that
> > mapping for each fixed-address dynamic segment. The size of the reservation
> > would be controlled by a GUC; one might set it to several times anticipated
> > peak usage. (The overhead of doing that depends on the kernel.) Windows
> > permits the same technique with its own primitives.
>
> Note that allocating a large mapping, even without using it, has
> noticeable cost, at least under Linux. The kernel has to create & copy
> data to track each page's state (without copying the memory contents
> themselves, due to COW) for every fork afterwards. If you don't believe me,
> check the whole discussion about the Go language's memory
> management...

I believe you, but I'd appreciate a link to the discussion you have in mind.

> If that's the solution we go for why don't we just always include heaps
> more shared memory and use that (remapping part of it as PROT_NONE)
> instead of building the infrastructure in this patch?

There would be no freeing of the memory; a usage high water mark would stand
for the life of the postmaster.

> > I don't foresee fundamental differences on 32-bit. All the allocation
> > maximums scale down, but that's the usual story for 32-bit.
>
> If you actually want to allocate memory after starting up, without
> carving a section out for that from the beginning, memory
> fragmentation will make it very hard to find the same memory
> addresses across processes.

True. I wouldn't feel bad if total dynamic shared memory usage above, say,
256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
you probably have a low-memory platform.

I think the take-away is that we have a lot of knobs available, not a bright
line between possible and impossible. Robert opted to omit provision for
reliable fixed addresses, and the upsides of that decision are the absence of
a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
used, and a clearer portability outlook.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-02 22:52:22
Message-ID: 20130902225222.GD11503@awork2.anarazel.de

Hi Noah!

On 2013-09-01 12:07:04 -0400, Noah Misch wrote:
> On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:
> > On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
> > > The difficulty depends on whether processes other than the segment's creator
> > > will attach anytime or only as they start. Attachment at startup is enough
> > > for parallel query, but it's not enough for something like lock table
> > > expansion. I'll focus on the attach-anytime case since it's more general.
> >
> > Even on startup it might get more complicated than one immediately
> > imagines on EXEC_BACKEND type platforms because their memory layout
> > doesn't need to be the same. The more shared memory you need, the harder
> > that will be. Afair
>
> Non-Windows EXEC_BACKEND is already facing a dead end that way.

Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
supported for much longer or that it already has problems.

> > > On a system supporting MAP_FIXED, implement this by having the postmaster
> > > reserve address space under a PROT_NONE mapping, then carving out from that
> > > mapping for each fixed-address dynamic segment. The size of the reservation
> > > would be controlled by a GUC; one might set it to several times anticipated
> > > peak usage. (The overhead of doing that depends on the kernel.) Windows
> > > permits the same technique with its own primitives.
> >
> > Note that allocating a large mapping, even without using it, has
> > noticeable cost, at least under Linux. The kernel has to create & copy
> > data to track each page's state (without copying the memory contents
> > themselves, due to COW) for every fork afterwards. If you don't believe me,
> > check the whole discussion about the Go language's memory
> > management...
>
> I believe you, but I'd appreciate a link to the discussion you have in mind.

Unfortunately I could only find the first half of the discussion about
the issue. Turns out it's not the greatest idea to name your fancy new
programming language "go" (yes, yes, a pet peeve of mine).

http://lkml.org/lkml/2011/2/8/118
https://lwn.net/Articles/428100/

So, after reading up on the issue a bit more and reading some more
kernel code: a large mmap(PROT_NONE, MAP_PRIVATE) won't cause many
problems except counting against ulimit -v. It will *not* cause overcommit
violations. mmap(PROT_NONE, MAP_SHARED) will, though, even if not yet
faulted in. Which means that to be reliable and not violate overcommit, we'd
need to munmap() a chunk of the PROT_NONE, MAP_PRIVATE memory and
immediately (without interceding mallocs, using mmap itself) map it again.

It only gets really expensive, in the sense of making fork expensive, if
you set protections on many regions in that mapping individually. Each
mprotect() call will split the VMA into distinct pieces, and they won't
get merged even if there are neighbors with the same settings.

> > > I don't foresee fundamental differences on 32-bit. All the allocation
> > > maximums scale down, but that's the usual story for 32-bit.
> >
> > If you actually want to allocate memory after starting up, without
> > carving a section out for that from the beginning, memory
> > fragmentation will make it very hard to find the same memory
> > addresses across processes.
>
> True. I wouldn't feel bad if total dynamic shared memory usage above, say,
> 256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
> you probably have a low-memory platform.

Not sure. I think that will partially depend on whether x32 will have
any success which I still find hard to judge.

> I think the take-away is that we have a lot of knobs available, not a bright
> line between possible and impossible. Robert opted to omit provision for
> reliable fixed addresses, and the upsides of that decision are the absence of
> a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
> used, and a clearer portability outlook.

I guess my point is that if we want to develop stuff that requires
reliable addresses, we should build support for that from a low level
up. Not rely on a hack^Wlayer on top of the actual dynamic shared memory
API.
That is, it should be a flag to dsm_create() that says we require a fixed
address, and dsm_attach() will then automatically use that or die
trying. Requiring implementations to take care of passing addresses
around and fiddling with mmap/the Windows API to make sure those mappings
are possible doesn't strike me as a good idea.

In the end, you're going to be the primary/first user as far as I
understand things, so you'll have to argue whether we need fixed
addresses or not. I don't think it's a good idea to forgo this decision
at this layer and bolt another one on top if we decide it's necessary.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-03 01:23:54
Message-ID: CA+TgmoaDnzaKojZaJp047Muq3tw_5iaxfejLzzDzPCQqOgLSSg@mail.gmail.com

On Mon, Sep 2, 2013 at 6:52 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
> supported for much longer or that it already has problems.

I'm not sure what Noah was getting at, but I have used EXEC_BACKEND
twice now during development, in situations where I would otherwise
have needed a Windows development environment. So it's definitely
useful, at least to me. But on my MacBook Pro, you have to compile it
with -fno-pie (I think that's the right flag) to disable ASLR in order
to get reliable operation. I imagine such problems will become
commonplace on more and more platforms as time wears on.

> I guess my point is that if we want to develop stuff that requires
> reliable addresses, we should build support for that from a low level
> up. Not rely on a hack^Wlayer on top of the actual dynamic shared memory
> API.
> That is, it should be a flag to dsm_create() that says we require a fixed
> address, and dsm_attach() will then automatically use that or die
> trying. Requiring implementations to take care of passing addresses
> around and fiddling with mmap/the Windows API to make sure those mappings
> are possible doesn't strike me as a good idea.
>
> In the end, you're going to be the primary/first user as far as I
> understand things, so you'll have to argue whether we need fixed
> addresses or not. I don't think it's a good idea to forgo this decision
> at this layer and bolt another one on top if we decide it's necessary.

I didn't intend to punt that decision to another layer so much as
another patch and a more detailed examination of requirements. IME,
given a choice between something that is 99% reliable and provides
more functionality, or something that is 99.99% reliable and provides
less functionality, this community picks the latter every time. And
that's why I've left out any capability to insist on a fixed address
from this patch. It would be nice to have, to be sure. But it also
would take more work and add more complexity, and I don't have a clear
sense that that work would be justified.

Now, we might get to a point where it seems clear that we're not going
to get any further with parallelism without adding a capability for
fixed-address mappings. If that happens, I think that's the time to
come back to this layer and add that capability. But right now it
doesn't seem essential. Now, having said that, I didn't see any
particular reason to bury the ability to pass mmap() or shmat() a
*preferred* address. But IJWH.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-03 03:31:50
Message-ID: 20130903033150.GA119849@tornado.leadboat.com

On Tue, Sep 03, 2013 at 12:52:22AM +0200, Andres Freund wrote:
> On 2013-09-01 12:07:04 -0400, Noah Misch wrote:
> > On Sun, Sep 01, 2013 at 05:08:38PM +0200, Andres Freund wrote:
> > > On 2013-09-01 09:24:00 -0400, Noah Misch wrote:
> > > > The difficulty depends on whether processes other than the segment's creator
> > > > will attach anytime or only as they start. Attachment at startup is enough
> > > > for parallel query, but it's not enough for something like lock table
> > > > expansion. I'll focus on the attach-anytime case since it's more general.
> > >
> > > Even on startup it might get more complicated than one immediately
> > > imagines on EXEC_BACKEND type platforms because their memory layout
> > > doesn't need to be the same. The more shared memory you need, the harder
> > > that will be. Afair
> >
> > Non-Windows EXEC_BACKEND is already facing a dead end that way.
>
> Not sure whether you mean non-windows EXEC_BACKEND isn't going to be
> supported for much longer or that it already has problems.

It already has problems: ASLR measures sometimes prevent reattachment of the
main shared memory segment. Multiplying the combined size of our
fixed-address mappings does not push us over some threshold where this becomes
a problem, because it is already a problem.

> > > Note that allocating a large mapping, even without using it, has
> > > noticeable cost, at least under Linux. The kernel has to create & copy
> > > data to track each page's state (without copying the memory contents
> > > themselves, due to COW) for every fork afterwards.

> So, after reading up on the issue a bit more and reading some more
> kernel code: a large mmap(PROT_NONE, MAP_PRIVATE) won't cause many
> problems except counting against ulimit -v. It will *not* cause overcommit
> violations. mmap(PROT_NONE, MAP_SHARED) will, though, even if not yet
> faulted in. Which means that to be reliable and not violate overcommit, we'd
> need to munmap() a chunk of the PROT_NONE, MAP_PRIVATE memory and
> immediately (without interceding mallocs, using mmap itself) map it again.
>
> It only gets really expensive, in the sense of making fork expensive, if
> you set protections on many regions in that mapping individually. Each
> mprotect() call will split the VMA into distinct pieces, and they won't
> get merged even if there are neighbors with the same settings.

Thanks for researching that.

> > > > I don't foresee fundamental differences on 32-bit. All the allocation
> > > > maximums scale down, but that's the usual story for 32-bit.
> > >
> > > If you actually want to allocate memory after starting up, without
> > > carving a section out for that from the beginning, memory
> > > fragmentation will make it very hard to find the same memory
> > > addresses across processes.
> >
> > True. I wouldn't feel bad if total dynamic shared memory usage above, say,
> > 256 MiB were unreliable on 32-bit. If you're still running 32-bit in 2015,
> > you probably have a low-memory platform.
>
> Not sure. I think that will partially depend on whether x32 will have
> any success which I still find hard to judge.

I won't hold my breath for x32 becoming a common platform for high-memory
database servers, regardless of other successes it might find. Not
impossible, but I recommend placing trivial priority on maximizing performance
for that scenario.

> > I think the take-away is that we have a lot of knobs available, not a bright
> > line between possible and impossible. Robert opted to omit provision for
> > reliable fixed addresses, and the upsides of that decision are the absence of
> > a DBA-unfriendly space-reservation GUC, trivial overhead when the APIs are not
> > used, and a clearer portability outlook.
>
> I guess my point is that if we want to develop stuff that requires
> reliable addresses, we should build support for that from a low level
> up. Not rely on a hack^Wlayer on top of the actual dynamic shared memory
> API.
> That is, it should be a flag to dsm_create() that says we require a fixed
> address, and dsm_attach() will then automatically use that or die
> trying. Requiring implementations to take care of passing addresses
> around and fiddling with mmap/the Windows API to make sure those mappings
> are possible doesn't strike me as a good idea.

I agree.

> In the end, you're going to be the primary/first user as far as I
> understand things, so you'll have to argue whether we need fixed
> addresses or not. I don't think it's a good idea to forgo this decision
> at this layer and bolt another one on top if we decide it's necessary.

We don't need fixed addresses. Parallel internal sort will probably include
the equivalent of a SortTuple array in its shared memory segment, and that
implies relative pointers to the tuples also stored in shared memory. I
expect that wart to be fairly isolated within the code, so little harm done.

I don't think we will have at all painted ourselves into a corner, should we
wish to lift the limitation later.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-04 22:38:53
Message-ID: 5227B67D.8090108@nasby.net

On 8/31/13 7:17 AM, Robert Haas wrote:
> On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> On 8/13/13 8:09 PM, Robert Haas wrote:
>>> is removed, the segment automatically goes away (we could allow for
>>> server-lifespan segments as well with only trivial changes, but I'm
>>> not sure whether there are compelling use cases for that).
>>
>> To clarify... you're talking something that would intentionally survive
>> postmaster restart? I don't see use for that either...
>
> No, I meant something that would live as long as the postmaster and
> die when it dies.

ISTM that at some point we'll want to look at putting top-level shared memory into this system (ie: allowing dynamic resizing of GUCs that affect shared memory size).

But as you said, it'd be trivial to add that later.

>> Other comments...
>>
>> + * If the state file is empty or the contents are garbled, it probably
>> means
>> + * that the operating system rebooted before the data written by the
>> previous
>> + * postmaster made it to disk. In that case, we can just ignore it; any
>> shared
>> + * memory from before the reboot should be gone anyway.
>>
>> I'm a bit concerned about this; I know it was possible in older versions for
>> the global shared memory context to be left behind after a crash and needing
>> to clean it up by hand. Dynamic shared mem potentially multiplies that by
>> 100 or more. I think it'd be worth changing dsm_write_state_file so it
>> always writes a new file and then does an atomic mv (or something similar).
>
> I agree that the possibilities for leftover shared memory segments are
> multiplied with this new facility, and I've done my best to address
> that. However, I don't agree that writing the state file in a
> different way would improve anything.

Wouldn't it protect against a crash while writing the file? I realize the odds of that are pretty remote, but AFAIK it wouldn't cost that much to write a new file and do an atomic mv...

>> + * If some other backend exited uncleanly, it might have corrupted
>> the
>> + * control segment while it was dying. In that case, we warn and
>> ignore
>> + * the contents of the control segment. This may end up leaving
>> behind
>> + * stray shared memory segments, but there's not much we can do
>> about
>> + * that if the metadata is gone.
>>
>> Similar concern... in this case, would it be possible to always write
>> updates to an un-used slot and then atomically update a pointer? This would
>> be more work than what I suggested above, so maybe just a TODO for now...
>>
>> Though... is there anything a dying backend could do that would corrupt the
>> control segment to the point that it would screw up segments allocated by
>> other backends and not related to the dead backend? Like marking a slot as
>> not used when it is still in use and isn't associated with the dead backend?
>
> Sure. A messed-up backend can clobber the control segment just as it
> can clobber anything else in shared memory. There's really no way
> around that problem. If the control segment has been overwritten by a
> memory stomp, we can't use it to clean up. There's no way around that
> problem except to not use the control segment, which wouldn't be better.

Are we trying to protect against "memory stomps" when we restart after a backend dies? I thought we were just trying to ensure that all shared data structures were correct and consistent. If that's the case, then I was thinking that by using a pointer that can be updated in a CPU-atomic fashion we know we'd never end up with a corrupted entry that was in use; the partial write would be to a slot with nothing pointing at it so it could be safely reused.

Like I said before though, it may not be worth worrying about this case right now.

>> Should dsm_impl_op sanity check the arguments after op? I didn't notice
>> checks in the type-specific code but I also didn't read all of it... are we
>> just depending on the OS to sanity-check?
>
> Sanity-check for what?

Presumably there's limits to what the arguments can be rationally set to. IIRC there's nothing down-stream that's checking them in our code, so I'm guessing we're just depending on the kernel to sanity-check.
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-05 16:37:29
Message-ID: CA+TgmoaB9rDV+iqtEQgPuxOxDEUfrXK2K_+uZiX1wWzsvU-1pg@mail.gmail.com

On Wed, Sep 4, 2013 at 6:38 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> No, I meant something that would live as long as the postmaster and
>> die when it dies.
>
> ISTM that at some point we'll want to look at putting top-level shared
> memory into this system (ie: allowing dynamic resizing of GUCs that affect
> shared memory size).

A lot of people want that, but being able to resize the shared memory
chunk itself is only the beginning of the problem. So I wouldn't hold
my breath.

> Wouldn't it protect against a crash while writing the file? I realize the
> odds of that are pretty remote, but AFAIK it wouldn't cost that much to
> write a new file and do an atomic mv...

If there's an OS-level crash, we don't need the state file; the shared
memory will be gone anyway. And if it's a PostgreSQL-level failure,
this game neither helps nor hurts.

>> Sure. A messed-up backend can clobber the control segment just as it
>> can clobber anything else in shared memory. There's really no way
>> around that problem. If the control segment has been overwritten by a
>> memory stomp, we can't use it to clean up. There's no way around that
>> problem except to not use the control segment, which wouldn't be better.
>
> Are we trying to protect against "memory stomps" when we restart after a
> backend dies? I thought we were just trying to ensure that all shared data
> structures were correct and consistent. If that's the case, then I was
> thinking that by using a pointer that can be updated in a CPU-atomic fashion
> we know we'd never end up with a corrupted entry that was in use; the
> partial write would be to a slot with nothing pointing at it so it could be
> safely reused.

When we restart after a backend dies, shared memory contents are
completely reset, from scratch. This is true of both the fixed size
shared memory segment and of the dynamic shared memory control
segment. The only difference is that, with the dynamic shared memory
control segment, we need to use the segment for cleanup before
throwing it out and starting over. Extra caution is required because
we're examining memory that could hypothetically have been stomped on;
we must not let the postmaster do anything suicidal.

>>> Should dsm_impl_op sanity check the arguments after op? I didn't notice
>>> checks in the type-specific code but I also didn't read all of it... are
>>> we
>>> just depending on the OS to sanity-check?
>>
>> Sanity-check for what?
>
> Presumably there's limits to what the arguments can be rationally set to.
> IIRC there's nothing down-stream that's checking them in our code, so I'm
> guessing we're just depending on the kernel to sanity-check.

Pretty much. It's possible more thought is needed there, but the
shape of those additional thoughts is not clear to me at this time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-06 19:40:46
Message-ID: 522A2FBE.9090104@nasby.net

On 9/5/13 11:37 AM, Robert Haas wrote:
>> ISTM that at some point we'll want to look at putting top-level shared
>> memory into this system (ie: allowing dynamic resizing of GUCs that affect
>> shared memory size).
> A lot of people want that, but being able to resize the shared memory
> chunk itself is only the beginning of the problem. So I wouldn't hold
> my breath.

<starts breathing again>

>> Wouldn't it protect against a crash while writing the file? I realize the
>> odds of that are pretty remote, but AFAIK it wouldn't cost that much to
>> write a new file and do an atomic mv...
> If there's an OS-level crash, we don't need the state file; the shared
> memory will be gone anyway. And if it's a PostgreSQL-level failure,
> this game neither helps nor hurts.
>
>>> Sure. A messed-up backend can clobber the control segment just as it
>>> can clobber anything else in shared memory. There's really no way
>>> around that problem. If the control segment has been overwritten by a
>>> memory stomp, we can't use it to clean up. There's no way around that
>>> problem except to not use the control segment, which wouldn't be better.
>>
>> Are we trying to protect against "memory stomps" when we restart after a
>> backend dies? I thought we were just trying to ensure that all shared data
>> structures were correct and consistent. If that's the case, then I was
>> thinking that by using a pointer that can be updated in a CPU-atomic fashion
>> we know we'd never end up with a corrupted entry that was in use; the
>> partial write would be to a slot with nothing pointing at it so it could be
>> safely reused.
> When we restart after a backend dies, shared memory contents are
> completely reset, from scratch. This is true of both the fixed size
> shared memory segment and of the dynamic shared memory control
> segment. The only difference is that, with the dynamic shared memory
> control segment, we need to use the segment for cleanup before
> throwing it out and starting over. Extra caution is required because
> we're examining memory that could hypothetically have been stomped on;
> we must not let the postmaster do anything suicidal.

Not doing something suicidal is what I'm worried about (that and not cleaning up as well as possible).

The specific scenario I'm worried about is something like a PANIC in the middle of the snprintf call in dsm_write_state_file(). That would leave that file in a completely unknown state, so who knows what would then happen on restart. ISTM that writing a temp file and then doing a filesystem mv would eliminate that issue.

Or is it safe to assume that the snprintf call will be atomic since we're just spitting out a long?
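The write-temp-then-rename pattern suggested above can be sketched roughly as follows (a hypothetical helper with made-up names, not the patch's dsm_write_state_file):

```c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Hypothetical sketch of the write-then-rename idea: write the handle to a
 * temporary file, fsync it, then rename() it into place. POSIX guarantees
 * rename() is atomic, so a reader sees either the old file or the complete
 * new one, never a partial write. Names here are illustrative only. */
static int
write_state_file_atomically(const char *final_path, const char *tmp_path,
							unsigned long handle)
{
	char	buf[64];
	int		fd;
	int		nbytes;

	fd = open(tmp_path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	nbytes = snprintf(buf, sizeof(buf), "%lu\n", handle);
	if (write(fd, buf, nbytes) != nbytes || fsync(fd) != 0)
	{
		close(fd);
		unlink(tmp_path);
		return -1;
	}
	if (close(fd) != 0)
		return -1;
	/* The atomic step: publish the fully written file under its final name. */
	return rename(tmp_path, final_path);
}
```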
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-09 16:21:31
Message-ID: CA+TgmoZCRf0=DimouDt5jqc40aqX=N6L7jqL66O2QJ5oWS5ppQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 6, 2013 at 3:40 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> The specific scenario I'm worried about is something like a PANIC in the
> middle of the snprintf call in dsm_write_state_file(). That would leave that
> file in a completely unknown state so who knows what would then happen on
> restart. ISTM that writing a temp file and then doing a filesystem mv would
> eliminate that issue.

Doing an atomic rename would eliminate the possibility of seeing a
partially written file, but a partially written file is mostly
harmless: we'll interpret whatever bytes we see as an integer and try
to use that as a DSM key. Then we'll just see that no such shared
memory key exists (probably) or that we don't own it (probably) or
that it doesn't look like a valid control segment (probably) and
ignore it.

If someone does a kill -9 on the postmaster in the middle of the
write() that creates the file, and the partially written file
happens to identify another shared memory segment owned by the same
user ID with the correct magic number and header contents to be
interpreted as a control segment, then we will indeed erroneously blow
away that purported control segment and all other segments to which it
points. I suppose we can stick in a rename() there just to completely
rule out that scenario, but it's pretty bloody unlikely anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-13 19:32:36
Message-ID: CA+TgmobdaUEkXWEvfE5VA_W7YL1sX03X92rOJrWmG0g_Z5PDXg@mail.gmail.com
Lists: pgsql-hackers

OK, here's v2 of this patch, by myself and Amit Kapila. We made the
following changes:

- Added support for Windows. This necessitated adding an impl_private
parameter to dsm_impl_op.
- Since we had impl_private anyway, I used it to implement shm
identifier caching for the System V implementation. I like how that
turned out better than the previous version; YMMV.
- Fixed typo noted by Jim Nasby.
- Removed preferred_address parameter, per griping from Andres and Noah.
- Removed use of posix_fallocate, per recent commits.
- Added use of rename() so that we won't ever see a partially-written
state file, per griping by Jim Nasby.
- Added an overflow check so that if a user of a 32-bit system asks
for 4.1GB of dynamic shared memory, they get an error instead of
getting .1GB of memory.
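The overflow check in the last item guards against a real truncation hazard: on a 32-bit build, a uint64 request larger than the 4GB address space wraps when it reaches a size_t parameter. A minimal sketch of the idea (the helper name is invented, not the patch's code):

```c
#include <stdint.h>

/* Sketch of the overflow check described above. Before a uint64 request
 * is handed to OS primitives that take size_t, verify that it fits;
 * otherwise a 4.1GB request on a 32-bit system would silently wrap to
 * roughly 0.1GB. The second argument stands in for the platform's
 * SIZE_MAX so the 32-bit case can be exercised anywhere. */
static int
dsm_request_fits(uint64_t request_size, uint64_t size_t_max)
{
	return request_size <= size_t_max;
}
```

In the real code the comparison would be against the platform's actual SIZE_MAX; it is a parameter here only so the 32-bit case is testable on any machine.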

Despite Andres's comments, I did not remove the mmap implementation or
the GUC that allows users to select which implementation they care to
use. I still think those things are useful. While I appreciate that
there's a marginal cost in complexity to each new GUC, I also don't
think it pays to get too cheap. There's a difference between not
requiring users to configure things that they shouldn't have to
configure, and not letting them configure things they might want to
configure. Besides, having a GUC also provides a way of turning the
feature completely off, which seems justified, at least for now, on
the basis of the (ahem) limited number of users of this facility.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
dynshmem-v2.patch application/octet-stream 84.3 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-18 17:42:02
Message-ID: 20130918174202.GD21051@awork2.anarazel.de
Lists: pgsql-hackers

Hi Robert, Hi Amit,

Ok, first read through the patch.

On 2013-09-13 15:32:36 -0400, Robert Haas wrote:
> -AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
> +AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])

Maybe also check for shm_unlink or is that too absurd?

> --- /dev/null
> +++ b/src/backend/storage/ipc/dsm.c
> +#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
> +#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"

Hm, I guess you don't want to add it to global/ or so because of the
mmap implementation where you presumably scan the directory?

> +struct dsm_segment
> +{
> + dlist_node node; /* List link in dsm_segment_list. */
> + ResourceOwner resowner; /* Resource owner. */
> + dsm_handle handle; /* Segment name. */
> + uint32 control_slot; /* Slot in control segment. */
> + void *impl_private; /* Implementation-specific private data. */
> + void *mapped_address; /* Mapping address, or NULL if unmapped. */
> + uint64 mapped_size; /* Size of our mapping. */
> +};

Document that's backend local?

> +typedef struct dsm_control_item
> +{
> + dsm_handle handle;
> + uint32 refcnt; /* 2+ = active, 1 = moribund, 0 = gone */
> +} dsm_control_item;
> +
> +typedef struct dsm_control_header
> +{
> + uint32 magic;
> + uint32 nitems;
> + uint32 maxitems;
> + dsm_control_item item[FLEXIBLE_ARRAY_MEMBER];
> +} dsm_control_header;

And those are shared memory?

> +static void dsm_cleanup_using_control_segment(void);
> +static void dsm_cleanup_for_mmap(void);
> +static bool dsm_read_state_file(dsm_handle *h);
> +static void dsm_write_state_file(dsm_handle h);
> +static void dsm_postmaster_shutdown(int code, Datum arg);
> +static void dsm_backend_shutdown(int code, Datum arg);
> +static dsm_segment *dsm_create_descriptor(void);
> +static bool dsm_control_segment_sane(dsm_control_header *control,
> + uint64 mapped_size);
> +static uint64 dsm_control_bytes_needed(uint32 nitems);
> +
> +/* Has this backend initialized the dynamic shared memory system yet? */
> +static bool dsm_init_done = false;
> +
> +/*
> + * List of dynamic shared memory segments used by this backend.
> + *
> + * At process exit time, we must decrement the reference count of each
> + * segment we have attached; this list makes it possible to find all such
> + * segments.
> + *
> + * This list should always be empty in the postmaster. We could probably
> + * allow the postmaster to map dynamic shared memory segments before it
> + * begins to start child processes, provided that each process adjusted
> + * the reference counts for those segments in the control segment at
> + * startup time, but there's no obvious need for such a facility, which
> + * would also be complex to handle in the EXEC_BACKEND case. Once the
> + * postmaster has begun spawning children, there's an additional problem:
> + * each new mapping would require an update to the control segment,
> + * which requires locking, in which the postmaster must not be involved.
> + */
> +static dlist_head dsm_segment_list = DLIST_STATIC_INIT(dsm_segment_list);
> +
> +/*
> + * Control segment information.
> + *
> + * Unlike ordinary shared memory segments, the control segment is not
> + * reference counted; instead, it lasts for the postmaster's entire
> + * life cycle. For simplicity, it doesn't have a dsm_segment object either.
> + */
> +static dsm_handle dsm_control_handle;
> +static dsm_control_header *dsm_control;
> +static uint64 dsm_control_mapped_size = 0;
> +static void *dsm_control_impl_private = NULL;
> +
> +/*
> + * Start up the dynamic shared memory system.
> + *
> + * This is called just once during each cluster lifetime, at postmaster
> + * startup time.
> + */
> +void
> +dsm_postmaster_startup(void)
> +{
> + void *dsm_control_address = NULL;
> + uint32 maxitems;
> + uint64 segsize;
> +
> + Assert(!IsUnderPostmaster);
> +
> + /* If dynamic shared memory is disabled, there's nothing to do. */
> + if (dynamic_shared_memory_type == DSM_IMPL_NONE)
> + return;
> +
> + /*
> + * Check for, and remove, shared memory segments left behind by a dead
> + * postmaster. This isn't necessary on Windows, which always removes them
> + * when the last reference is gone.
> + */
> + switch (dynamic_shared_memory_type)
> + {
> + case DSM_IMPL_POSIX:
> + case DSM_IMPL_SYSV:
> + dsm_cleanup_using_control_segment();
> + break;
> + case DSM_IMPL_MMAP:
> + dsm_cleanup_for_mmap();
> + break;
> + case DSM_IMPL_WINDOWS:
> + /* Nothing to do. */
> + break;
> + default:
> + elog(ERROR, "unknown dynamic shared memory type: %d",
> + dynamic_shared_memory_type);
> + }
> +
> + /* Determine size for new control segment. */
> + maxitems = PG_DYNSHMEM_FIXED_SLOTS
> + + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

It seems likely that MaxConnections would be sufficient?

> + elog(DEBUG2, "dynamic shared memory system will support %lu segments",
> + (unsigned long) maxitems);
> + segsize = dsm_control_bytes_needed(maxitems);
> +
> + /* Create new control segment. */
> + for (;;)
> + {
> + Assert(dsm_control_address == NULL);
> + Assert(dsm_control_mapped_size == 0);
> + dsm_control_handle = random();
> + if (dsm_impl_op(DSM_OP_CREATE, dsm_control_handle, segsize,
> + &dsm_control_impl_private, &dsm_control_address,
> + &dsm_control_mapped_size, ERROR))
> + break;
> + }

Comment that we loop endlessly to find an unused identifier.

Why do we create the control segment in dynamic smem and not in the
normal shmem? Presumably because this way it has the same lifetime? If
so, that should be commented upon.

> +static void
> +dsm_cleanup_using_control_segment(void)
> +{
> + /*
> + * We've managed to reattach it, but the contents might not be sane.
> + * If they aren't, we disregard the segment after all.
> + */
> + old_control = (dsm_control_header *) mapped_address;
> + if (!dsm_control_segment_sane(old_control, mapped_size))
> + {
> + dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
> + &mapped_address, &mapped_size, LOG);
> + return;
> + }

So, we leave it just hanging around... Well, it has precedent in our
normal shared memory handling.

> +static void
> +dsm_cleanup_for_mmap(void)
> +{
...
> +}

I still maintain that the extra infrastructure required isn't worth the
gain of having the mmap implementation.

> +/*
> + * Read and parse the state file.
> + *
> + * If the state file is empty or the contents are garbled, it probably means
> + * that the operating system rebooted before the data written by the previous
> + * postmaster made it to disk. In that case, we can just ignore it; any shared
> + * memory from before the reboot should be gone anyway.
> + */
> +static bool
> +dsm_read_state_file(dsm_handle *h)
> +{
...
> + return true;
> +}

Perhaps CRC32 the content?

> +/*
> + * Write our control segment handle to the state file, so that if the
> + * postmaster is killed without running its on_shmem_exit hooks, the
> + * next postmaster can clean things up after restart.
> + */
> +static void
> +dsm_write_state_file(dsm_handle h)
> +{
> + int statefd;
> + char statebuf[PG_DYNSHMEM_STATE_BUFSIZ];
> + int nbytes;
> +
> + /* Create or truncate the file. */
> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);

Doesn't this need a | PG_BINARY? Why are you using open() and not
BasicOpenFile or even OpenTransientFile?

> + /* Write contents. */
> + snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
> + (unsigned long) dsm_control_handle);

Why are we upcasting the length of dsm_control_handle here? Also,
doesn't this need the usual UINT64_FORMAT thingy?

> +/*
> + * At shutdown time, we iterate over the control segment and remove all
> + * remaining dynamic shared memory segments. We avoid throwing errors here;
> + * the postmaster is shutting down either way, and this is just non-critical
> + * resource cleanup.
> + */
> +static void
> +dsm_postmaster_shutdown(int code, Datum arg)
> +{
> + uint32 nitems;
> + uint32 i;
> + void *dsm_control_address;
> + void *junk_mapped_address = NULL;
> + void *junk_impl_private = NULL;
> + uint64 junk_mapped_size = 0;
> +
> + /* If dynamic shared memory is disabled, there's nothing to do. */
> + if (dynamic_shared_memory_type == DSM_IMPL_NONE)
> + return;

But we don't even get called in that case, right? You're only registering
the on_shmem_exit() handler after the same check in startup?

> + /*
> + * If some other backend exited uncleanly, it might have corrupted the
> + * control segment while it was dying. In that case, we warn and ignore
> + * the contents of the control segment. This may end up leaving behind
> + * stray shared memory segments, but there's not much we can do about
> + * that if the metadata is gone.
> + */
> + nitems = dsm_control->nitems;
> + if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
> + {
> + ereport(LOG,
> + (errmsg("dynamic shared memory control segment is corrupt")));
> + return;
> + }

I'd rename dsm_control_segment_sane to dsm_control_segment_looks_sane ;)

> + /* Remove any remaining segments. */
> + for (i = 0; i < nitems; ++i)
> + {
> + dsm_handle handle;
> +
> + /* If the reference count is 0, the slot is actually unused. */
> + if (dsm_control->item[i].refcnt == 0)
> + continue;
> +
> + /* Log debugging information. */
> + handle = dsm_control->item[i].handle;
> + elog(DEBUG2, "cleaning up orphaned dynamic shared memory with ID %lu",
> + (unsigned long) handle);

I'd include the refcount here, might be helpful for debugging.

> + /* Destroy the segment. */
> + dsm_impl_op(DSM_OP_DESTROY, handle, 0, &junk_impl_private,
> + &junk_mapped_address, &junk_mapped_size, LOG);
> + }
> +
> + /* Remove the control segment itself. */
> + elog(DEBUG2,
> + "cleaning up dynamic shared memory control segment with ID %lu",
> + (unsigned long) dsm_control_handle);
> + dsm_control_address = dsm_control;
> + dsm_impl_op(DSM_OP_DESTROY, dsm_control_handle, 0,
> + &dsm_control_impl_private, &dsm_control_address,
> + &dsm_control_mapped_size, LOG);
> + dsm_control = dsm_control_address;
> +
> + /* And, finally, remove the state file. */
> + if (unlink(PG_DYNSHMEM_STATE_FILE) < 0)
> + ereport(LOG,
> + (errcode_for_file_access(),
> + errmsg("could not unlink file \"%s\": %m",
> + PG_DYNSHMEM_STATE_FILE)));
> +}

Not sure whether it's sensible to only LOG in these cases. After all
there's something unexpected happening. The robustness argument doesn't
count since we're already shutting down.

> +/*
> + * Prepare this backend for dynamic shared memory usage. Under EXEC_BACKEND,
> + * we must reread the state file and map the control segment; in other cases,
> + * we'll have inherited the postmaster's mapping and global variables.
> + */
> +static void
> +dsm_backend_startup(void)
> +{
> +
> +#ifdef EXEC_BACKEND
> + {
> + dsm_handle control_handle;
> + void *control_address = NULL;
> +
> + /* Read the control segment information from the state file. */
> + if (!dsm_read_state_file(&control_handle))
> + ereport(ERROR,
> + (errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("could not parse dynamic shared memory state file")));
> +
> + /* Attach control segment. */
> + dsm_impl_op(DSM_OP_ATTACH, control_handle, 0,
> + &dsm_control_impl_private, &control_address,
> + &dsm_control_mapped_size, ERROR);
> + dsm_control_handle = control_handle;
> + dsm_control = control_address;
> + /* If control segment doesn't look sane, something is badly wrong. */
> + if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
> + {
> + dsm_impl_op(DSM_OP_DETACH, control_handle, 0,
> + &dsm_control_impl_private, &control_address,
> + &dsm_control_mapped_size, WARNING);
> + ereport(ERROR,
> + (errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("dynamic shared memory control segment is not valid")));
> + }

Imo that's a PANIC or at the very least a FATAL.

> +
> +/*
> + * Create a new dynamic shared memory segment.
> + */
> +dsm_segment *
> +dsm_create(uint64 size)
> +{
...
> + /* Verify that we can support an additional mapping. */
> + if (nitems >= dsm_control->maxitems)
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
> + errmsg("too many dynamic shared memory segments")));

Do we rely on being run in an environment with proper setup for lwlock
cleanup? I can imagine shared libraries doing this pretty early on...

> +dsm_segment *
> +dsm_attach(dsm_handle h)
> +{

> + /* Bump reference count for this segment in shared memory. */
> + LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
> + nitems = dsm_control->nitems;
> + for (i = 0; i < nitems; ++i)
> + {
> + /* If the reference count is 0, the slot is actually unused. */
> + if (dsm_control->item[i].refcnt == 0)
> + continue;
> +
> + /*
> + * If the reference count is 1, the slot is still in use, but the
> + * segment is in the process of going away. Treat that as if we
> + * didn't find a match.
> + */
> + if (dsm_control->item[i].refcnt == 1)
> + break;

Why can we see that state? Shouldn't locking prevent that?

> +/*
> + * Resize an existing shared memory segment.
> + *
> + * This may cause the shared memory segment to be remapped at a different
> + * address. For the caller's convenience, we return the mapped address.
> + */
> +void *
> +dsm_resize(dsm_segment *seg, uint64 size)
> +{
> + Assert(seg->control_slot != INVALID_CONTROL_SLOT);
> + dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
> + &seg->mapped_address, &seg->mapped_size, ERROR);
> + return seg->mapped_address;
> +}

Hm. That's valid when there are other backends attached? What are the
implications for already attached ones?

Shouldn't we error out if !dsm_impl_can_resize()?

> +/*
> + * Detach from a shared memory segment, destroying the segment if we
> + * remove the last reference.
> + *
> + * This function should never fail. It will often be invoked when aborting
> + * a transaction, and a further error won't serve any purpose. It's not a
> + * complete disaster if we fail to unmap or destroy the segment; it means a
> + * resource leak, but that doesn't necessarily preclude further operations.
> + */
> +void
> +dsm_detach(dsm_segment *seg)
> +{

Why do we want to ignore errors like failing to unmap? ISTM that
indicates an actual problem...

> + /* Reduce reference count, if we previously increased it. */
> + if (seg->control_slot != INVALID_CONTROL_SLOT)
> + {
> + uint32 refcnt;
> + uint32 control_slot = seg->control_slot;
> +
> + LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
> + Assert(dsm_control->item[control_slot].handle == seg->handle);
> + Assert(dsm_control->item[control_slot].refcnt > 1);
> + refcnt = --dsm_control->item[control_slot].refcnt;
> + seg->control_slot = INVALID_CONTROL_SLOT;
> + LWLockRelease(DynamicSharedMemoryControlLock);
> +
> + /* If new reference count is 1, try to destroy the segment. */
> + if (refcnt == 1)
> + {
> + /*
> + * If we fail to destroy the segment here, or are killed before
> + * we finish doing so, the reference count will remain at 1, which
> + * will mean that nobody else can attach to the segment. At
> + * postmaster shutdown time, or when a new postmaster is started
> + * after a hard kill, another attempt will be made to remove the
> + * segment.
> + *
> + * The main case we're worried about here is being killed by
> + * a signal before we can finish removing the segment. In that
> + * case, it's important to be sure that the segment still gets
> + * removed. If we actually fail to remove the segment for some
> + * other reason, the postmaster may not have any better luck than
> + * we did. There's not much we can do about that, though.
> + */
> + if (dsm_impl_op(DSM_OP_DESTROY, seg->handle, 0, &seg->impl_private,
> + &seg->mapped_address, &seg->mapped_size, WARNING))
> + {
> + LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
> + Assert(dsm_control->item[control_slot].handle == seg->handle);
> + Assert(dsm_control->item[control_slot].refcnt == 1);
> + dsm_control->item[control_slot].refcnt = 0;
> + LWLockRelease(DynamicSharedMemoryControlLock);
> + }
> + }
> + }

Yuck. So that's the answer to my earlier question about the legality of
seeing a refcount of 1....

> diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
> new file mode 100644
> index 0000000..8005fd9
> @@ -0,0 +1,986 @@
> +/*-------------------------------------------------------------------------
> + *
> + * dsm_impl.c
> + * manage dynamic shared memory segments
> + *
> + * This file provides low-level APIs for creating and destroying shared
> + * memory segments using several different possible techniques. We refer
> + * to these segments as dynamic because they can be created, altered, and
> + * destroyed at any point during the server life cycle. This is unlike
> + * the main shared memory segment, of which there is always exactly one
> + * and which is always mapped at a fixed address in every PostgreSQL
> + * background process.
> + *
> + * Because not all systems provide the same primitives in this area, nor
> + * do all primitives behave the saem way on all systems, we provide

*same

> + * several implementations of this facility. Many systems implement
> + * POSIX shared memory (shm_open etc.), which is well-suited to our needs
> + * in this area, with the exception that shared memory identifiers live
> + * in a flat system-wide namespace, raising the uncomfortable prospect of
> + * name collisions with other processes (including other copies of
> + * PostgreSQL) running on the same system.

Why isn't the port number part of the posix shmem identifiers? Sure, we
retry, but using a logic similar to sysv_shmem.c seems like a good idea.

> +/*
> + * Does the current dynamic shared memory implementation support resizing
> + * segments? (The answer here could be platform-dependent in the future,
> + * since AIX allows shmctl(shmid, SHM_RESIZE, &buffer), though you apparently
> + * can't resize segments to anything larger than 256MB that way. For now,
> + * we keep it simple.)
> + */
> +bool
> +dsm_impl_can_resize(void)
> +{
> + switch (dynamic_shared_memory_type)
> + {
> + case DSM_IMPL_NONE:
> + return false;
> + case DSM_IMPL_SYSV:
> + return false;
> + case DSM_IMPL_WINDOWS:
> + return false;
> + default:
> + return true;
> + }
> +}

Looks to me like the logic should be the reverse.
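Reversed, the function would whitelist the implementations known to support resizing and default to false, so a newly added implementation is conservatively treated as non-resizable. A self-contained sketch (enum and names simplified from the patch's DSM_IMPL_* values):

```c
#include <stdbool.h>

/* Simplified stand-ins for the patch's DSM_IMPL_* values. */
typedef enum { IMPL_NONE, IMPL_POSIX, IMPL_SYSV, IMPL_WINDOWS, IMPL_MMAP } impl_t;

/* Reverse of the quoted logic: enumerate the resizable implementations
 * and let everything else, including future additions, default to false. */
static bool
impl_can_resize(impl_t t)
{
	switch (t)
	{
		case IMPL_POSIX:
		case IMPL_MMAP:
			return true;
		default:
			return false;
	}
}
```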

> +#ifdef USE_DSM_POSIX
> +/*
> + * Operating system primitives to support POSIX shared memory.
> + *
> + * POSIX shared memory segments are created and attached using shm_open()
> + * and shm_unlink(); other operations, such as sizing or mapping the
> + * segment, are performed as if the shared memory segments were files.
> + *
> + * Indeed, on some platforms, they may be implemented that way. While
> + * POSIX shared memory segments seem intended to exist in a flat namespace,
> + * some operating systems may implement them as files, even going so far
> + * to treat a request for /xyz as a request to create a file by that name
> + * in the root directory. Users of such broken platforms should select
> + * a different shared memory implementation.
> + */
> +static bool
> +dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
> + void **impl_private, void **mapped_address, uint64 *mapped_size,
> + int elevel)
> +{
> + char name[64];
> + int flags;
> + int fd;
> + char *address;
> +
> + snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);

Why wider than the handle?

> + /*
> + * If we're reattaching or resizing, we must remove any existing mapping,
> + * unless we've already got the right thing mapped.
> + */
> + if (*mapped_address != NULL)
> + {
> + if (*mapped_size == request_size)
> + return true;

Hm. It could have gotten resized to the old size, or resized twice. In
that case it might not be at the same address as before, so checking the
size doesn't seem to be sufficient.

> +static bool
> +dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
> + void **impl_private, void **mapped_address, uint64 *mapped_size,
> + int elevel)
> +{

> + /*
> + * There's one special key, IPC_PRIVATE, which can't be used. If we end
> + * up with that value by chance during a create operation, just pretend
> + * it already exists, so that caller will retry. If we run into it
> + * anywhere else, the caller has passed a handle that doesn't correspond
> + * to anything we ever created, which should not happen.
> + */
> + if (key == IPC_PRIVATE)
> + {
> + if (op != DSM_OP_CREATE)
> + elog(elevel, "System V shared memory key may not be IPC_PRIVATE");
> + errno = EEXIST;
> + return false;
> + }

Hm. You're elog(elevel) here, but the retry code in dsm_create() passes
in ERROR?

> +static bool
> +dsm_impl_mmap(dsm_op op, dsm_handle handle, uint64 request_size,
> + void **impl_private, void **mapped_address, uint64 *mapped_size,
> + int elevel)
> +{

> + /*
> + * If we're reattaching or resizing, we must remove any existing mapping,
> + * unless we've already got the right thing mapped.
> + */
> + if (*mapped_address != NULL)
> + {
> + if (*mapped_size == request_size)
> + return true;

Same thing as in posix shmem.

> +static int
> +errcode_for_dynamic_shared_memory()
> +{
> + if (errno == EFBIG || errno == ENOMEM)
> + return errcode(ERRCODE_OUT_OF_MEMORY);
> + else
> + return errcode_for_file_access();
> +}

Is EFBIG guaranteed to be defined?

> +/*
> + * Forget that a temporary file is owned by a ResourceOwner
> + */
> +void
> +ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
> +{
> + dsm_segment **dsms = owner->dsms;
> + int ns1 = owner->ndsms - 1;
> + int i;
> +
> + for (i = ns1; i >= 0; i--)
> + {
> + if (dsms[i] == seg)
> + {
> + while (i < ns1)
> + {
> + dsms[i] = dsms[i + 1];
> + i++;
> + }
> + owner->ndsms = ns1;
> + return;
> + }
> + }
> + elog(ERROR,
> + "dynamic shared memory segment %lu is not owned by resource owner %s",
> + (unsigned long) dsm_segment_handle(seg), owner->name);
> +}

Not really an issue, but this will grow owner->dsm unnecessarily because
ResourceOwnerEnlargeDSMs() will have been done previously.

Not sure yet how happy I am with the separation of concerns between
dsm.c and dsm_impl.c...

That's it for now.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-19 15:44:34
Message-ID: CA+TgmobKLeC8nYWaa6aTW5gvXktuRf6TRttoJbYroi=Nno4qUw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Maybe also check for shm_unlink or is that too absurd?

Well, if we find that there are systems where shm_open is present
and shm_unlink is not, then we'll need such a test. But I
hesitate to decide what the right thing to do on such systems without
knowing more. And the time it takes to run configure is a very
substantial percentage of the complete build time, at least on my box,
so I don't really want to add tests just because they might be needed
somewhere.

>> --- /dev/null
>> +++ b/src/backend/storage/ipc/dsm.c
>> +#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
>> +#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
>
> Hm, I guess you don't want to add it to global/ or so because of the
> mmap implementation where you presumably scan the directory?

Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just
skip that directory completely. Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.

> Document that's backend local?
> And those are shared memory?

Sure, done.

>> + /* Determine size for new control segment. */
>> + maxitems = PG_DYNSHMEM_FIXED_SLOTS
>> + + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
>
> It seems likely that MaxConnections would be sufficient?

I think we could argue about the best way to set this until the cows
come home, but I don't think it probably matters much at this point.
We can always change the formula later as we gain experience.
However, I don't have a principled reason for assuming that only
user-connected backends will create dynamic shared memory segments.

> Comment that we loop endlessly to find a unused identifier.

Done.

> Why do we create the control segment in dynamic smem and not in the
> normal shmem? Presumably because this way it has the same lifetime? If
> so, that should be commented upon.

If it were part of the normal shared memory segment, I don't think
there'd be any good way to implement
dsm_cleanup_using_control_segment().

> So, we leave it just hanging around... Well, it has precedent in our
> normal shared memory handling.

And, it could belong to some unrelated process.

> I still maintain that the extra infrastructure required isn't worth the
> gain of having the mmap implementation.

I know.

>> +/*
>> + * Read and parse the state file.
>> + *
>
> Perhaps CRC32 the content?

I don't see the point. If the file contents are garbage that happens
to look like a number, we'll go "oh, there isn't any such segment" or
"oh, there is such a segment but it doesn't look like a control
segment, so forget it". There are a lot of things we really ought to
be CRCing to avoid corruption risk, but I can't see how this is
remotely one of them.

>> + /* Create or truncate the file. */
>> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>
> Doesn't this need a | PG_BINARY?

It's a text file. Do we need PG_BINARY anyway?

> Why are you using open() and not
> BasicOpenFile or even OpenTransientFile?

Because those don't work in the postmaster.

>
>> + /* Write contents. */
>> + snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
>> + (unsigned long) dsm_control_handle);
>
> Why are we upcasting the length of dsm_control_handle here? Also,
> doesn't this need the usual UINT64_FORMAT thingy?

dsm_handle is an alias for uint32. Is that always exactly an unsigned
int or can it sometimes be an unsigned long? I thought the latter, so
couldn't figure out how to write this portably without casting to a
type that explicitly matched the format string.
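Both options can be seen side by side in a small sketch: the cast-to-unsigned-long approach the patch uses, and C99's <inttypes.h> macro, which encodes the exact width in the format string (function names here are illustrative, not from the patch):

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <inttypes.h>

/* Print a uint32 handle portably by casting to a type whose format
 * specifier is known: unsigned long is at least 32 bits on every
 * supported platform, so %lu plus a cast is always safe. */
static int
format_handle_cast(char *buf, size_t bufsize, uint32_t handle)
{
	return snprintf(buf, bufsize, "%lu\n", (unsigned long) handle);
}

/* The C99 alternative: PRIu32 expands to the correct specifier for
 * uint32_t on the current platform, so no cast is needed. */
static int
format_handle_c99(char *buf, size_t bufsize, uint32_t handle)
{
	return snprintf(buf, bufsize, "%" PRIu32 "\n", handle);
}
```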

> But we don't even get called in that case, right? You're only registering
> the on_shmem_exit() handler after the same check in startup?

True. Removed.

> I'd rename dsm_control_segment_sane to dsm_control_segment_looks_sane ;)

Meh. I write too many really long function names as it is.

> I'd include the refcount here, might be helpful for debugging.

OK, done.

> Not sure whether it's sensible to only LOG in these cases. After all
> there's something unexpected happening. The robustness argument doesn't
> count since we're already shutting down.

I see no point in throwing an error. The fact that we're having
trouble cleaning up one dynamic shared memory segment doesn't mean we
shouldn't try to clean up others, or that any remaining postmaster
shutdown hooks shouldn't be executed.

>> + ereport(ERROR,
>> + (errcode(ERRCODE_INTERNAL_ERROR),
>> + errmsg("dynamic shared memory control segment is not valid")));
>
> Imo that's a PANIC or at the very least a FATAL.

Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point. There's no data corruption problem if we proceed
here. Most likely either (1) there's a bug in the code, which
panicking won't fix or (2) the DBA hand-edited the state file, in
which case maybe he shouldn't have done that, but if he thinks the
best way to recover from that is a cluster-wide restart, he can do
that himself.

>> +dsm_segment *
>> +dsm_create(uint64 size)
>
> Do we rely on being run in an environment with proper setup for lwlock
> cleanup? I can imagine shared libraries doing this pretty early on...

Yes, we rely on that. I don't really see that as a problem. You'd
better connect to the main shared memory segment before starting to
create your own.

> Why can we see that state? Shouldn't locking prevent that?

We initialize the refcnt to 2 on creation, so that it ends up as 1
when the last reference is gone. This prevents race conditions:
imagine that one process creates a DSM and then passes the handle to
some other process. But, before that second process gets around to
mapping it, the first process dies. At that point, there are no
remaining references to the dynamic shared memory segment, so it goes
away, as per spec. But what if we're in the middle of destroying when
the second process finally gets around to trying to map it? In such a
situation, the behavior would be implementation-specific. I felt that
it was better not to find out whether all operating systems and shared
memory types handle such cases similarly, because I'm pretty sure they
don't. The two-stage destruction process means that once we've
committed to destroying the segment, no one else will attempt to map
it.

>> +void *
>> +dsm_resize(dsm_segment *seg, uint64 size)
>> +{
>> + Assert(seg->control_slot != INVALID_CONTROL_SLOT);
>> + dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
>> + &seg->mapped_address, &seg->mapped_size, ERROR);
>> + return seg->mapped_address;
>> +}
>
> Hm. That's valid when there are other backends attached? What are the
> implications for already attached ones?

They'll continue to see the portion they have mapped, but must do
dsm_remap() if they want to see the whole thing.

> Shouldn't we error out if !dsm_impl_can_resize()?

The implementation-specific code throws an error if it can't support
resize. Even if we put a secondary check here, I wouldn't want
dsm_impl_op to behave in an undefined manner when asked to resize
under an implementation that can't. And there doesn't seem to be much
point in having two checks.

>> +void
>> +dsm_detach(dsm_segment *seg)
>> +{
>
> Why do we want to ignore errors like failing to unmap? ISTM that
> indicates an actual problem...

Sure it does. But what are you going to do about it? In many cases,
you're going to get here during a transaction abort caused by some
other error. If the transaction is already aborting, throwing an
error here will just cause the original error to get discarded in
favor of showing this one, or maybe it's the other way around. I
don't remember, but it's definitely one or the other, and neither is
desirable. Throwing a warning, on the other hand, will notify the
user, which is what we want.

Now on the flip side we might not be aborting; maybe we're committing.
But we don't want to turn a commit into an abort just for this. If
resowner.c detects a buffer pin leak or a tuple descriptor leak, those
are "just" warnings as well. They're serious warnings, of course, and
if they happen it means there's a bug in the code that needs to be
fixed. But the severity of an ereport() isn't based just on how
alarming the situation is; it's based on what you want to happen when
that situation comes up. And we've decided (correctly, I think) that
resource leaks are not grounds for aborting a transaction that
otherwise would have committed.

> Yuck. So that's the answer to my earlier question about the legality of
> seeing a refcount of 1....

Read it and weep.

> *same

Fixed, thanks.

>> + * several implementations of this facility. Many systems implement
>> + * POSIX shared memory (shm_open etc.), which is well-suited to our needs
>> + * in this area, with the exception that shared memory identifiers live
>> + * in a flat system-wide namespace, raising the uncomfortable prospect of
>> + * name collisions with other processes (including other copies of
>> + * PostgreSQL) running on the same system.
>
> Why isn't the port number part of the posix shmem identifiers? Sure, we
> retry, but using a logic similar to sysv_shmem.c seems like a good idea.

According to the man page for shm_open on Solaris, "For maximum
portability, name should include no more than 14 characters, but this
limit is not enforced."

http://www.unix.com/man-page/OpenSolaris/3c/shm_open/

I'm unclear whether there are any real systems that have a problem with this.

>> +bool
>> +dsm_impl_can_resize(void)
>> +{
>> + switch (dynamic_shared_memory_type)
>> + {
>> + case DSM_IMPL_NONE:
>> + return false;
>> + case DSM_IMPL_SYSV:
>> + return false;
>> + case DSM_IMPL_WINDOWS:
>> + return false;
>> + default:
>> + return true;
>> + }
>> +}
>
> Looks to me like the logic should be the reverse.

I've changed it to list all the cases explicitly.

>> + char name[64];
>> + int flags;
>> + int fd;
>> + char *address;
>> +
>> + snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);
>
> Why wider than the handle?

Same as above - not sure that uint32 == unsigned int everywhere.

>> + /*
>> + * If we're reattaching or resizing, we must remove any existing mapping,
>> + * unless we've already got the right thing mapped.
>> + */
>> + if (*mapped_address != NULL)
>> + {
>> + if (*mapped_size == request_size)
>> + return true;
>
> Hm. It could have gotten resized to the old size, or resized twice. In
> that case it might not be at the same address before, so checking for
> the size doesn't seem to be sufficient.

I don't understand your concern. If someone resizes the DSM to its
already-current size, there is no need to remap it. The old mapping
is just fine. And if some other backend resizes the DSM to a larger
size and then back to the original size, and then we're asked to
update the mapping, there is no need to change anything.

> Hm. You're elog(elevel) here, but the retry code in dsm_create() passes
> in ERROR?

Oh, that's bad. Fixed.

>> +static int
>> +errcode_for_dynamic_shared_memory()
>> +{
>> + if (errno == EFBIG || errno == ENOMEM)
>> + return errcode(ERRCODE_OUT_OF_MEMORY);
>> + else
>> + return errcode_for_file_access();
>> +}
>
> Is EFBIG guaranteed to be defined?

I dunno. We could put an #ifdef around that part. Should we do that
now or wait and see if it actually breaks anywhere?

> Not really an issue, but this will grow owner->dsm unnecessarily because
> ResourceOwnerEnlargeDSMs() will have been done previously.

I tried to copy the existing uses of resowner.c as closely as
possible; if you think that there's something I should be doing to
mimic it more closely, let me know.

> Not sure yet how happy I am with the separation of concerns between
> dsm.c and dsm_impl.c...

I was hoping for a cleaner abstraction break, but I couldn't make it
work out any better than this. Even so, I think it's worthwhile
having two files; an imperfect separation of concerns still seems
better than concatenating them into one really long file. (FWIW, the
combined file would be longer than 84% of the 622 .c files in the
backend; as a fan of keeping .c files relatively small, I'm not eager
to be the cause of us having more large ones.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
dynshmem-v3.patch application/octet-stream 84.6 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-20 11:44:45
Message-ID: 20130920114445.GB5287@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
> On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> >> --- /dev/null
> >> +++ b/src/backend/storage/ipc/dsm.c
> >> +#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
> >> +#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
> >
> > Hm, I guess you don't want to add it to global/ or so because of the
> > mmap implementation where you presumably scan the directory?
>
> Yes, and also because I thought this way would make it easier to teach
> things like pg_basebackup (or anybody's home-brew scripts) to just
> skip that directory completely. Actually, I was wondering if we ought
> to have a directory under pgdata whose explicit charter it was to
> contain files that shouldn't be copied as part of a base backup.
> pg_do_not_back_this_up.

Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?

> >> + /* Determine size for new control segment. */
> >> + maxitems = PG_DYNSHMEM_FIXED_SLOTS
> >> + + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
> >
> > It seems likely that MaxConnections would be sufficient?
>
> I think we could argue about the best way to set this until the cows
> come home, but I don't think it probably matters much at this point.
> We can always change the formula later as we gain experience.
> However, I don't have a principled reason for assuming that only
> user-connected backends will create dynamic shared memory segments.

Hm, yes. I had MaxBackends down as MaxConnections + autovacuum stuff;
but max_worker_processes is in there now, so you're right that doesn't
make sense.

> >> +/*
> >> + * Read and parse the state file.
> >> + *
> >
> > Perhaps CRC32 the content?
>
> I don't see the point. If the file contents are garbage that happens
> to look like a number, we'll go "oh, there isn't any such segment" or
> "oh, there is such a segment but it doesn't look like a control
> segment, so forget it". There are a lot of things we really ought to
> be CRCing to avoid corruption risk, but I can't see how this is
> remotely one of them.

I was worried about a partially written file or one containing contents
from two different postmaster cycles, but it's actually far too small
for that...
I initially had thought you'd write the contents of the entire shared
control segment there, not just its ID.

> >> + /* Create or truncate the file. */
> >> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
> >
> > Doesn't this need a | PG_BINARY?
>
> It's a text file. Do we need PG_BINARY anyway?

I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.

> > Why are you using open() and not
> > BasicOpenFile or even OpenTransientFile?
>
> Because those don't work in the postmaster.

Oh, that's news to me. Seems strange, especially for BasicOpenFile.

> >
> >> + /* Write contents. */
> >> + snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
> >> + (unsigned long) dsm_control_handle);
> >
> > Why are we upcasting the length of dsm_control_handle here? Also,
> > doesn't this need the usual UINT64_FORMAT thingy?
>
> dsm_handle is an alias for uint32. Is that always exactly an unsigned
> int or can it sometimes be an unsigned long? I thought the latter, so
> couldn't figure out how to write this portably without casting to a
> type that explicitly matched the format string.

That should always be an unsigned int on platforms we support. Note that
we've printed TransactionIds that way (i.e. %u) for a long time and they
are a uint32 as well.

> > Not sure whether it's sensible to only LOG in these cases. After all
> > there's something unexpected happening. The robustness argument doesn't
> > count since we're already shutting down.

> I see no point in throwing an error. The fact that we're having
> trouble cleaning up one dynamic shared memory segment doesn't mean we
> shouldn't try to clean up others, or that any remaining postmaster
> shutdown hooks shouldn't be executed.

Well, it means we'll do a regular shutdown instead of PANICing
and *not* writing a checkpoint.
If something has corrupted our state to the point we cannot unregister
shared memory we registered, something has gone terribly wrong. Quite
possibly we've scribbled over our control structures or such. In that
case it's not proper to do a normal shutdown; we're quite possibly
writing bad data.

> >> + ereport(ERROR,
> >> + (errcode(ERRCODE_INTERNAL_ERROR),
> >> + errmsg("dynamic shared memory control segment is not valid")));
> >
> > Imo that's a PANIC or at the very least a FATAL.
>
> Sure, that's a tempting option, but it doesn't seem to serve any very
> necessary point. There's no data corruption problem if we proceed
> here. Most likely either (1) there's a bug in the code, which
> panicking won't fix or (2) the DBA hand-edited the state file, in
> which case maybe he shouldn't have done that, but if he thinks the
> best way to recover from that is a cluster-wide restart, he can do
> that himself.

"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.

> >> +dsm_segment *
> >> +dsm_create(uint64 size)
> >
> > Do we rely on being run in an environment with proper setup for lwlock
> > cleanup? I can imagine shared libraries doing this pretty early on...
>
> Yes, we rely on that. I don't really see that as a problem. You'd
> better connect to the main shared memory segment before starting to
> create your own.

I am not talking about lwlocks itself being setup but an environment
that has resource owners defined and catches errors. I am specifically
asking because you're a) ereport()ing without releasing an LWLock b)
unconditionally relying on the fact that there's a current resource
owner.
In shared_preload_libraries neither is the case afair?

Now, you could very well argue that you don't need to use dsm for
shared_preload_libraries, but there are enough libraries that can be
used either per session or globally. Requiring them to use both
implementations, or to register stuff later, seems like it would
complicate things.

> >> +void *
> >> +dsm_resize(dsm_segment *seg, uint64 size)
> >> +{
> >> + Assert(seg->control_slot != INVALID_CONTROL_SLOT);
> >> + dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
> >> + &seg->mapped_address, &seg->mapped_size, ERROR);
> >> + return seg->mapped_address;
> >> +}
> >
> > Hm. That's valid when there are other backends attached? What are the
> > implications for already attached ones?
>
> They'll continue to see the portion they have mapped, but must do
> dsm_remap() if they want to see the whole thing.

But resizing can shrink, can it not? And we do an ftruncate() in at
least the posix shmem case. Which means the other backend will get a
SIGSEGV accessing that memory IIRC.

> > Shouldn't we error out if !dsm_impl_can_resize()?
>
> The implementation-specific code throws an error if it can't support
> resize. Even if we put a secondary check here, I wouldn't want
> dsm_impl_op to behave in an undefined manner when asked to resize
> under an implementation that can't. And there doesn't seem to be much
> point in having two checks.

Well, you have the check in dsm_remap() which seems strange to me.

> >> +void
> >> +dsm_detach(dsm_segment *seg)
> >> +{
> >
> > Why do we want to ignore errors like failing to unmap? ISTM that
> > indicates an actual problem...
>
> Sure it does. But what are you going to do about it? In many cases,
> you're going to get here during a transaction abort caused by some
> other error. If the transaction is already aborting, throwing an
> error here will just cause the original error to get discarded in
> favor of showing this one, or maybe it's the other way around. I
> don't remember, but it's definitely one or the other, and neither is
> desirable. Throwing a warning, on the other hand, will notify the
> user, which is what we want.
>
> Now on the flip side we might not be aborting; maybe we're committing.
> But we don't want to turn a commit into an abort just for this. If
> resowner.c detects a buffer pin leak or a tuple descriptor leak, those
> are "just" warnings as well. They're serious warnings, of course, and
> if they happen it means there's a bug in the code that needs to be
> fixed. But the severity of an ereport() isn't based just on how
> alarming the situation is; it's based on what you want to happen when
> that situation comes up. And we've decided (correctly, I think) that
> resource leaks are not grounds for aborting a transaction that
> otherwise would have committed.

We're not talking about a missed munmap() but about one that failed. If
we unpin the leaked pins and notice that we haven't actually pinned it
anymore we do error (well, Assert) out. Same for TupleDescs.

If there were valid scenarios in which you could get into that
situation, maybe. But which would that be? ISTM we can only get there if
our internal state is messed up.

> >> + * several implementations of this facility. Many systems implement
> >> + * POSIX shared memory (shm_open etc.), which is well-suited to our needs
> >> + * in this area, with the exception that shared memory identifiers live
> >> + * in a flat system-wide namespace, raising the uncomfortable prospect of
> >> + * name collisions with other processes (including other copies of
> >> + * PostgreSQL) running on the same system.
> >
> > Why isn't the port number part of the posix shmem identifiers? Sure, we
> > retry, but using a logic similar to sysv_shmem.c seems like a good idea.
>
> According to the man page for shm_open on Solaris, "For maximum
> portability, name should include no more than 14 characters, but this
> limit is not enforced."

What about "/pgsql.%u" or something similar? That should still fit.

> >> + /*
> >> + * If we're reattaching or resizing, we must remove any existing mapping,
> >> + * unless we've already got the right thing mapped.
> >> + */
> >> + if (*mapped_address != NULL)
> >> + {
> >> + if (*mapped_size == request_size)
> >> + return true;
> >
> > Hm. It could have gotten resized to the old size, or resized twice. In
> > that case it might not be at the same address before, so checking for
> > the size doesn't seem to be sufficient.
>
> I don't understand your concern. If someone resizes the DSM to its
> already-current size, there is no need to remap it. The old mapping
> is just fine. And if some other backend resizes the DSM to a larger
> size and then back to the original size, and then we're asked to
> update the mapping, there is no need to change anything.

Yes, forget what I said. I was confusing myself.

> >> +static int
> >> +errcode_for_dynamic_shared_memory()
> >> +{
> >> + if (errno == EFBIG || errno == ENOMEM)
> >> + return errcode(ERRCODE_OUT_OF_MEMORY);
> >> + else
> >> + return errcode_for_file_access();
> >> +}
> >
> > Is EFBIG guaranteed to be defined?
>
> I dunno. We could put an #ifdef around that part. Should we do that
> now or wait and see if it actually breaks anywhere?

A bit of googling around seems to indicate it's likely to be
available. Even on windows according to MSDN.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-22 07:47:52
Message-ID: CAA4eK1JrBB-c+TDFBZL0og0=ftsS+5_WfmDVuqjRV6-d_d0qxg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 20, 2013 at 5:14 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
>
> On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
>> On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
>> >> --- /dev/null
>> >> +++ b/src/backend/storage/ipc/dsm.c
>> >> +#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
>> >> +#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
>> >
>> > Hm, I guess you don't want to add it to global/ or so because of the
>> > mmap implementation where you presumably scan the directory?
>>
>> Yes, and also because I thought this way would make it easier to teach
>> things like pg_basebackup (or anybody's home-brew scripts) to just
>> skip that directory completely. Actually, I was wondering if we ought
>> to have a directory under pgdata whose explicit charter it was to
>> contain files that shouldn't be copied as part of a base backup.
>> pg_do_not_back_this_up.
>
> Wondered exactly about that as soon as you've mentioned
> pg_basebackup. pg_local/?
>
>> >> + /* Determine size for new control segment. */
>> >> + maxitems = PG_DYNSHMEM_FIXED_SLOTS
>> >> + + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
>> >
>> > It seems likely that MaxConnections would be sufficient?
>>
>> I think we could argue about the best way to set this until the cows
>> come home, but I don't think it probably matters much at this point.
>> We can always change the formula later as we gain experience.
>> However, I don't have a principled reason for assuming that only
>> user-connected backends will create dynamic shared memory segments.
>
> Hm, yes. I had MaxBackends down as MaxConnections + autovacuum stuff;
> but max_worker_processes are in there now, so you're right that doesn't
> make sense.
>
>> >> +/*
>> >> + * Read and parse the state file.
>> >> + *
>> >
>> > Perhaps CRC32 the content?
>>
>> I don't see the point. If the file contents are garbage that happens
>> to look like a number, we'll go "oh, there isn't any such segment" or
>> "oh, there is such a segment but it doesn't look like a control
>> segment, so forget it". There are a lot of things we really ought to
>> be CRCing to avoid corruption risk, but I can't see how this is
>> remotely one of them.
>
> I was worried about a partially written file or one containing
> contents from two different postmaster cycles, but it's actually far
> too small for that...
> I initially had thought you'd write the contents of the entire shared
> control segment there, not just its ID.
>
>> >> + /* Create or truncate the file. */
>> >> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>> >
>> > Doesn't this need a | PG_BINARY?
>>
>> It's a text file. Do we need PG_BINARY anyway?
>
> I'd say yes. Non binary mode stuff on windows does stuff like
> transforming LF <=> CRLF on reading/writing, which makes sizes not match
> up and similar ugliness.
> Imo there's little reason to use non-binary mode for anything written
> for postgres' own consumption.

On checking this in the code, I found the below comment, which
suggests LF <=> CRLF is not an issue (on Windows it uses pgwin32_open
to open a file):

/*
 * NOTE: this is also used for opening text files.
 * WIN32 treats Control-Z as EOF in files opened in text mode.
 * Therefore, we open files in binary mode on Win32 so we can read
 * literal control-Z. The other affect is that we see CRLF, but
 * that is OK because we can already handle those cleanly.
 */

Second instance: I noticed the code below, which again suggests CRLF
should not be an issue unless the file mode is specifically set to
text mode, which is not the case with the current usage of open() in
the dynamic shared memory code.

#ifdef WIN32
/* use CRLF line endings on Windows */
_setmode(_fileno(fh), _O_TEXT);
#endif

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-22 19:04:15
Message-ID: 20130922190415.GA9218@tornado.leadboat.com
Lists: pgsql-hackers

On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
> On Fri, Sep 20, 2013 at 5:14 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
> >> On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> >> + /* Create or truncate the file. */
> >> >> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
> >> >
> >> > Doesn't this need a | PG_BINARY?
> >>
> >> It's a text file. Do we need PG_BINARY anyway?
> >
> > I'd say yes. Non binary mode stuff on windows does stuff like
> > transforming LF <=> CRLF on reading/writing, which makes sizes not match
> > up and similar ugliness.
> > Imo there's little reason to use non-binary mode for anything written
> > for postgres' own consumption.
>
> On checking this in the code, I found the below comment, which
> suggests LF <=> CRLF is not an issue (on Windows it uses pgwin32_open
> to open a file):
>
> /*
> * NOTE: this is also used for opening text files.
> * WIN32 treats Control-Z as EOF in files opened in text mode.
> * Therefore, we open files in binary mode on Win32 so we can read
> * literal control-Z. The other affect is that we see CRLF, but
> * that is OK because we can already handle those cleanly.
> */

That comment appears at the definition of PG_BINARY. You only get what it
describes when you use PG_BINARY.

> Second instance: I noticed the code below, which again suggests CRLF
> should not be an issue unless the file mode is specifically set to
> text mode, which is not the case with the current usage of open() in
> the dynamic shared memory code.
>
> #ifdef WIN32
> /* use CRLF line endings on Windows */
> _setmode(_fileno(fh), _O_TEXT);
> #endif

I suspect that call (in logfile_open()) has no effect. The file is already in
text mode.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-23 05:13:33
Message-ID: CAA4eK1LjBsnYiR1TRp-AYmY9qmb2Ury6hdGwprib-NwvAymNQw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 23, 2013 at 12:34 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
>> On Fri, Sep 20, 2013 at 5:14 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
>> >> On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> >> + /* Create or truncate the file. */
>> >> >> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>> >> >
>> >> > Doesn't this need a | PG_BINARY?
>> >>
>> >> It's a text file. Do we need PG_BINARY anyway?
>> >
>> > I'd say yes. Non binary mode stuff on windows does stuff like
>> > transforming LF <=> CRLF on reading/writing, which makes sizes not match
>> > up and similar ugliness.
>> > Imo there's little reason to use non-binary mode for anything written
>> > for postgres' own consumption.
>>
>> On checking this in the code, I found the below comment, which
>> suggests LF <=> CRLF is not an issue (on Windows it uses pgwin32_open
>> to open a file):
>>
>> /*
>> * NOTE: this is also used for opening text files.
>> * WIN32 treats Control-Z as EOF in files opened in text mode.
>> * Therefore, we open files in binary mode on Win32 so we can read
>> * literal control-Z. The other affect is that we see CRLF, but
>> * that is OK because we can already handle those cleanly.
>> */
>
> That comment appears at the definition of PG_BINARY. You only get what it
> describes when you use PG_BINARY.
>
>> Second instance: I noticed the code below, which again suggests CRLF
>> should not be an issue unless the file mode is specifically set to
>> text mode, which is not the case with the current usage of open() in
>> the dynamic shared memory code.
>>
>> #ifdef WIN32
>> /* use CRLF line endings on Windows */
>> _setmode(_fileno(fh), _O_TEXT);
>> #endif
>
> I suspect that call (in logfile_open()) has no effect. The file is already in
> text mode.

Won't this be required when we have to open a new file due to log
rotation based on time?

The basic point I wanted to make is that unless you use _O_TEXT mode
explicitly, the LF <=> CRLF problem will not happen. The CreateFile()
API, which is used for the Windows implementation of open(), doesn't
take any parameter specifying text or binary; only by using
_setmode() can we set the file mode to text or binary.
I checked fcntl.h, where there is the below comment above the
definitions of _O_TEXT and _O_BINARY, which again points to what I
said above:
/* O_TEXT files have <cr><lf> sequences translated to <lf> on read()'s,
** and <lf> sequences translated to <cr><lf> on write()'s
*/

One more point: this issue only has a chance of occurring when
somebody takes the file from a Unix system to Windows and then maybe
back. Do you think the dsm state file should be allowed in a
cross-platform backup? I think pg_basebackup should disallow backing
up this file.
However, a user could use some other custom utility to take a
filesystem-level backup where this can happen, but still, as per my
understanding, it should not create a problem.

I think to be on the safe side we can use PG_BINARY, but it would be
better to be sure this problem can actually occur before we use it.
If you think cross-platform backups can create such issues, then I
can test this as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-23 19:02:01
Message-ID: 20130923190201.GA16839@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Sep 23, 2013 at 10:43:33AM +0530, Amit Kapila wrote:
> On Mon, Sep 23, 2013 at 12:34 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
> >> #ifdef WIN32
> >> /* use CRLF line endings on Windows */
> >> _setmode(_fileno(fh), _O_TEXT);
> >> #endif
> >
> > I suspect that call (in logfile_open()) has no effect. The file is already in
> > text mode.
>
> Won't this be required when we have to open a new file due to log
> rotation based on time?
>
> The basic point I wanted to make is that unless you use _O_TEXT mode
> explicitly, the LF <=> CRLF problem will not happen. The CreateFile()
> API, which is used for the Windows implementation of open(), doesn't
> take any parameter specifying text or binary; only by using
> _setmode() can we set the file mode to text or binary.

You are indeed correct. I had assumed that pgwin32_open() does not change the
usual Windows open()/fopen() behavior concerning line endings. No code
comment mentions otherwise, and that would make pro forma our pervasive use of
PG_BINARY. Nonetheless, it behaves as you say. I wonder if that was
intentional, and I wonder if the outcome varies between Visual Studio versions
(I tested with VS2010).

> I checked fcntl.h where there is below comment above definition of
> _O_TEXT and _O_BINARY which again is pointing to what I said above.
> /* O_TEXT files have <cr><lf> sequences translated to <lf> on read()'s,
> ** and <lf> sequences translated to <cr><lf> on write()'s
> */

However, O_TEXT is the default in a normal Windows program:
http://msdn.microsoft.com/en-us/library/ktss1a9b.aspx

> I think to be on safe side we can use PG_BINARY, but it would be
> better if we are sure that this problem can occur then only we should
> use it.
> If you think cross platform backup's can create such issues, then I
> can once test this as well.

I don't know whether writing it as binary will help or hurt that situation.
If nothing else, binary gives you one less variation to think about when
studying the code. Anyone sophisticated enough to meaningfully examine the
file will have no trouble dealing with either line ending convention.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-24 03:28:36
Message-ID: CAA4eK1LH0qj4oYWKumQjROhKsQB7E4JrRgD1H1VEu4WzFdwW9g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 24, 2013 at 12:32 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Sep 23, 2013 at 10:43:33AM +0530, Amit Kapila wrote:
>> On Mon, Sep 23, 2013 at 12:34 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> > On Sun, Sep 22, 2013 at 01:17:52PM +0530, Amit Kapila wrote:
>> >> #ifdef WIN32
>> >> /* use CRLF line endings on Windows */
>> >> _setmode(_fileno(fh), _O_TEXT);
>> >> #endif
>> >
>> > I suspect that call (in logfile_open()) has no effect. The file is already in
>> > text mode.
>>
>> Won't this be required when we have to open a new file due to log
>> rotation based on time?
>>
>> The basic point, I wanted to make is that until you use _O_TEXT mode
>> explicitly, the problem LF<=>CRLF will not happen. CreateFile() API
>> which is used for windows implementation of open doesn't take any
>> parameter which specifies it as text or binary, only by using
>> _setmode, we need to set the file mode as Text or Binary.
>
> You are indeed correct. I had assumed that pgwin32_open() does not change the
> usual Windows open()/fopen() behavior concerning line endings. No code
> comment mentions otherwise, and that would make pro forma our pervasive use of
> PG_BINARY.

The only comment giving an indication (or at least, the one I got the
indication from) is the one I mentioned earlier in this thread, at the
top of the PG_BINARY definition. I understand that the comment is not
very clear about the actual handling of the CRLF issue.

> Nonetheless, it behaves as you say. I wonder if that was
> intentional, and I wonder if the outcome varies between Visual Studio versions
> (I tested with VS2010).

Ideally it should not depend on the VS version, as the outcome depends
on how the APIs (CreateFile/_setmode) are used.

>> I checked fcntl.h where there is below comment above definition of
>> _O_TEXT and _O_BINARY which again is pointing to what I said above.
>> /* O_TEXT files have <cr><lf> sequences translated to <lf> on read()'s,
>> ** and <lf> sequences translated to <cr><lf> on write()'s
>> */
>
> However, O_TEXT is the default in a normal Windows program:
> http://msdn.microsoft.com/en-us/library/ktss1a9b.aspx
>> I think to be on safe side we can use PG_BINARY, but it would be
>> better if we are sure that this problem can occur then only we should
>> use it.
>> If you think cross platform backup's can create such issues, then I
>> can once test this as well.
>
> I don't know whether writing it as binary will help or hurt that situation.
> If nothing else, binary gives you one less variation to think about when
> studying the code.

In that case, shouldn't all the other places be consistent? One reason
I had in mind for using the appropriate mode is that somebody reading
the code could later come up with a question, or a patch to use the
correct mode, and then we would be in the same situation again.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-24 16:19:51
Message-ID: CA+TgmoYzo_3m8WQej3MAk7bKD3tw1GHNbFWga576aEGitwWTUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > Hm, I guess you dont't want to add it to global/ or so because of the
>> > mmap implementation where you presumably scan the directory?
>>
>> Yes, and also because I thought this way would make it easier to teach
>> things like pg_basebackup (or anybody's home-brew scripts) to just
>> skip that directory completely. Actually, I was wondering if we ought
>> to have a directory under pgdata whose explicit charter it was to
>> contain files that shouldn't be copied as part of a base backup.
>> pg_do_not_back_this_up.
>
> Wondered exactly about that as soon as you've mentioned
> pg_basebackup. pg_local/?

That seems reasonable. It's not totally transparent what that's
supposed to mean, but it's fairly mnemonic once you know. Other
suggestions welcome, if anyone has ideas.

Are there any other likely candidates for inclusion in that directory
other than this stuff?

>> >> + /* Create or truncate the file. */
>> >> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>> >
>> > Doesn't this need a | PG_BINARY?
>>
>> It's a text file. Do we need PG_BINARY anyway?
>
> I'd say yes. Non binary mode stuff on windows does stuff like
> transforming LF <=> CRLF on reading/writing, which makes sizes not match
> up and similar ugliness.
> Imo there's little reason to use non-binary mode for anything written
> for postgres' own consumption.

Well, I'm happy to do whatever the consensus is. AFAICT you and Noah
are both for it and Amit's position is that it doesn't matter either
way, so I'll go ahead and change that unless further discussion sheds
a different light on things.

>> > Why are you using open() and not
>> > BasicOpenFile or even OpenTransientFile?
>>
>> Because those don't work in the postmaster.
>
> Oh, that's news to me. Seems strange, especially for BasicOpenFile.

Per its header comment, InitFileAccess is not called in the
postmaster, so there's no VFD cache. Thus, any attempt by
BasicOpenFile to call ReleaseLruFile would be pointless at best.

>> dsm_handle is an alias for uint32. Is that always exactly an unsigned
>> int or can it sometimes be an unsigned long? I thought the latter, so
>> couldn't figure out how to write this portably without casting to a
>> type that explicitly matched the format string.
>
> That should always be an unsigned int on platforms we support. Note that
> we've printed TransactionIds that way (i.e. %u) for a long time and they
> are a uint32 as well.

Fixed.

>> > Not sure whether it's sensible to only LOG in these cases. After all
>> > there's something unexpected happening. The robustness argument doesn't
>> > count since we're already shutting down.
>
>> I see no point in throwing an error. The fact that we're having
>> trouble cleaning up one dynamic shared memory segment doesn't mean we
>> shouldn't try to clean up others, or that any remaining postmaster
>> shutdown hooks shouldn't be executed.
>
> Well, it means we'll do a regular shutdown instead of PANICing
> and *not* writing a checkpoint.
> If something has corrupted our state to the point we cannot unregister
> shared memory we registered, something has gone terribly wrong. Quite
> possibly we've scribbled over our control structures or such. In that
> case it's not proper to do a normal shutdown, we're quite possibly
> writing bad data.

I have to admit I didn't consider the possibility of an
otherwise-clean shutdown that hit only this problem. I'm not sure how
seriously to take that case. I guess we could emit warnings for
individual failures and then throw an error at the end if there were >
0, but that seems a little ugly. Or we could go whole hog and treat
any failure as a critical error. Anyone else have an opinion on what
to do here?

>> > Imo that's a PANIC or at the very least a FATAL.
>>
>> Sure, that's a tempting option, but it doesn't seem to serve any very
>> necessary point. There's no data corruption problem if we proceed
>> here. Most likely either (1) there's a bug in the code, which
>> panicking won't fix or (2) the DBA hand-edited the state file, in
>> which case maybe he shouldn't have done that, but if he thinks the
>> best way to recover from that is a cluster-wide restart, he can do
>> that himself.
>
> "There's no data corruption problem if we proceed" - but there likely
> has been one leading to the current state.

I doubt it. It's more likely that the file permissions got changed or
something.

>> > Do we rely on being run in an environment with proper setup for lwlock
>> > cleanup? I can imagine shared libraries doing this pretty early on...
>>
>> Yes, we rely on that. I don't really see that as a problem. You'd
>> better connect to the main shared memory segment before starting to
>> create your own.
>
> I am not talking about lwlocks itself being setup but an environment
> that has resource owners defined and catches errors. I am specifically
> asking because you're a) ereport()ing without releasing an LWLock b)
> unconditionally relying on the fact that there's a current resource
> owner.
> In shared_preload_libraries neither is the case afair?
>
> Now, you could very well argue that you don't need to use dsm for
> shared_preload_libraries but there are enough libraries that you can use
> per session or globally. Requiring them to use both implementation or
> register stuff later seems like it would complicate things.

Well, the cleanup logic gets a lot more complicated without a resource
owner. The first few drafts of this logic didn't involve any resource
owner integration and things got quite a bit simpler and nicer when I
added that. For example, in dsm_create(), we do
dsm_create_descriptor() and then loop until we find an unused segment
identifier. If dsm_impl_op() throws an ERROR, which it definitely
can, then the segment produced by dsm_create_descriptor() is still
lying around in dsm_segment_list, and without the resource owner
machinery, that's a permanent leak. Certainly, it's fixable. You can
put PG_TRY() blocks in everywhere and solve the problem that way. But
I'm not very keen on going that route; it looks like it will be
painful and messy.

I also do not think that allocating dynamic shared memory segments in
shared_preload_libraries is actually sensible. You're in the
postmaster at that point, and the main shared memory segment is not
set up. If you were to map a shared memory segment at that point, the
mapping would get inherited in EXEC_BACKEND environments but not
otherwise, so we'd need more infrastructure to handle that. And, of
course, we couldn't use LWLocks for synchronization. I think that we
couldn't use spinlocks either, even if it were otherwise acceptable,
since with --disable-spinlocks those are going to turn into semaphores
that I don't think are available at this point either.

I don't really feel like solving all of those problems and, TBH, I
don't see why it's particularly important. If somebody wants a
loadable module that can be loaded either from
shared_preload_libraries or on the fly, and they use dynamic shared
memory in the latter case, then they can use it in the former case as
well. If they've already got logic to create the DSM when it's first
needed, it doesn't cost extra to do it that way in both cases.

>> They'll continue to see the portion they have mapped, but must do
>> dsm_remap() if they want to see the whole thing.
>
> But resizing can shrink, can it not? And we do an ftruncate() in at
> least the posix shmem case. Which means the other backend will get a
> SIGSEGV accessing that memory IIRC.

Yep. Shrinking the shared memory segment will require special
caution. Caveat emptor.

>> > Shouldn't we error out if !dsm_impl_can_resize()?
>>
>> The implementation-specific code throws an error if it can't support
>> resize. Even if we put a secondary check here, I wouldn't want
>> dsm_impl_op to behave in an undefined manner when asked to resize
>> under an implementation that can't. And there doesn't seem to be much
>> point in having two checks.
>
> Well, You have the check in dsm_remap() which seems strange to me.

Oh, fair point. Removed.

>> Now on the flip side we might not be aborting; maybe we're committing.
>> But we don't want to turn a commit into an abort just for this. If
>> resowner.c detects a buffer pin leak or a tuple descriptor leak, those
>> are "just" warning as well. They're serious warnings, of course, and
>> if they happen it means there's a bug in the code that needs to be
>> fixed. But the severity of an ereport() isn't based just on how
>> alarming the situation is; it's based on what you want to happen when
>> that situation comes up. And we've decided (correctly, I think) that
>> resource leaks are not grounds for aborting a transaction that
>> otherwise would have committed.
>
> We're not talking about a missed munmap() but about one that failed. If
> we unpin the leaked pins and notice that we haven't actually pinned it
> anymore we do error (well, Assert) out. Same for TupleDescs.
>
> If there were valid scenarios in which you could get into that
> situation, maybe. But which would that be? ISTM we can only get there if
> our internal state is messed up.

I don't know. I think that's part of why it's hard to decide what we
want to happen. But personally I think it's paranoid to say, well,
something happened that we weren't expecting, so that must mean
something totally horrible has happened and we'd better die in a fire.
I mean, the fact that the checks you are talking about are assertions
means that they are scenarios we expect never to happen, and therefore
we don't even check for them in a production build. I don't think you
can use that as a precedent to show that any failure here is an
automatic PANIC.

>> > Why isn't the port number part of the posix shmem identifiers? Sure, we
>> > retry, but using a logic similar to sysv_shmem.c seems like a good idea.
>>
>> According to the man page for shm_open on Solaris, "For maximum
>> portability, name should include no more than 14 characters, but this
>> limit is not enforced."
>
> What about "/pgsql.%u" or something similar? That should still fit.

Well, if you want both the port and the identifier in there, that
doesn't get you there.

>> >> +static int
>> >> +errcode_for_dynamic_shared_memory()
>> >> +{
>> >> + if (errno == EFBIG || errno == ENOMEM)
>> >> + return errcode(ERRCODE_OUT_OF_MEMORY);
>> >> + else
>> >> + return errcode_for_file_access();
>> >> +}
>> >
>> > Is EFBIG guaranteed to be defined?
>>
>> I dunno. We could put an #ifdef around that part. Should we do that
>> now or wait and see if it actually breaks anywhere?
>
> A bit of googling around seems to indicate it's likely to be
> available. Even on windows according to MSDN.

Cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-24 18:19:53
Message-ID: 20130924181953.GC11521@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-09-24 12:19:51 -0400, Robert Haas wrote:
> On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> > Hm, I guess you dont't want to add it to global/ or so because of the
> >> > mmap implementation where you presumably scan the directory?
> >>
> >> Yes, and also because I thought this way would make it easier to teach
> >> things like pg_basebackup (or anybody's home-brew scripts) to just
> >> skip that directory completely. Actually, I was wondering if we ought
> >> to have a directory under pgdata whose explicit charter it was to
> >> contain files that shouldn't be copied as part of a base backup.
> >> pg_do_not_back_this_up.
> >
> > Wondered exactly about that as soon as you've mentioned
> > pg_basebackup. pg_local/?
>
> That seems reasonable. It's not totally transparent what that's
> supposed to mean, but it's fairly mnemonic once you know. Other
> suggestions welcome, if anyone has ideas.

pg_node_local/ was the only reasonable thing I could think of otherwise,
and I disliked it because it seems we shouldn't introduce "node" as a
term just for this.

> Are there any other likely candidates for inclusion in that directory
> other than this stuff?

You could argue that pg_stat_tmp/ is one.

> >> > Why are you using open() and not
> >> > BasicOpenFile or even OpenTransientFile?
> >>
> >> Because those don't work in the postmaster.
> >
> > Oh, that's news to me. Seems strange, especially for BasicOpenFile.

> Per its header comment, InitFileAccess is not called in the
> postmaster, so there's no VFD cache. Thus, any attempt by
> BasicOpenFile to call ReleaseLruFile would be pointless at best.

Well, but it makes code running in both backends and postmaster easier
to write. Good enough for me anyway.

> >> > Imo that's a PANIC or at the very least a FATAL.
> >>
> >> Sure, that's a tempting option, but it doesn't seem to serve any very
> >> necessary point. There's no data corruption problem if we proceed
> >> here. Most likely either (1) there's a bug in the code, which
> >> panicking won't fix or (2) the DBA hand-edited the state file, in
> >> which case maybe he shouldn't have done that, but if he thinks the
> >> best way to recover from that is a cluster-wide restart, he can do
> >> that himself.
> >
> > "There's no data corruption problem if we proceed" - but there likely
> > has been one leading to the current state.
>
> I doubt it. It's more likely that the file permissions got changed or
> something.

We panic in that case during a shutdown, don't we? ... Yep:
PANIC: could not open control file "global/pg_control": Permission denied

> > I am not talking about lwlocks itself being setup but an environment
> > that has resource owners defined and catches errors. I am specifically
> > asking because you're a) ereport()ing without releasing an LWLock b)
> > unconditionally relying on the fact that there's a current resource
> > owner.
> > In shared_preload_libraries neither is the case afair?

> I don't really feel like solving all of those problems and, TBH, I
> don't see why it's particularly important. If somebody wants a
> loadable module that can be loaded either from
> shared_preload_libraries or on the fly, and they use dynamic shared
> memory in the latter case, then they can use it in the former case as
> well. If they've already got logic to create the DSM when it's first
> needed, it doesn't cost extra to do it that way in both cases.

Fair enough.

> >> They'll continue to see the portion they have mapped, but must do
> >> dsm_remap() if they want to see the whole thing.
> >
> > But resizing can shrink, can it not? And we do an ftruncate() in at
> > least the posix shmem case. Which means the other backend will get a
> > SIGSEGV accessing that memory IIRC.

> Yep. Shrinking the shared memory segment will require special
> caution. Caveat emptor.

Then a comment to that effect would be nice.

> > We're not talking about a missed munmap() but about one that failed. If
> > we unpin the leaked pins and notice that we haven't actually pinned it
> > anymore we do error (well, Assert) out. Same for TupleDescs.
> >
> > If there were valid scenarios in which you could get into that
> > situation, maybe. But which would that be? ISTM we can only get there if
> > our internal state is messed up.

> I don't know. I think that's part of why it's hard to decide what we
> want to happen. But personally I think it's paranoid to say, well,
> something happened that we weren't expecting, so that must mean
> something totally horrible has happened and we'd better die in a fire.

Well, by that argument we wouldn't need to PANIC on a whole host of
issues. Like segfaults.

Anyway, I guess we need other opinions here.

> >> > Why isn't the port number part of the posix shmem identifiers? Sure, we
> >> > retry, but using a logic similar to sysv_shmem.c seems like a good idea.
> >>
> >> According to the man page for shm_open on Solaris, "For maximum
> >> portability, name should include no more than 14 characters, but this
> >> limit is not enforced."
> >
> > What about "/pgsql.%u" or something similar? That should still fit.
>
> Well, if you want both the port and the identifier in there, that
> doesn't get you there.

Port seems enough to start with - most machines are configured to have
only one cluster starting on one port. That way we wouldn't routinely
get conflicts, only if somebody does something crazy.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-25 03:18:23
Message-ID: CAA4eK1K1bKkeiODu2N4A0Nc74E6LX5MA82=_J8wXk+=bGeZOhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 24, 2013 at 9:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> > Hm, I guess you dont't want to add it to global/ or so because of the
>>> > mmap implementation where you presumably scan the directory?
>>>
>>> Yes, and also because I thought this way would make it easier to teach
>>> things like pg_basebackup (or anybody's home-brew scripts) to just
>>> skip that directory completely. Actually, I was wondering if we ought
>>> to have a directory under pgdata whose explicit charter it was to
>>> contain files that shouldn't be copied as part of a base backup.
>>> pg_do_not_back_this_up.
>>
>> Wondered exactly about that as soon as you've mentioned
>> pg_basebackup. pg_local/?
>
> That seems reasonable. It's not totally transparent what that's
> supposed to mean, but it's fairly mnemonic once you know. Other
> suggestions welcome, if anyone has ideas.
>
> Are there any other likely candidates for inclusion in that directory
> other than this stuff?
pgsql_tmp. See sendDir() in basebackup.c, where we avoid sending
those files in a backup.
Some future features, like ALTER SYSTEM, could also use it for temp files.

>>> >> + /* Create or truncate the file. */
>>> >> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>>> >
>>> > Doesn't this need a | PG_BINARY?
>>>
>>> It's a text file. Do we need PG_BINARY anyway?
>>
>> I'd say yes. Non binary mode stuff on windows does stuff like
>> transforming LF <=> CRLF on reading/writing, which makes sizes not match
>> up and similar ugliness.
>> Imo there's little reason to use non-binary mode for anything written
>> for postgres' own consumption.
>
> Well, I'm happy to do whatever the consensus is. AFAICT you and Noah
> are both for it and Amit's position is that it doesn't matter either
> way

I am sorry if my mails didn't convey that I am in favour of keeping
the code as it is unless there is really a case which requires it.
Basically, as per my understanding, the facts I presented in the mails
above indicate that there is no need for PG_BINARY in this case.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-26 13:27:13
Message-ID: 20130926132713.GB47518@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:
> On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Actually, I was wondering if we ought
> >> to have a directory under pgdata whose explicit charter it was to
> >> contain files that shouldn't be copied as part of a base backup.
> >> pg_do_not_back_this_up.
> >
> > Wondered exactly about that as soon as you've mentioned
> > pg_basebackup. pg_local/?
>
> That seems reasonable. It's not totally transparent what that's
> supposed to mean, but it's fairly mnemonic once you know. Other
> suggestions welcome, if anyone has ideas.

I like the concept and have no improvements on the name.

> Are there any other likely candidates for inclusion in that directory
> other than this stuff?

pg_xlog

> >> > Not sure whether it's sensible to only LOG in these cases. After all
> >> > there's something unexpected happening. The robustness argument doesn't
> >> > count since we're already shutting down.
> >
> >> I see no point in throwing an error. The fact that we're having
> >> trouble cleaning up one dynamic shared memory segment doesn't mean we
> >> shouldn't try to clean up others, or that any remaining postmaster
> >> shutdown hooks shouldn't be executed.
> >
> > Well, it means we'll do a regular shutdown instead of PANICing
> > and *not* writing a checkpoint.
> > If something has corrupted our state to the point we cannot unregister
> > shared memory we registered, something has gone terribly wrong. Quite
> > possibly we've scribbled over our control structures or such. In that
> > case it's not proper to do a normal shutdown, we're quite possibly
> > writing bad data.
>
> I have to admit I didn't consider the possibility of an
> otherwise-clean shutdown that hit only this problem. I'm not sure how
> seriously to take that case. I guess we could emit warnings for
> individual failures and then throw an error at the end if there were >
> 0, but that seems a little ugly. Or we could go whole hog and treat
> any failure as a critical error. Anyone else have an opinion on what
> to do here?

There's extensive precedent in our code for LOG, WARNING, or even ignoring the
return value of unlink(). (To my surprise, ignoring the return value is the
most popular choice.) Of the dozens of backend callers, here is the mixed bag
that actually raises ERROR or better:

do_pg_stop_backup()
RestoreArchivedFile()
KeepFileRestoredFromArchive()
create_tablespace_directories() [remove old symlink during recovery]
destroy_tablespace_directories()
RelationCacheInitFilePreInvalidate()
CreateLockFile()

I think it's awfully unlikely that runaway code would corrupt shared_buffers
AND manage to make an unlink() fail.

> >> > Imo that's a PANIC or at the very least a FATAL.
> >>
> >> Sure, that's a tempting option, but it doesn't seem to serve any very
> >> necessary point. There's no data corruption problem if we proceed
> >> here. Most likely either (1) there's a bug in the code, which
> >> panicking won't fix or (2) the DBA hand-edited the state file, in
> >> which case maybe he shouldn't have done that, but if he thinks the
> >> best way to recover from that is a cluster-wide restart, he can do
> >> that himself.
> >
> > "There's no data corruption problem if we proceed" - but there likely
> > has been one leading to the current state.

+1 for making this one a PANIC, though. With startup behind us, a valid dsm
state file pointed us to a control segment with bogus contents. The
conditional probability of shared memory corruption seems higher than that of
a DBA editing the dsm state file of a running cluster to incorrectly name as
the dsm control segment some other existing shared memory segment.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Jim Nasby <jim(at)nasby(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-26 18:45:21
Message-ID: 524480C1.7080405@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 9/26/13 8:27 AM, Noah Misch wrote:
> On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:
>> >On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund<andres(at)2ndquadrant(dot)com> wrote:
>>>> > >>Actually, I was wondering if we ought
>>>> > >>to have a directory under pgdata whose explicit charter it was to
>>>> > >>contain files that shouldn't be copied as part of a base backup.
>>>> > >>pg_do_not_back_this_up.
>>> > >
>>> > >Wondered exactly about that as soon as you've mentioned
>>> > >pg_basebackup. pg_local/?
>> >
>> >That seems reasonable. It's not totally transparent what that's
>> >supposed to mean, but it's fairly mnemonic once you know. Other
>> >suggestions welcome, if anyone has ideas.
> I like the concept and have no improvements on the name.
>
>> >Are there any other likely candidates for inclusion in that directory
>> >other than this stuff?
> pg_xlog

Isn't it also pointless to back up temp objects as well as non-logged tables?

Or is the purpose of pg_local to be a home for things that MUST NOT be backed up as opposed to items where backing them up is pointless?
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-26 19:58:07
Message-ID: CA+TgmoaiAVQhbWqHYNXR-0yysH4WxGH2W0WFdtoVufairmpqPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 26, 2013 at 2:45 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On 9/26/13 8:27 AM, Noah Misch wrote:
>>
>> On Tue, Sep 24, 2013 at 12:19:51PM -0400, Robert Haas wrote:
>>>
>>> >On Fri, Sep 20, 2013 at 7:44 AM, Andres Freund<andres(at)2ndquadrant(dot)com>
>>> > wrote:
>>>>>
>>>>> > >>Actually, I was wondering if we ought
>>>>> > >>to have a directory under pgdata whose explicit charter it was to
>>>>> > >>contain files that shouldn't be copied as part of a base backup.
>>>>> > >>pg_do_not_back_this_up.
>>>>
>>>> > >
>>>> > >Wondered exactly about that as soon as you've mentioned
>>>> > >pg_basebackup. pg_local/?
>>>
>>> >
>>> >That seems reasonable. It's not totally transparent what that's
>>> >supposed to mean, but it's fairly mnemonic once you know. Other
>>> >suggestions welcome, if anyone has ideas.
>>
>> I like the concept and have no improvements on the name.
>>
>>> >Are there any other likely candidates for inclusion in that directory
>>> >other than this stuff?
>>
>> pg_xlog
>
>
> Isn't it also pointless to back up temp objects, as well as unlogged tables?
>
> Or is the purpose of pg_local to be a home for things that MUST NOT be
> backed up as opposed to items where backing them up is pointless?

I don't know. I found it surprising that Noah suggested including
pg_xlog; that's certainly not "node-local state" in any meaningful
sense, the way dsm stuff is. But I guess if pg_basebackup excludes it
it arguably qualifies. However, it'd be nice to advertise that
pg_local can be cleared when the server is shut down, and you
certainly cannot do that to pg_xlog.

I think the way I'd summarize the state of this patch is that everyone
who has looked at it more or less agrees that the big picture is
right, but there are details, which I think everyone admits are pretty
much corner cases, where different people have different ideas about
which way to go and what risks and benefits might be thereby incurred.
I'll not pretend that my opinions are categorically better than any
others, but of course I like them because they are mine. Figuring out
just what to do about those last details seems challenging; no
answer will completely please everyone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-30 23:33:51
Message-ID: 20130930233351.GB125986@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Sep 24, 2013 at 08:58:36AM +0530, Amit Kapila wrote:
> On Tue, Sep 24, 2013 at 12:32 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > I don't know whether writing it as binary will help or hurt that situation.
> > If nothing else, binary gives you one less variation to think about when
> > studying the code.
>
> In that case, shouldn't all other places be consistent? One reason I
> had in mind for using the appropriate mode is that somebody reading the
> code may tomorrow come up with a question or a patch to use the correct
> mode, and then we will again be in the same situation.

There are cases that must use binary I/O (table data files), cases that
benefit notably from text I/O (log files, postgresql.conf), and cases where it
doesn't matter too much (dsm state file, postmaster.pid). I don't see a need
to make widespread changes to other call sites.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-10-08 19:40:04
Message-ID: CA+TgmoYW33KXktGim1ukOQ_0iVTYLYGLVPZJX8E0pehd51TXjA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 26, 2013 at 9:27 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> > "There's no data corruption problem if we proceed" - but there likely
>> > has been one leading to the current state.
>
> +1 for making this one a PANIC, though. With startup behind us, a valid dsm
> state file pointed us to a control segment with bogus contents. The
> conditional probability of shared memory corruption seems higher than that of
> a DBA editing the dsm state file of a running cluster to incorrectly name as
> the dsm control segment some other existing shared memory segment.

To respond specifically to this point... inability to open a file on
disk does not mean that shared memory is corrupted. Full stop.

A scenario I have seen a few times is that someone changes the
permissions on part or all of $PGDATA while the server is running. I
have only ever seen this happen on Windows. What typically happens
today - depending on the exact scenario - is that the checkpoints will
fail, but the server will remain up, sometimes even committing
transactions under synchronous_commit=off, even though it can't write
out its data. If you fix the permissions before shutting down the
server, you don't even lose any data. Making inability to read a file
into a PANIC condition will cause any such cluster to remain up only
as long as nobody tries to use dynamic shared memory, and then throw
up its guts. I don't think users will appreciate that.

I am tempted to commit the latest version of this patch as I have it.
I think there's a lot of bikeshedding left to be done here, but
there's no real reason why we can't change this subsequent to the
initial commit as the answers become more clear. Changing the error
levels used for particular messages, or rearranging the directory
structure, is quite trivial. But we can't do that as long as we have
N people with >=N opinions on what to do, and the way to get more
clarity there is to get the code out in front of a few more people and
see how things shake out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-10-08 20:00:14
Message-ID: 20131008200014.GC3718183@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-10-08 15:40:04 -0400, Robert Haas wrote:
> I am tempted to commit the latest version of this patch as I have it.

I haven't looked at the latest version of the patch, but based on the
previous version I have no problem with that.

If you'd feel more comfortable with another round of review, scanning
for things other than elevels, I can do that towards the weekend. Before
or after you've committed.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-10-08 23:51:24
Message-ID: 20131008235124.GA236875@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Oct 08, 2013 at 03:40:04PM -0400, Robert Haas wrote:
> On Thu, Sep 26, 2013 at 9:27 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:

> >> > "There's no data corruption problem if we proceed" - but there likely
> >> > has been one leading to the current state.
> >
> > +1 for making this one a PANIC, though. With startup behind us, a valid dsm
> > state file pointed us to a control segment with bogus contents. The
> > conditional probability of shared memory corruption seems higher than that of
> > a DBA editing the dsm state file of a running cluster to incorrectly name as
> > the dsm control segment some other existing shared memory segment.
>
> To respond specifically to this point... inability to open a file on
> disk does not mean that shared memory is corrupted. Full stop.
>
> A scenario I have seen a few times is that someone changes the
> permissions on part or all of $PGDATA while the server is running.

I was discussing the third ereport() in dsm_backend_startup(), which does not
pertain to inability to open a file. The second ereport() would fire in the
damaged-permissions scenario, and I fully agree with that one using ERROR.

Incidentally, dsm_backend_startup() has a typo: s/"one/"none/

> I am tempted to commit the latest version of this patch as I have it.

Works for me.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-10-13 07:07:55
Message-ID: CAA4eK1K2dMzze2ramOkEZY-REAu4q92qic4aDVzv_6KODy9P2w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 9, 2013 at 1:10 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 26, 2013 at 9:27 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>> > "There's no data corruption problem if we proceed" - but there likely
>>> > has been one leading to the current state.
>>
>> +1 for making this one a PANIC, though. With startup behind us, a valid dsm
>> state file pointed us to a control segment with bogus contents. The
>> conditional probability of shared memory corruption seems higher than that of
>> a DBA editing the dsm state file of a running cluster to incorrectly name as
>> the dsm control segment some other existing shared memory segment.
>
> To respond specifically to this point... inability to open a file on
> disk does not mean that shared memory is corrupted. Full stop.
>
> A scenario I have seen a few times is that someone changes the
> permissions on part or all of $PGDATA while the server is running. I
> have only ever seen this happen on Windows. What typically happens
> today - depending on the exact scenario - is that the checkpoints will
> fail, but the server will remain up, sometimes even committing
> transactions under synchronous_commit=off, even though it can't write
> out its data. If you fix the permissions before shutting down the
> server, you don't even lose any data. Making inability to read a file
> into a PANIC condition will cause any such cluster to remain up only
> as long as nobody tries to use dynamic shared memory, and then throw
> up its guts. I don't think users will appreciate that.
>
> I am tempted to commit the latest version of this patch as I have it.

1. Do you think we should add information about the pg_dynshmem file at:
http://www.postgresql.org/docs/devel/static/storage-file-layout.html
That page contains information about all files/folders in the data directory.

2.
+/*
+ * Forget that a temporary file is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
+{

The function description above should say 'dynamic shmem segment' rather
than 'temporary file':
"Forget that a dynamic shmem segment is owned by a ResourceOwner"

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-10-14 11:41:14
Message-ID: CA+Tgmoa79r1b2gym=mJvfKEvt3Su5PiicqosUz0c1nK3LYmtDQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Oct 13, 2013 at 3:07 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> 1. Do you think we should add information about the pg_dynshmem file at:
> http://www.postgresql.org/docs/devel/static/storage-file-layout.html
> That page contains information about all files/folders in the data directory.
>
> 2.
> +/*
> + * Forget that a temporary file is owned by a ResourceOwner
> + */
> +void
> +ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
> +{
>
> The function description above should say 'dynamic shmem segment' rather
> than 'temporary file':
> "Forget that a dynamic shmem segment is owned by a ResourceOwner"

Good catches, will fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-10-14 15:11:05
Message-ID: CAA4eK1L9f4TnWPLemZipbF_mHabEPQGPtTH3Mb2QZ7tJmGWkRw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 14, 2013 at 5:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Oct 13, 2013 at 3:07 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> 1. Do you think we should add information about the pg_dynshmem file at:
>> http://www.postgresql.org/docs/devel/static/storage-file-layout.html
>> That page contains information about all files/folders in the data directory.
>>
>> 2.
>> +/*
>> + * Forget that a temporary file is owned by a ResourceOwner
>> + */
>> +void
>> +ResourceOwnerForgetDSM(ResourceOwner owner, dsm_segment *seg)
>> +{
>>
>> The function description above should say 'dynamic shmem segment' rather
>> than 'temporary file':
>> "Forget that a dynamic shmem segment is owned by a ResourceOwner"
>
> Good catches, will fix.

During testing, I found one issue in the Windows implementation.

During startup, when the server tries to create a new control segment for
dynamic shared memory, it loops until an unused identifier is found,
but the Windows implementation (dsm_impl_windows()) was returning an
error for EEXIST. That error is promoted to FATAL because it occurs
during postmaster startup, which prevents the server from starting.

Please find attached patch to fix the problem.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
fix_error_handling_eexist_windows.patch application/octet-stream 506 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-10-14 15:50:44
Message-ID: CA+Tgmob32a=dX1pAe3wxDe0=vS_4u0R4Wdux=MyugXjyxe388w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 14, 2013 at 11:11 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> During testing, I found one issue in the Windows implementation.
>
> During startup, when the server tries to create a new control segment for
> dynamic shared memory, it loops until an unused identifier is found,
> but the Windows implementation (dsm_impl_windows()) was returning an
> error for EEXIST. That error is promoted to FATAL because it occurs
> during postmaster startup, which prevents the server from starting.
>
> Please find attached patch to fix the problem.

Committed, thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company