Re: dynamic shared memory

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dynamic shared memory
Date: 2013-09-19 15:44:34
Message-ID: CA+TgmobKLeC8nYWaa6aTW5gvXktuRf6TRttoJbYroi=Nno4qUw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Maybe also check for shm_unlink or is that too absurd?

Well, if we find that there are systems where shm_open is present
and shm_unlink is not, then we'll need such a test. But I hesitate
to decide what the right thing to do on such systems is without
knowing more. And the time it takes to run configure is a very
substantial percentage of the complete build time, at least on my box,
so I don't really want to add tests just because they might be needed
somewhere.

>> --- /dev/null
>> +++ b/src/backend/storage/ipc/dsm.c
>> +#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
>> +#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
>
> Hm, I guess you don't want to add it to global/ or so because of the
> mmap implementation where you presumably scan the directory?

Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just
skip that directory completely. Actually, I was wondering if we ought
to have a directory under pgdata whose explicit charter it was to
contain files that shouldn't be copied as part of a base backup.
pg_do_not_back_this_up.

> Document that's backend local?
> And those are shared memory?

Sure, done.

>> + /* Determine size for new control segment. */
>> + maxitems = PG_DYNSHMEM_FIXED_SLOTS
>> + + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
>
> It seems likely that MaxConnections would be sufficient?

I think we could argue about the best way to set this until the cows
come home, but I don't think it matters much at this point.
We can always change the formula later as we gain experience.
However, I don't have a principled reason for assuming that only
user-connected backends will create dynamic shared memory segments.

> Comment that we loop endlessly to find an unused identifier.

Done.

> Why do we create the control segment in dynamic smem and not in the
> normal shmem? Presumably because this way it has the same lifetime? If
> so, that should be commented upon.

If it were part of the normal shared memory segment, I don't think
there'd be any good way to implement
dsm_cleanup_using_control_segment().

> So, we leave it just hanging around... Well, it has precedent in our
> normal shared memory handling.

And it could belong to some unrelated process.

> I still maintain that the extra infrastructure required isn't worth the
> gain of having the mmap implementation.

I know.

>> +/*
>> + * Read and parse the state file.
>> + *
>
> Perhaps CRC32 the content?

I don't see the point. If the file contents are garbage that happens
to look like a number, we'll go "oh, there isn't any such segment" or
"oh, there is such a segment but it doesn't look like a control
segment, so forget it". There are a lot of things we really ought to
be CRCing to avoid corruption risk, but I can't see how this is
remotely one of them.
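
To make that concrete, here's roughly the shape of the parse-and-shrug
logic I have in mind, as a stand-alone sketch rather than the actual
patch code (the function name and exact checks are just for
illustration): if the contents don't parse as a plausible handle, we
act as though there's no old control segment and move on.

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

static bool
sketch_parse_state_file(const char *buf, uint32_t *handle)
{
    char           *endptr;
    unsigned long   val;

    errno = 0;
    val = strtoul(buf, &endptr, 10);
    if (errno != 0 || endptr == buf || val == 0 || val > UINT32_MAX)
        return false;           /* garbage or zero: nothing to clean up */

    *handle = (uint32_t) val;
    return true;                /* caller still sanity-checks the segment itself */
}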

>> + /* Create or truncate the file. */
>> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>
> Doesn't this need a | PG_BINARY?

It's a text file. Do we need PG_BINARY anyway?

> Why are you using open() and not
> BasicOpenFile or even OpenTransientFile?

Because those don't work in the postmaster.

>
>> + /* Write contents. */
>> + snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
>> + (unsigned long) dsm_control_handle);
>
> Why are we upcasting the length of dsm_control_handle here? Also,
> doesn't this need the usual UINT64_FORMAT thingy?

dsm_handle is an alias for uint32. Is that always exactly an unsigned
int or can it sometimes be an unsigned long? I thought the latter, so I
couldn't figure out how to write this portably without casting to a
type that explicitly matched the format string.
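
For the record, here's the idiom reduced to a stand-alone sketch (the
typedef below is just a stand-in for the real dsm_handle): cast to
unsigned long so the argument is guaranteed to match %lu no matter
what uint32 happens to be on a given platform.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t dsm_handle;    /* stand-in for the patch's uint32 alias */

int
main(void)
{
    dsm_handle  dsm_control_handle = 12345;
    char        statebuf[32];

    /*
     * uint32 might be unsigned int on one platform and unsigned long on
     * another, so cast explicitly to the type named in the format string
     * instead of guessing at the right conversion specifier.
     */
    snprintf(statebuf, sizeof(statebuf), "%lu\n",
             (unsigned long) dsm_control_handle);
    fputs(statebuf, stdout);
    return 0;
}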

> But we don't even get called in that case, right? You're only registering
> the on_shmem_exit() handler after the same check in startup?

True. Removed.

> I'd rename dsm_control_segment_sane to dsm_control_segment_looks_sane ;)

Meh. I write too many really long function names as it is.

> I'd include the refcount here, might be helpful for debugging.

OK, done.

> Not sure whether it's sensible to only LOG in these cases. After all
> there's something unexpected happening. The robustness argument doesn't
> count since we're already shutting down.

I see no point in throwing an error. The fact that we're having
trouble cleaning up one dynamic shared memory segment doesn't mean we
shouldn't try to clean up others, or that any remaining postmaster
shutdown hooks shouldn't be executed.

>> + ereport(ERROR,
>> + (errcode(ERRCODE_INTERNAL_ERROR),
>> + errmsg("dynamic shared memory control segment is not valid")));
>
> Imo that's a PANIC or at the very least a FATAL.

Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point. There's no data corruption problem if we proceed
here. Most likely either (1) there's a bug in the code, which
panicking won't fix or (2) the DBA hand-edited the state file, in
which case maybe he shouldn't have done that, but if he thinks the
best way to recover from that is a cluster-wide restart, he can do
that himself.

>> +dsm_segment *
>> +dsm_create(uint64 size)
>
> Do we rely on being run in an environment with proper setup for lwlock
> cleanup? I can imagine shared libraries doing this pretty early on...

Yes, we rely on that. I don't really see that as a problem. You'd
better connect to the main shared memory segment before starting to
create your own.

> Why can we see that state? Shouldn't locking prevent that?

We initialize the refcnt to 2 on creation, so that it ends up as 1
when the last reference is gone. This prevents race conditions:
imagine that one process creates a DSM and then passes the handle to
some other process. But, before that second process gets around to
mapping it, the first process dies. At that point, there are no
remaining references to the dynamic shared memory segment, so it goes
away, as per spec. But what if we're in the middle of destroying when
the second process finally gets around to trying to map it? In such a
situation, the behavior would be implementation-specific. I felt that
it was better not to find out whether all operating systems and shared
memory types handle such cases similarly, because I'm pretty sure they
don't. The two-stage destruction process means that once we've
committed to destroying the segment, no one else will attempt to map
it.
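
If it helps, here's the protocol boiled down to a stand-alone sketch.
The names are invented for illustration and the locking is omitted;
the real code does all of this while holding the control-segment lock.

#include <stdbool.h>
#include <stdint.h>

/*
 * refcnt == 0   slot unused
 * refcnt == 1   condemned: destruction in progress, no new attaches
 * refcnt >= 2   one "segment exists" reference plus (refcnt - 1) mappings
 */
typedef struct sketch_slot
{
    uint32_t    handle;
    uint32_t    refcnt;
} sketch_slot;

static void
sketch_create(sketch_slot *slot, uint32_t handle)
{
    slot->handle = handle;
    slot->refcnt = 2;           /* the segment itself + the creator's mapping */
}

static bool
sketch_attach(sketch_slot *slot)
{
    if (slot->refcnt <= 1)
        return false;           /* unused, or already committed to destruction */
    slot->refcnt++;
    return true;
}

/* Returns true if the caller must now physically destroy the segment. */
static bool
sketch_detach(sketch_slot *slot)
{
    if (--slot->refcnt == 1)
        return true;            /* leave refcnt at 1 so nobody can re-attach */
    return false;
}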

>> +void *
>> +dsm_resize(dsm_segment *seg, uint64 size)
>> +{
>> + Assert(seg->control_slot != INVALID_CONTROL_SLOT);
>> + dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
>> + &seg->mapped_address, &seg->mapped_size, ERROR);
>> + return seg->mapped_address;
>> +}
>
> Hm. That's valid when there are other backends attached? What are the
> implications for already attached ones?

They'll continue to see the portion they have mapped, but must do
dsm_remap() if they want to see the whole thing.

> Shouldn't we error out if !dsm_impl_can_resize()?

The implementation-specific code throws an error if it can't support
resize. Even if we put a secondary check here, I wouldn't want
dsm_impl_op to behave in an undefined manner when asked to resize
under an implementation that can't. And there doesn't seem to be much
point in having two checks.

>> +void
>> +dsm_detach(dsm_segment *seg)
>> +{
>
> Why do we want to ignore errors like failing to unmap? ISTM that
> indicates an actual problem...

Sure it does. But what are you going to do about it? In many cases,
you're going to get here during a transaction abort caused by some
other error. If the transaction is already aborting, throwing an
error here will just cause the original error to get discarded in
favor of showing this one, or maybe it's the other way around. I
don't remember, but it's definitely one or the other, and neither is
desirable. Throwing a warning, on the other hand, will notify the
user, which is what we want.

Now on the flip side we might not be aborting; maybe we're committing.
But we don't want to turn a commit into an abort just for this. If
resowner.c detects a buffer pin leak or a tuple descriptor leak, those
are "just" warning as well. They're serious warnings, of course, and
if they happen it means there's a bug in the code that needs to be
fixed. But the severity of an ereport() isn't based just on how
alarming the situation is; it's based on what you want to happen when
that situation comes up. And we've decided (correctly, I think) that
resource leaks are not grounds for aborting a transaction that
otherwise would have committed.
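
Concretely, what I have in mind for the detach path is along these
lines (a sketch of the elevel choice only, not the exact code in the
patch):

    /*
     * An unmap failure here is surprising and worth telling the user about,
     * but it shouldn't turn a commit into an abort, or clobber the error
     * that caused an abort already in progress, so WARNING rather than ERROR.
     */
    if (seg->mapped_address != NULL &&
        munmap(seg->mapped_address, seg->mapped_size) != 0)
        ereport(WARNING,
                (errcode_for_file_access(),
                 errmsg("could not unmap shared memory segment: %m")));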

> Yuck. So that's the answer to my earlier question about the legality of
> seeing a refcount of 1....

Read it and weep.

> *same

Fixed, thanks.

>> + * several implementations of this facility. Many systems implement
>> + * POSIX shared memory (shm_open etc.), which is well-suited to our needs
>> + * in this area, with the exception that shared memory identifiers live
>> + * in a flat system-wide namespace, raising the uncomfortable prospect of
>> + * name collisions with other processes (including other copies of
>> + * PostgreSQL) running on the same system.
>
> Why isn't the port number part of the posix shmem identifiers? Sure, we
> retry, but using a logic similar to sysv_shmem.c seems like a good idea.

According to the man page for shm_open on Solaris, "For maximum
portability, name should include no more than 14 characters, but this
limit is not enforced."

http://www.unix.com/man-page/OpenSolaris/3c/shm_open/

I'm unclear whether there are any real systems that have a problem with this.

>> +bool
>> +dsm_impl_can_resize(void)
>> +{
>> + switch (dynamic_shared_memory_type)
>> + {
>> + case DSM_IMPL_NONE:
>> + return false;
>> + case DSM_IMPL_SYSV:
>> + return false;
>> + case DSM_IMPL_WINDOWS:
>> + return false;
>> + default:
>> + return true;
>> + }
>> +}
>
> Looks to me like the logic should be the reverse.

I've changed it to list all the cases explicitly.
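
That is, something along these lines, with every implementation named
and no default silently meaning "yes" (the exact set of cases here is
from memory, so treat it as a sketch):

bool
dsm_impl_can_resize(void)
{
    switch (dynamic_shared_memory_type)
    {
        case DSM_IMPL_NONE:
            return false;
        case DSM_IMPL_POSIX:
            return true;
        case DSM_IMPL_SYSV:
            return false;
        case DSM_IMPL_WINDOWS:
            return false;
        case DSM_IMPL_MMAP:
            return true;
    }
    return false;               /* unreachable, but keeps the compiler quiet */
}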

>> + char name[64];
>> + int flags;
>> + int fd;
>> + char *address;
>> +
>> + snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);
>
> Why wider than the handle?

Same as above - not sure that uint32 == unsigned int everywhere.

>> + /*
>> + * If we're reattaching or resizing, we must remove any existing mapping,
>> + * unless we've already got the right thing mapped.
>> + */
>> + if (*mapped_address != NULL)
>> + {
>> + if (*mapped_size == request_size)
>> + return true;
>
> Hm. It could have gotten resized to the old size, or resized twice. In
> that case it might not be at the same address as before, so checking for
> the size doesn't seem to be sufficient.

I don't understand your concern. If someone resizes the DSM to its
already-current size, there is no need to remap it. The old mapping
is just fine. And if some other backend resizes the DSM to a larger
size and then back to the original size, and then we're asked to
update the mapping, there is no need to change anything.

> Hm. You're using elog(elevel) here, but the retry code in dsm_create() passes
> in ERROR?

Oh, that's bad. Fixed.

>> +static int
>> +errcode_for_dynamic_shared_memory()
>> +{
>> + if (errno == EFBIG || errno == ENOMEM)
>> + return errcode(ERRCODE_OUT_OF_MEMORY);
>> + else
>> + return errcode_for_file_access();
>> +}
>
> Is EFBIG guaranteed to be defined?

I dunno. We could put an #ifdef around that part. Should we do that
now or wait and see if it actually breaks anywhere?
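
If it does break somewhere, the fix is mechanical; something like this
(sketch only):

static int
errcode_for_dynamic_shared_memory(void)
{
#ifdef EFBIG
    if (errno == EFBIG || errno == ENOMEM)
#else
    if (errno == ENOMEM)
#endif
        return errcode(ERRCODE_OUT_OF_MEMORY);
    else
        return errcode_for_file_access();
}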

> Not really an issue, but this will grow owner->dsm unnecessarily because
> ResourceOwnerEnlargeDSMs() will have been done previously.

I tried to copy the existing uses of resowner.c as closely as
possible; if you think that there's something I should be doing to
mimic it more closely, let me know.

> Not sure yet how happy I am with the separation of concerns between
> dsm.c and dsm_impl.c...

I was hoping for a cleaner abstraction break, but I couldn't make it
work out any better than this. Even so, I think it's worthwhile
having two files; an imperfect separation of concerns still seems
better than concatenating them into one really long file. (FWIW, the
combined file would be longer than 84% of the 622 .c files in the
backend; as a fan of keeping .c files relatively small, I'm not eager
to be the cause of us having more large ones.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
dynshmem-v3.patch application/octet-stream 84.6 KB
