From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: same-address mappings vs. relative pointers
Date: 2013-12-05 04:32:27
Message-ID: CA+TgmobbpRSJsjf9O18sfkNNh58HfGCoDO2Be8dA+eijWBjvTw@mail.gmail.com
Lists: pgsql-hackers

During development of the dynamic shared memory facility, Noah and I
spent a lot of time arguing about whether it was practical to ensure
that a dynamic shared memory segment got mapped at the same address in
every backend that used it. The argument went something like this:

Me: We'll never be able to make that work reliably.
Noah: But if we can't use pointers in the dynamic shared memory
segment, our lives will suck.
Me: Well, we probably don't really NEED to use data structures that
contain pointers all THAT much.
Noah: Are you nuts? Of course we need pointers. They're ubiquitous
and essential.
Me: Meh.

I felt somewhat vindicated when I finished the dynamic shared memory
message queuing patch (which I'm still hoping someone will review
sometime soon...?) since that constitutes a useful chunk of
functionality that doesn't care about pointers at all. And there are
surely other examples that fall into the same category; for example,
an lwlock doesn't contain any pointers today, so storing one in a
dynamic shared memory segment in an address that might vary from one
process to another ought to work OK, too. I think there was actually
a patch a few years ago that made everything use LWLock * rather than
LWLockId, which would allow considerably more flexibility in laying
out lwlocks in memory - so you could for example try to put them in the
same cache lines as the data they protect, or different cache lines
than other hot lwlocks - and would probably also be almost enough to
allow placing them in a dynamic shared memory segment, which would be
useful.

But I'm also learning painfully that this kind of thing only goes so
far. For example, I spent some time looking at what it would take to
provide a dynamic shared memory equivalent of palloc/pfree, a facility
that I feel fairly sure would attract a few fans. Well, for regular
palloc, we store a pointer to the memory context before the beginning
of the chunk, so that when you call pfree you can find the memory
context and stick the chunk on one of its free lists. So there are
two pointers there: the pointer to the context is a pointer, of
course, but so is the free list. Heh, heh.
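Roughly, and much simplified (hypothetical names, not the real aset.c), the layout in question looks like this; both pointers are absolute, which is exactly what breaks if the segment maps at different addresses in different processes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A header hidden just before each chunk points back at the owning
 * context, and freed chunks are threaded onto the context's free list. */
typedef struct ChunkHeader ChunkHeader;

typedef struct MemoryContextData
{
    ChunkHeader *freelist;          /* absolute pointer */
} MemoryContextData;

struct ChunkHeader
{
    MemoryContextData *context;     /* absolute pointer */
    ChunkHeader       *next_free;   /* valid only while on the free list */
};

static void *
ctx_alloc(MemoryContextData *ctx, size_t size)
{
    ChunkHeader *hdr;

    if (ctx->freelist != NULL)      /* recycle a freed chunk (sizes ignored) */
    {
        hdr = ctx->freelist;
        ctx->freelist = hdr->next_free;
    }
    else
        hdr = malloc(sizeof(ChunkHeader) + size);
    hdr->context = ctx;
    return hdr + 1;                 /* user data starts after the header */
}

static void
ctx_free(void *p)
{
    ChunkHeader       *hdr = ((ChunkHeader *) p) - 1;
    MemoryContextData *ctx = hdr->context;  /* found via the hidden header */

    hdr->next_free = ctx->freelist;
    ctx->freelist = hdr;
}
```

So pfree() only works because it can follow those two embedded absolute pointers.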

As I see it, if we want to have facilities like this, we'll have to
either (1) make same-address mappings work for as many architectures
as possible and don't support these facilities on the remainder or (2)
use relative pointers instead of absolute pointers within dynamic
shared memory segments, which means a loss of performance, notational
clarity, and type-safety. We can also (3) adopt both approaches -
some facilities can use relative pointers, which will be portable
everywhere but annoying otherwise, and others can work only when
same-address mappings are supported. Or we can (4) adopt neither
approach, and confine ourselves to data structures that don't use
pointers.

I still have mixed feelings about the idea of same-address mappings.
On the one hand, on 64-bit architectures, there's a huge amount of
address space available. Assuming the OS does something even vaguely
sensible in terms of laying out the text, heap, stack, and shared
library mappings, there ought to be many petabytes of address space
that never gets touched, and it's hard to see why we couldn't find
some place in there to stick our stuff. But that could require quite
a bit of OS-specific knowledge about how memory gets laid out. One
idea that I think is originally Noah's, though I may be mutilating it,
is to create a very large PROT_NONE mapping in the postmaster and then
overwrite that mapping with the mapping for any dynamic shared memory
segments we subsequently want to create. In that way, we essentially
reserve the address space we want to use before the child is forked
and things start to diverge (due to memory allocations, additional
shared library loads, etc.). But I bet that on at least some
operating systems that will actually allocate memory, or at least
count toward the system's notion of overcommit, and that will be a
problem. So I don't have any really good idea for how to implement
this cleanly.
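In plain POSIX terms the reservation trick sketches out like this (illustrative names, no error cleanup; a real implementation would map MAP_SHARED dsm segments rather than anonymous private memory): the postmaster reserves a large PROT_NONE region before forking, children inherit it, and segments are later mapped over pieces of it with MAP_FIXED so they land at the same address everywhere.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *reserved_base = NULL;

static int
reserve_address_space(size_t size)
{
    /* PROT_NONE: reserve address space without (in theory) committing
     * memory -- though, as noted above, some OSs may still count this
     * against overcommit limits. */
    reserved_base = mmap(NULL, size, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return reserved_base != MAP_FAILED;
}

static void *
map_segment_at(size_t offset, size_t size)
{
    /* MAP_FIXED atomically replaces the PROT_NONE pages at that address. */
    return mmap((char *) reserved_base + offset, size,
                PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```

Whether the PROT_NONE mapping stays free of overcommit accounting is exactly the OS-specific question raised above.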

Now, on the other hand, as far as dynamic shared memory allocation and
freeing is concerned, there aren't really THAT many places where I
need a pointer, so using Size or uint64 or something to store an
offset instead is annoying, but I have an idea how to do this that
only uses pointers in a couple of places, so I think it can be made to
work. I am not sure how much complaint that will provoke, though.
And even if I do it, the first poor slob that wants to store a linked
list or so in dynamic shared memory is going to be unhappy if they
can't get a same-address mapping. Maybe that's OK; using linked lists
in shared memory might not be a great idea in the first place. I'm
sure there will be more than one person who wants to do it, though.

Any thoughts on what the least painful compromise is here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 09:56:41
Message-ID: 20131205095641.GF28793@alap2.anarazel.de
Lists: pgsql-hackers

Hi Robert,

On 2013-12-04 23:32:27 -0500, Robert Haas wrote:
> But I'm also learning painfully that this kind of thing only goes so
> far. For example, I spent some time looking at what it would take to
> provide a dynamic shared memory equivalent of palloc/pfree, a facility
> that I feel fairly sure would attract a few fans.

I have to admit, I don't fully see the point in a real palloc()
equivalent for shared memory. If you're allocating shared memory in a
granular fashion at somewhat high frequency you're doing something
*wrong*. And the result is not going to be scalable.

> Well, for regular
> palloc, we store a pointer to the memory context before the beginning
> of the chunk, so that when you call pfree you can find the memory
> context and stick the chunk on one of its free lists. So there are
> two pointers there: the pointer to the context is a pointer, of
> course, but so is the free list. Heh, heh.

This is one of those times where life really, really would be easier if
we were using C++. Just have a small smart-pointer wrapper, and we
wouldn't need to worry too much.

> I still have mixed feelings about the idea of same-address mappings.
> On the one hand, on 64-bit architectures, there's a huge amount of
> address space available. Assuming the OS does something even vaguely
> sensible in terms of laying out the text, heap, stack, and shared
> library mappings, there ought to be many petabytes of address space
> that never gets touched, and it's hard to see why we couldn't find
> some place in there to stick our stuff.

It's actually quite a bit away from petabytes. Most x86-64 CPUs have
48 bits of virtual address space, with the OS eating up a fair bit of
that (so it doesn't need to tear down TLBs in system calls). I think
cross-platform you end up with something around 8TB, up to 64TB on
other OSs.
Now, there are some CPUs coming out with bigger virtual memory sizes,
but it's going to be longer than I am willing to wait till we are there.

> Now, on the other hand, as far as dynamic shared memory allocation and
> freeing is concerned, there aren't really THAT many places where I
> need a pointer, so using Size or uint64 or something to store an
> offset instead is annoying, but I have an idea how to do this that
> only uses pointers in a couple of places, so I think it can be made to
> work. I am not sure how much complaint that will provoke, though.
> And even if I do it, the first poor slob that wants to store a linked
> list or so in dynamic shared memory is going to be unhappy if they
> can't get a same-address mapping. Maybe that's OK; using linked lists
> in shared memory might not be a great idea in the first place. I'm
> sure there will be more than one person who wants to do it, though.

Just because people want it doesn't make it worthwhile to provide it ;)

> Any thoughts on what the least painful compromise is here?

I think a reasonable route is having some kind of smart-pointer on C
level that abstracts away the offset math and allows one to use pointers
locally. Something like void *sptr_deref(sptr *); where the returned
pointer can be used as long as it is purely in memory. And
sptr_update(sptr *, void *); which allows a sptr to point elsewhere in
the same segment.
+ lots of smarts

I continue to believe that a) using pointers in dynamically allocated
segments is going to end up with lots of pain. b) the pain from not
having real pointers is manageable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 12:44:27
Message-ID: CA+Tgmoanps9DAmVb7XUNP5nrV01A_PzpvsLSjJKnGG_Azx2Zog@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 4:56 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi Robert,
>
> On 2013-12-04 23:32:27 -0500, Robert Haas wrote:
>> But I'm also learning painfully that this kind of thing only goes so
>> far. For example, I spent some time looking at what it would take to
>> provide a dynamic shared memory equivalent of palloc/pfree, a facility
>> that I feel fairly sure would attract a few fans.
>
> I have to admit, I don't fully see the point in a real palloc()
> equivalent for shared memory. If you're allocating shared memory in a
> granular fashion at somewhat high frequency you're doing something
> *wrong*. And the result is not going to be scalable.

Why? Lots of people have written lots of programs that do just that.
I agree it's overdone, but saying it should never happen seems like an
over-rotation in the opposite direction.

In any case, the application that led me to this was actually parallel
internal sort. Let's suppose we had an implementation of parallel
internal sort. You'd load all the tuples you want to sort into
dynamic shared memory (just as we now load them into palloc'd chunks),
start a bunch of workers, and the original backend and the
newly-started workers would cooperate to sort your data. Win.

It's fairly clear that, technically speaking, you do NOT need a real
allocator for this. You can just shove the first tuple that you store
into shared memory, say right up against the end of the region. Then
you can place the next tuple immediately before it and the following
one just before that, and so on. This is very simple to implement and
also extremely memory-efficient, so superficially it's quite an
appealing approach.
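That tight-packing scheme amounts to a downward bump allocator (hypothetical names; it hands back offsets rather than pointers, so it works no matter where each process maps the segment):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct TupleSpace
{
    char   *base;       /* start of the shared region */
    size_t  size;       /* total size of the region */
    size_t  used_from;  /* offset of the lowest-addressed stored tuple */
} TupleSpace;

static void
tuplespace_init(TupleSpace *ts, char *base, size_t size)
{
    ts->base = base;
    ts->size = size;
    ts->used_from = size;       /* nothing stored yet */
}

/* Store a tuple; returns its offset, or (size_t) -1 if out of space.
 * Note what's missing: there is no way to free an individual tuple. */
static size_t
tuplespace_push(TupleSpace *ts, const void *tuple, size_t len)
{
    if (ts->used_from < len)
        return (size_t) -1;
    ts->used_from -= len;
    memcpy(ts->base + ts->used_from, tuple, len);
    return ts->used_from;
}
```

The single "lowest used offset" is the entire bookkeeping, which is why the approach is so memory-efficient - and why it can't reuse freed space.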

But now let's suppose the input data was estimated to fit in work_mem
but in the end did not, and therefore we need to instead do an
external sort. That means we're going to start evicting tuples from
memory to make room for new tuples read from the input stream.
Well, now our strategy of tight-packing everything does not look so
good, because we have no way of tracking which tuples we no longer
need and reusing that space for new tuples. If external sort is not
something we know how to do in parallel, we can potentially copy all
of the tuples out of the dynamic shared memory segment and back to
backend-private memory in palloc'd chunks, and then deallocate the
dynamic shared memory segment... but that's potentially expensive if
work_mem is large, and it temporarily uses double what we're allowed
by work_mem.

And suppose we want to do a parallel external sort with, say, the
worker backend pushing tuples into a heap stored in the dynamic shared
memory segment and other backends writing them out to tapes and
removing them from the heap. You can't do that without some kind of
space management, i.e. an allocator.

> This is one of those times where life really, really would be easier if
> we were using C++. Just have a small smart-pointer wrapper, and we
> wouldn't need to worry too much.

Yes, a template class would be perfect here.

>> I still have mixed feelings about the idea of same-address mappings.
>> On the one hand, on 64-bit architectures, there's a huge amount of
>> address space available. Assuming the OS does something even vaguely
>> sensible in terms of laying out the text, heap, stack, and shared
>> library mappings, there ought to be many petabytes of address space
>> that never gets touched, and it's hard to see why we couldn't find
>> some place in there to stick our stuff.
>
> It's actually quite a bit away from petabytes. Most x86-64 CPUs have
> 48 bits of virtual address space, with the OS eating up a fair bit of
> that (so it doesn't need to tear down TLBs in system calls). I think
> cross-platform you end up with something around 8TB, up to 64TB on
> other OSs.
> Now, there are some CPUs coming out with bigger virtual memory sizes,
> but it's going to be longer than I am willing to wait till we are there.

Oh. Well, that's a shame.

>> Any thoughts on what the least painful compromise is here?
>
> I think a reasonable route is having some kind of smart-pointer on C
> level that abstracts away the offset math and allows one to use pointers
> locally. Something like void *sptr_deref(sptr *); where the returned
> pointer can be used as long as it is purely in memory. And
> sptr_update(sptr *, void *); which allows a sptr to point elsewhere in
> the same segment.
> + lots of smarts

Can you elaborate on this a little bit? I'm not sure I understand
what you're getting at here.

> I continue to believe that a) using pointers in dynamically allocated
> segments is going to end up with lots of pain. b) the pain from not
> having real pointers is manageable.

Fair opinion, but I think we will certainly need to pass around memory
offsets in some form for certain things. Even for purely internal
parallel sort, the Tuplesort objects need to store the offset of the
actual tuple within the segment. My first guess as to how to do that
was to assume that sizeof(Size) == sizeof(void *) and just store
offsets as a Size, but then I thought maybe it'd be better to do
something like typedef Size relptr. And then I thought, boy, it sucks
not to be able to declare what kind of a thing we're pointing *at*
here, but apart from using C++ I see no solution to that problem. I
guess we could do something like this:

#define relptr(type) Size

So the compiler wouldn't enforce anything, but at least notationally
we'd know what sort of object we were supposedly referencing.
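As a minimal illustration of that notational trick (HeapTupleData is just a stand-in referent type here; the macro discards its argument entirely):

```c
#include <assert.h>
#include <stddef.h>

typedef size_t Size;

/* relptr(type) expands to a bare offset, so the compiler enforces
 * nothing, but declarations at least say what they mean to reference. */
#define relptr(type) Size

typedef struct SortedTuple
{
    relptr(HeapTupleData) tuple;    /* offset of the tuple in the segment */
    int                   sortkey;
} SortedTuple;

/* Turning an offset back into a pointer still needs the segment base. */
static void *
relptr_resolve(char *segment_base, Size off)
{
    return segment_base + off;
}
```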

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 13:57:22
Message-ID: 52A08642.7020401@vmware.com
Lists: pgsql-hackers

On 12/05/2013 06:32 AM, Robert Haas wrote:
> During development of the dynamic shared memory facility, Noah and I
> spent a lot of time arguing about whether it was practical to ensure
> that a dynamic shared memory segment got mapped at the same address in
> every backend that used it.

My vote goes for not trying to map at same address. I don't see how you
could do that reliably, and I don't see much need for it anyway.

That said, it naturally depends on what you're going to use the dynamic
shared memory facility for. It's the same problem I have with reviewing
the already-committed DSM patch and the message queue patch. The patches
look fine as far as they go, but I have the nagging feeling that there
are a bunch of big patches coming up later that use the facilities, and
I can't tell if the facilities are over-engineered for what's actually
needed, or not sufficient.

As a side-note, I've been thinking that we don't really need
same-address mapping for shared_buffers either. Getting rid of it
wouldn't buy us anything right now, but if we wanted e.g to make
shared_buffers changeable without a restart, that would be useful.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 14:00:42
Message-ID: 20131205140042.GG14419@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-12-05 15:57:22 +0200, Heikki Linnakangas wrote:
> As a side-note, I've been thinking that we don't really need same-address
> mapping for shared_buffers either. Getting rid of it wouldn't buy us
> anything right now, but if we wanted e.g to make shared_buffers changeable
> without a restart, that would be useful.

I doubt it's that easy to get rid of atm (at least in !EXEC_BACKEND), but
if we ever want to properly support ASLR in EXEC_BACKEND environments, we
might need to go there. The hacks Windows does around it are already
quite ugly.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 14:44:34
Message-ID: 20131205144434.GG12398@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-12-05 07:44:27 -0500, Robert Haas wrote:
> On Thu, Dec 5, 2013 at 4:56 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > Hi Robert,
> >
> > On 2013-12-04 23:32:27 -0500, Robert Haas wrote:
> >> But I'm also learning painfully that this kind of thing only goes so
> >> far. For example, I spent some time looking at what it would take to
> >> provide a dynamic shared memory equivalent of palloc/pfree, a facility
> >> that I feel fairly sure would attract a few fans.
> >
> > I have to admit, I don't fully see the point in a real palloc()
> > equivalent for shared memory. If you're allocating shared memory in a
> > granular fashion at somewhat high frequency you're doing something
> > *wrong*. And the result is not going to be scalable.

> Why? Lots of people have written lots of programs that do just that.

Well, but we're a database, not a generic programming library ;)

> I agree it's overdone, but saying it should never happen seems like an
> over-rotation in the opposite direction.

That might be true.

> But now let's suppose the input data was estimated to fit in work_mem
> but in the end did not, and therefore we need to instead do an
> external sort. That means we're going to start evicting tuples from
> memory to make room for new tuples read from the input stream.
> Well, now our strategy of tight-packing everything does not look so
> good, because we have no way of tracking which tuples we no longer
> need and reusing that space for new tuples. If external sort is not
> something we know how to do in parallel, we can potentially copy all
> of the tuples out of the dynamic shared memory segment and back to
> backend-private memory in palloc'd chunks, and then deallocate the
> dynamic shared memory segment... but that's potentially expensive if
> work_mem is large, and it temporarily uses double what we're allowed
> by work_mem.

But what's your alternative if you have a shared_palloc() like thingy?
The overhead is going to be crazy if you allocate so granularly that you
can easily grow, and if you allocate on a coarse level, this isn't going
to buy you much?

Essentially, I am not arguing against a facility to dynamically allocate
memory from shared memory, but I think the tradeoffs are different
enough that I think palloc()/mcxt.c isn't necessarily the thing to model
it after. But then, I don't really have an idea of how it should look,
not having thought about it much.

> And suppose we want to do a parallel external sort with, say, the
> worker backend pushing tuples into a heap stored in the dynamic shared
> memory segment and other backends writing them out to tapes and
> removing them from the heap. You can't do that without some kind of
> space management, i.e. an allocator.

Hm. I'd have thought one would implement that by pushing fixed size
amounts of tuples to individual workers, let them sort each chunk in
memory and then write the result out to disk.

> >> Any thoughts on what the least painful compromise is here?
> >
> > I think a reasonable route is having some kind of smart-pointer on C
> > level that abstracts away the offset math and allows to use pointers
> > locally. Something like void *sptr_deref(sptr *); where the returned
> > pointer can be used as long as it is purely in memory. And
> > sptr_update(sptr *, void *); which allows a sptr to point elsewhere in
> > the same segment.
> > + lots of smarts
>
> Can you elaborate on this a little bit? I'm not sure I understand
> what you're getting at here.

So, my thought is that we really want something that acts like a
pointer, just in shared memory, so we don't have to do too much offset
math. And I think for performance and readability we want to use
pointers when operating locally in a backend.

So, I guess there are situations in which we essentially want pointers:
a) pointing somewhere in the same segment.
b) pointing at an offset in *another* segment.

So, for a), what if we, instead of storing plain offsets, have something
like:

typedef struct sptr
{
    Offset sptr_self;
    Offset sptr_offset;
} sptr;

and always store that in dynamically allocated shared memory instead of
raw offsets.
That allows us to define a function like:

void *
sptr_deref(sptr *sp)
{
    return ((char *) sp) - sp->sptr_self + sp->sptr_offset;
}

to get what we point to, without having to reference the segment base,
at the price of having to store two offsets.
To update the pointer we could have:

void
sptr_set(sptr *sp, void *p)
{
    char *segment_base;

    segment_base = ((char *) sp) - sp->sptr_self;

    /* make sure pointer doesn't point below the beginning of the segment */
    Assert(segment_base <= (char *) p);

    /* make sure pointer doesn't point beyond the segment */
    Assert((char *) p < segment_base + GetSegmentSize(segment_base));

    sp->sptr_offset = ((char *) p) - segment_base;
}

To initialize we'd have:

void
sptr_init(dsm_segment *segment, sptr *sp, void *p)
{
    sp->sptr_self = ((char *) sp) - (char *) segment->mapped_address;
    sptr_set(sp, p);
}

for b) we could have:

typedef struct sptr_far
{
    dsm_handle far_handle;
    sptr far_ptr;
} sptr_far;

and equivalent sptr_far_deref/set/init. Although that would probably
require a more efficient way to resolve a dsm_handle to its dsm_segment.

Makes at least some sense?
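Condensed into a self-contained, compilable form (the segment is simulated with an ordinary buffer, and the bounds Asserts and dsm plumbing are left out), the same-segment scheme works like this:

```c
#include <assert.h>
#include <stddef.h>

typedef size_t Offset;

/* Because sptr_self records where the sptr itself lives within the
 * segment, dereferencing needs no access to the segment base at all. */
typedef struct sptr
{
    Offset sptr_self;       /* offset of this sptr within the segment */
    Offset sptr_offset;     /* offset of the target within the segment */
} sptr;

static void *
sptr_deref(sptr *sp)
{
    return ((char *) sp) - sp->sptr_self + sp->sptr_offset;
}

static void
sptr_init(char *segment_base, sptr *sp, void *p)
{
    sp->sptr_self = ((char *) sp) - segment_base;
    sp->sptr_offset = ((char *) p) - segment_base;
}
```

Only differences of addresses within the segment are stored, so every mapping of the segment computes the same answer.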

> > I continue to believe that a) using pointers in dynamically allocated
> > segments is going to end up with lots of pain. b) the pain from not
> > having real pointers is manageable.
>
> Fair opinion, but I think we will certainly need to pass around memory
> offsets in some form for certain things.

Absolutely - I am not sure where the "but" is coming from ;)

> And then I thought, boy, it sucks
> not to be able to declare what kind of a thing we're pointing *at*
> here, but apart from using C++ I see no solution to that problem. I
> guess we could do something like this:
>
> #define relptr(type) Size
>
> So the compiler wouldn't enforce anything, but at least notationally
> we'd know what sort of object we were supposedly referencing.

There might be some ugly compiler dependent magic we could do. Depending
on how we decide to declare offsets. Like (very, very roughly)

#define relptr(type, struct_name, varname) union struct_name##_##varname { \
    type relptr_type; \
    Offset relptr_off; \
}

And then, for accessing have:
#define relptr_access(seg, off) \
    ((typeof(off.relptr_type) *) (((char *) (seg)->base_address) + (off).relptr_off))

But boy, that's ugly.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 14:49:42
Message-ID: 20131205144942.GH12398@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-12-05 15:44:34 +0100, Andres Freund wrote:
> On 2013-12-05 07:44:27 -0500, Robert Haas wrote:
> > And then I thought, boy, it sucks
> > not to be able to declare what kind of a thing we're pointing *at*
> > here, but apart from using C++ I see no solution to that problem. I
> > guess we could do something like this:
> >
> > #define relptr(type) Size
> >
> > So the compiler wouldn't enforce anything, but at least notationally
> > we'd know what sort of object we were supposedly referencing.
>
> There might be some ugly compiler dependent magic we could do. Depending
> on how we decide to declare offsets. Like (very, very roughly)
>
> #define relptr(type, struct_name, varname) union struct_name##_##varname { \
>     type relptr_type; \
>     Offset relptr_off; \
> }
>
> And then, for accessing have:
> #define relptr_access(seg, off) \
>     ((typeof(off.relptr_type) *) (((char *) (seg)->base_address) + (off).relptr_off))
>
> But boy, that's ugly.

On second thought - there's probably no reason to name the union, making
it somewhat less ugly.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 15:00:55
Message-ID: CA+TgmoYkbNZiq7vLYyAL4--NE4T6626CR4-i3K7_yjptnuYbcg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 8:57 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 12/05/2013 06:32 AM, Robert Haas wrote:
>> During development of the dynamic shared memory facility, Noah and I
>> spent a lot of time arguing about whether it was practical to ensure
>> that a dynamic shared memory segment got mapped at the same address in
>> every backend that used it.
>
> My vote goes for not trying to map at same address. I don't see how you
> could do that reliably, and I don't see much need for it anyway.
>
> That said, it naturally depends on what you're going to use the dynamic
> shared memory facility for. It's the same problem I have with reviewing the
> already-committed DSM patch and the message queue patch. The patches look
> fine as far as they go, but I have the nagging feeling that there are a
> bunch of big patches coming up later that use the facilities, and I can't
> tell if the facilities are over-engineered for what's actually needed, or
> not sufficient.

Sure, well, that's one of the challenges of doing any sort of
large software development project. One rarely knows at the outset
all of the problems that one will encounter before finishing said
project. For small projects, you can usually predict these things
pretty well, but as the scope of the project grows, it becomes more and
more difficult to know whether you've made the right initial steps.
That having been said, I'm pretty confident in the steps taken thus
far - but if you're imagining that I have all the answers completely
worked out and am choosing to reveal them only bit by bit, it's not
like that.

If you want to see the overall vision for this project, see
https://wiki.postgresql.org/wiki/Parallel_Sort . Here, I expect to
use dynamic shared memory for three purposes. First, I expect to use
a shared memory message queue to propagate any error or warning
messages generated in the workers back to the user backend. That's
the point of introducing that infrastructure now, though I hope it
will eventually also be suitable for streaming tuples between
backends, so that you can run one part of the query tree in one
backend and stream the output to a different backend that picks it up
and processes it further. We could surely contrive a simpler solution
just for error messages, but I viewed that as short-sighted. Second,
I expect to store the SortTuple array, or some analogue of it, in
dynamic shared memory. Third, I expect to store the tuples themselves
in dynamic shared memory.

Down the road, I imagine wanting to put hash tables in shared memory,
so we can parallelize things like hash joins and hash aggregates.

> As a side-note, I've been thinking that we don't really need same-address
> mapping for shared_buffers either. Getting rid of it wouldn't buy us
> anything right now, but if we wanted e.g to make shared_buffers changeable
> without a restart, that would be useful.

Very true. One major obstacle to that is that changing the size of
shared_buffers also means resizing the LWLock array and the buffer
descriptor array. If we got rid of the idea of having lwlocks in
their own data structure and moved the buffer lwlocks into the buffer
descriptors, that would get us down to two segments, but that still
feels like one too many. There's also the problem of the buffer
mapping hash table, which would need to grow somehow as well.
Technically, the size of the fsync absorb queue also depends on
shared_buffers, but we could decide not to care about that, I think.

The other problem here is that once you do implement all this, a
reference to a buffer beyond what your backend has mapped will
necessitate an unmap and remap of the shared-buffers segment. If the
remap fails, and you hold any buffer pins, you will have to PANIC.
There could be some performance overhead from inserting bounds checks
in all the right places too, although there might not be enough places
to matter, since a too-high buffer number can only come from a limited
number of places - either a lookup in the buffer mapping table, or a
buffer allocation event.

I don't mention any of these things to discourage you from working on
the problem, but rather because I've thought about it too - and the
aforementioned problems are the things that have stumped me so
far. If you've got ideas about how to solve them, or even better yet
want to implement something, great!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 15:17:18
Message-ID: CA+TgmoYQnaBi=3gzvHVgV6ZAdqvA4m=6XXKQsn7g_RmB2gz-0Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 5, 2013 at 9:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> Why? Lots of people have written lots of programs that do just that.
>
> Well, but we're a database, not a generic programming library ;)

I think we're arguably both.

> But what's your alternative if you have a shared_palloc() like thingy?
> The overhead is going to be crazy if you allocate so granular that you
> can easily grow, and if you allocate on a coarse level, this isn't going
> to buy you much?

I don't know if you've noticed, but the overhead of palloc is pretty
much insane for large memory contexts right now. Consider a bunch of
100-byte tuples that you want to sort, on a 64-bit machine. IIUC,
we're going to round the request up to 128 bytes and then add 16 bytes
of overhead for 2 pointers before the beginning of the chunk. That's
44% overhead. For small memory contexts like the ones that store
syscache entries it's really hard to do any better, but there are much
better algorithms (which I am investigating) for cases where you
expect to allocate a large amount of memory. 44% overhead on a 1kB
context is no big deal; on 1GB it's a big deal.
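
To make that arithmetic concrete, here is a minimal sketch (hypothetical
helper names, not the actual aset.c code): the request gets rounded up
to the next power of two, and a 16-byte chunk header rides on top.

```c
#include <stddef.h>

/* Round a request up to the next power of two, as described above
 * for requests below the large-chunk threshold. */
static size_t round_up_pow2(size_t request)
{
    size_t chunk = 8;           /* assumed smallest chunk size */

    while (chunk < request)
        chunk <<= 1;
    return chunk;
}

/* Total space consumed by a request, assuming a 16-byte chunk
 * header (two pointers on a 64-bit machine). */
static size_t alloc_footprint(size_t request)
{
    return round_up_pow2(request) + 16;
}
```

For a 100-byte tuple this gives 128 + 16 = 144 bytes, i.e. 44 bytes of
overhead per 100 bytes requested - the 44% figure above.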

But the question of how efficient palloc, or a hypothetical
shared_palloc() is, is really separate from whether or not we ought to
have one. All allocators have some overhead, and people use them
anyway because it drastically simplifies programming.

>> And suppose we want to do a parallel external sort with, say, the
>> worker backend pushing tuples into a heap stored in the dynamic shared
>> memory segment and other backends writing them out to tapes and
>> removing them from the heap. You can't do that without some kind of
>> space management, i.e. an allocator.
>
> Hm. I'd have thought one would implement that by pushing fixed size
> amounts of tuples to individual workers, let them sort each chunk in
> memory and then write the result out to disk.

I haven't really investigated external sort algorithms at this point,
so it is possible that you are right. However, that sounds like the
equivalent of constructing runs by loading a batch of tuples,
quicksorting them, writing them out, and then moving on to the next
batch, a modification to tuplesort.c which I have tried and found to
be significantly inferior to the present algorithm. As Tom has been
at pains to point out, our current algorithm produces much longer
runs, especially if the input data has some intrinsic ordering, which
is not an uncommon situation. So while I don't know for sure, I think
it's probably unwise to assume that the algorithm we pick won't
require the ability to remove tuples from the dsm incrementally, and I
don't want the choice of algorithm to be constrained by overly-rigid
architectural decisions from the outset.

> So, for a) what if we, instead of storing plain offsets have something like:
> typedef struct sptr
> {
> Offset sptr_self;
> Offset sptr_offset;
> } sptr;

I like the name Offset, but I'm firmly convinced that a relative
pointer needs to be the same size as a regular pointer. 8 bytes for a
pointer is already painfully large in some contexts; 16 is just too
much.
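
As a minimal illustration of that size constraint (names hypothetical):
a relative pointer stored as a bare offset from the segment base costs
exactly one native pointer's worth of space on typical 64-bit
platforms, at the price of carrying the base address around separately.

```c
#include <stddef.h>

/* A relative pointer as a bare offset: same size as void * on
 * typical 64-bit platforms. */
typedef size_t relptr_off;

static relptr_off relptr_store(const void *base, const void *ptr)
{
    return (size_t) ((const char *) ptr - (const char *) base);
}

static void *relptr_fetch(void *base, relptr_off off)
{
    return (char *) base + off;
}

/* Round-trips a pointer into a toy segment. */
static int relptr_roundtrip_ok(void)
{
    static char seg[64];
    void *target = seg + 10;

    return relptr_fetch(seg, relptr_store(seg, target)) == target;
}
```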

>> And then I thought, boy, it sucks
>> not to be able to declare what kind of a thing we're pointing *at*
>> here, but apart from using C++ I see no solution to that problem. I
>> guess we could do something like this:
>>
>> #define relptr(type) Size
>>
>> So the compiler wouldn't enforce anything, but at least notationally
>> we'd know what sort of object we were supposedly referencing.
>
> There might be some ugly compiler dependent magic we could do. Depending
> on how we decide to declare offsets. Like (very, very roughly)
>
> #define relptr(type, struct_name, varname) union struct_name##_##varname{ \
> type relptr_type; \
> Offset relptr_off; \
> }
>
> And then, for accessing have:
> #define relptr_access(seg, off) \
> typeof(off.relptr_type)* (((char *)seg->base_address) + off.relptr_off)
>
> But boy, that's ugly.

It's clever, though, and probably worth further experimentation.
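
For what it's worth, the rough sketch above needs a couple of
mechanical fixes before it compiles (a line-continuation backslash, and
parentheses to make the typeof a proper cast). A compilable variant of
the same idea, assuming __typeof__ (GCC/Clang) and with the union
member slightly restructured, might look like:

```c
#include <stddef.h>

typedef size_t Offset;

/* The pointer member is never dereferenced; it exists only so
 * __typeof__ can recover the target type at the access site. */
#define relptr(type) union { type *relptr_type; Offset relptr_off; }

#define relptr_access(base, rp) \
    ((__typeof__((rp).relptr_type)) ((char *) (base) + (rp).relptr_off))

static int relptr_demo(void)
{
    static long seg_storage[16];        /* aligned backing store */
    char *seg = (char *) seg_storage;
    relptr(int) rp;
    int *p;

    rp.relptr_off = 16;                 /* the int lives at offset 16 */
    p = relptr_access(seg, rp);
    *p = 42;
    return *p;
}
```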

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-05 15:43:14
Message-ID: 20131205154314.GI14419@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-12-05 10:17:18 -0500, Robert Haas wrote:
> On Thu, Dec 5, 2013 at 9:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> Why? Lots of people have written lots of programs that do just that.
> >
> > Well, but we're a database, not a generic programming library ;)
>
> I think we're arguably both.

Fair enough.

> I don't know if you've noticed, but the overhead of palloc is pretty
> much insane for large memory contexts right now.

Yea, I have more than noticed. I actually still have a partial
replacement of aset.c lying around. It didn't even have 16 bytes of
overhead, although that required some ugly hacks to continue allowing
several parallel context implementations.

Btw, it's even worse if you have lots of allocations above
ALLOC_CHUNK_LIMIT: we walk a linear list on reset, and we reset damn
frequently - partially because we think it's cheap.

> Consider a bunch of
> 100-byte tuples that you want to sort, on a 64-bit machine. IIUC,
> we're going to round the request up to 128 bytes and then add 16 bytes
> of overhead for 2 pointers before the beginning of the chunk.

I think we add the 16 bytes of overhead before rounding up, but I
might misremember.

> That's
> 44% overhead. For small memory contexts like the ones that store
> syscache entries it's really hard to do any better, but there are much
> better algorithms (which I am investigating) for cases where you
> expect to allocate a large amount of memory.

What I was working on was a slab allocator where, in addition to
default sizes like the current ones, you could create specific slabs
for certain sizes one knew would frequently be used, which then would
a) be faster to allocate, b) have lower space overhead, and c) be
guaranteed not to cause fragmentation.
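
A toy, single-block sketch of such a slab (hypothetical code, not the
actual work in progress): chunks of one fixed size with a free list
threaded through the free chunks themselves, so allocation is a pointer
pop with zero per-chunk overhead and no fragmentation within the slab.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct Slab
{
    size_t chunk_size;      /* fixed size served by this slab */
    void  *free_list;       /* singly linked list threaded through free chunks */
    char  *block;           /* single backing block */
} Slab;

static Slab *slab_create(size_t chunk_size, size_t nchunks)
{
    Slab   *s = malloc(sizeof(Slab));
    size_t  i;

    /* A chunk must at least hold the free-list link. */
    s->chunk_size = chunk_size < sizeof(void *) ? sizeof(void *) : chunk_size;
    s->block = malloc(s->chunk_size * nchunks);
    s->free_list = NULL;
    for (i = 0; i < nchunks; i++)
    {
        void *c = s->block + i * s->chunk_size;

        *(void **) c = s->free_list;    /* push onto free list */
        s->free_list = c;
    }
    return s;
}

static void *slab_alloc(Slab *s)
{
    void *c = s->free_list;

    if (c != NULL)
        s->free_list = *(void **) c;    /* pop */
    return c;
}

static void slab_free(Slab *s, void *c)
{
    *(void **) c = s->free_list;
    s->free_list = c;
}

/* Exercises the slab: distinct chunks, and LIFO reuse after free. */
static int slab_demo_ok(void)
{
    Slab *s = slab_create(64, 4);
    void *a = slab_alloc(s);
    void *b = slab_alloc(s);

    if (a == NULL || b == NULL || a == b)
        return 0;
    slab_free(s, a);
    return slab_alloc(s) == a;
}
```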

While I agree that the memory consumption is one concern, I am more
worried about the time it eats up in many workloads.

> > So, for a) what if we, instead of storing plain offsets have something like:
> > typedef struct sptr
> > {
> > Offset sptr_self;
> > Offset sptr_offset;
> > } sptr;
>
> I like the name Offset, but I'm firmly convinced that a relative
> pointer needs to be the same size as a regular pointer. 8 bytes for a
> pointer is already painfully large in some contexts; 16 is just too
> much.

I don't find that all that convincing. In 90% of the cases only a few
pointers will be needed; for those, something like the above will be
helpful, allowing bounds checking and all. For the other 10% of cases,
we can still open-code stuff. Quite possibly we would want to avoid
using 8 bytes there anyway.
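
To illustrate the kind of checking such an sptr enables (helper names
hypothetical): because sptr_self records where the sptr itself lives
within the segment, an accessor can verify that the struct hasn't been
copied somewhere it shouldn't be.

```c
#include <stddef.h>

typedef size_t Offset;

/* Self-identifying relative pointer, as sketched upthread. */
typedef struct sptr
{
    Offset sptr_self;       /* offset of this sptr within the segment */
    Offset sptr_offset;     /* offset of the target within the segment */
} sptr;

static void sptr_set(char *base, sptr *p, void *target)
{
    p->sptr_self = (Offset) ((char *) p - base);
    p->sptr_offset = (Offset) ((char *) target - base);
}

/* Returns NULL when the sptr does not live where it claims to,
 * e.g. after a stray memcpy. */
static void *sptr_get(char *base, sptr *p)
{
    if (p->sptr_self != (Offset) ((char *) p - base))
        return NULL;
    return base + p->sptr_offset;
}

static int sptr_demo_ok(void)
{
    static long seg_storage[32];        /* aligned toy segment */
    char *seg = (char *) seg_storage;
    sptr *p = (sptr *) seg;             /* sptr stored at offset 0 */

    sptr_set(seg, p, seg + 64);
    return sptr_get(seg, p) == seg + 64;
}
```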

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-11 10:42:25
Message-ID: 6595D7FC-7EB4-4A8D-80EF-AF81766BF5C8@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec5, 2013, at 15:44 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> There might be some ugly compiler dependent magic we could do. Depending
> on how we decide to declare offsets. Like (very, very roughly)
>
> #define relptr(type, struct_name, varname) union struct_name##_##varname{ \
> type relptr_type; \
> Offset relptr_off; \
> }
>
> And then, for accessing have:
> #define relptr_access(seg, off) \
> typeof(off.relptr_type)* (((char *)seg->base_address) + off.relptr_off)
>
> But boy, that's ugly.

Well, ugliness we can live with, especially if it's less ugly than the
alternatives. But I'm afraid it's also unportable - typeof() is a GCC
extension, not a part of ANSI C, no?

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-11 10:47:37
Message-ID: 20131211104737.GA1721@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-12-11 11:42:25 +0100, Florian Pflug wrote:
> On Dec5, 2013, at 15:44 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > There might be some ugly compiler dependent magic we could do. Depending
> > on how we decide to declare offsets. Like (very, very roughly)
> >
> > #define relptr(type, struct_name, varname) union struct_name##_##varname{ \
> > type relptr_type; \
> > Offset relptr_off; \
> > }
> >
> > And then, for accessing have:
> > #define relptr_access(seg, off) \
> > typeof(off.relptr_type)* (((char *)seg->base_address) + off.relptr_off)
> >
> > But boy, that's ugly.
>
> Well, ugliness we can live with, especially if it's less ugly than the
> alternatives. But I'm afraid it's also unportable - typeof() is a GCC
> extension, not a part of ANSI C, no?

Yes (although there's C11 stuff to do the equivalent, AFAIR) - I was
thinking of only doing it for compilers we support that dark magic for
and falling back to returning a void* for the others. We'll probably
miss a cast or two required on !gcc that way, but it's still likely to
be less error prone.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-11 11:37:56
Message-ID: 981FAEAC-4AEB-4F39-B4F8-725D1BFA1D52@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Dec11, 2013, at 11:47 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-12-11 11:42:25 +0100, Florian Pflug wrote:
>> On Dec5, 2013, at 15:44 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> There might be some ugly compiler dependent magic we could do. Depending
>>> on how we decide to declare offsets. Like (very, very roughly)
>>>
>>> #define relptr(type, struct_name, varname) union struct_name##_##varname{ \
>>> type relptr_type; \
>>> Offset relptr_off; \
>>> }
>>>
>>> And then, for accessing have:
>>> #define relptr_access(seg, off) \
>>> typeof(off.relptr_type)* (((char *)seg->base_address) + off.relptr_off)
>>>
>>> But boy, that's ugly.
>>
>> Well, ugliness we can live with, especially if it's less ugly than the
>> alternatives. But I'm afraid it's also unportable - typeof() is a GCC
>> extension, not a part of ANSI C, no?
>
> Yes (although there's C11 stuff to do equivalent stuff afair) - I was
> thinking of only doing it for compilers we support that dark magic for
> and fall back to returning a void* for the others. We'll probably miss a
> cast or two required on !gcc that way, but it's still likely to be less
> error prone.

Would it? For this to catch type mismatches, you'd both need to develop
on a typeof-supporting compiler *and* not cast the result of
relptr_access(). But you can't really do that, because the code will
then fail on compilers which don't support typeof()...

What we could do, I guess, is to pass the type to relptr_access() and to
relptr(), and let the compiler verify that they are the same.

Something like

#define relptr(type) union { \
type relptr_type; \
Offset relptr_off; \
}

#define relptr_access(type, seg, rptr) \
(type)( \
(rptr.relptr_type - (type)0), \
((char*)seg->base_address) + rptr.relptr_off \
)

And, yes, ouch ;-)
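
For reference, a compilable rendering of that sketch, with 'type'
spelled as a pointer type such as int *: the pointer subtraction in the
comma expression only compiles when the union member really has that
type, so a mismatch is caught at compile time without typeof().

```c
#include <stddef.h>

typedef size_t Offset;

#define relptr(type) union { type relptr_type; Offset relptr_off; }

/* The (void) cast discards the type-check expression's value;
 * only the offset arithmetic survives at runtime. */
#define relptr_access(type, base, rptr) \
    ((type) ( \
        (void) ((rptr).relptr_type - (type) 0), \
        (char *) (base) + (rptr).relptr_off))

static int relptr_check_demo(void)
{
    static long seg_storage[16];        /* aligned toy segment */
    char *seg = (char *) seg_storage;
    relptr(int *) rp;
    int *p;

    rp.relptr_off = 8;
    /* relptr_access(long *, seg, rp) would fail to compile here. */
    p = relptr_access(int *, seg, rp);
    *p = 7;
    return *p;
}
```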

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: same-address mappings vs. relative pointers
Date: 2013-12-11 11:49:35
Message-ID: 20131211114935.GA24772@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-12-11 12:37:56 +0100, Florian Pflug wrote:
> On Dec11, 2013, at 11:47 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2013-12-11 11:42:25 +0100, Florian Pflug wrote:
> > Yes (although there's C11 stuff to do equivalent stuff afair) - I was
> > thinking of only doing it for compilers we support that dark magic for
> > and fall back to returning a void* for the others. We'll probably miss a
> > cast or two required on !gcc that way, but it's still likely to be less
> > error prone.
>
>
> Would it? For this to catch type mismatches, you'd both need to develop
> on a typeof-supporting compiler *and* not cast the result of relptr_access().
> But you can't really do that, because the code will then fail on compilers
> which don't support typeof()...

Yea, right.

> What we could do, I guess, is to pass the type to relptr_access() and to
> relptr(), and let the compiler verify that they are the same.

Tom and I actually added a macro that's helpful for that recently:
AssertVariableIsOfType(). With that we should be able to get something
reasonable, failing at compile time, with a useful error message even ;)
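
A rough sketch of how that can work (the real AssertVariableIsOfType
lives in PostgreSQL's c.h and has portability fallbacks; the name below
is hypothetical): on GCC-like compilers, a static assertion over
__builtin_types_compatible_p rejects a mismatched type at compile time,
with the stringified message as the error text.

```c
/* Compile-time type check in the spirit of AssertVariableIsOfType. */
#define AssertVariableIsOfTypeSketch(var, type) \
    _Static_assert(__builtin_types_compatible_p(__typeof__(var), type), \
                   #var " does not have type " #type)

static long type_checked_value(void)
{
    long off = 42;

    AssertVariableIsOfTypeSketch(off, long);    /* compiles */
    /* AssertVariableIsOfTypeSketch(off, int);  -- compile error */
    return off;
}
```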

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services