Re: POSIX question

Lists: pgsql-hackers
From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: POSIX question
Date: 2011-06-20 13:27:32
Message-ID: 86e178841732f74492e1c9a0811814fa@mail.softperience.eu

Hello,

I had an idea involving huge pages, and I read why PostgreSQL doesn't
support POSIX shared memory (it needs nattach). While reading about
POSIX/SysV I found this thread about dynamically chunked shared memory:

http://archives.postgresql.org/pgsql-hackers/2010-08/msg00586.php

While playing with mmap I worked out an approach for dealing with
growing files, so...

Maybe this approach could resolve both of the above problems (POSIX and
dynamic shared memory). Here is the idea:

1. mmap a large amount of anonymous virtual memory (this will be the
maximum size of shared memory).
2. Initialize a small SysV chunk for the shmem header (to keep the
"fallout" requirements).
3. SysV remap is Linux-specific, so unmap the first few VM pages of the
region from step 1 and attach the chunk from step 2 there.
3a. Lock the header when adding chunks (the first chunk is the header);
we don't want concurrent chunk allocation.
4. Allocate further chunks of shared memory (POSIX is the best way) and
record them in the shmem header: chunk id/name, whether the chunk is
POSIX or SysV, and any useful flags (hugepage?) needed for reattaching.
Attach those chunks inside the region from step 1.
4b. Release the lock taken in 3a.

Point 1 will not eat memory, as memory allocation is delayed, and on
64-bit platforms you may reserve quite a huge chunk this way. In the
future it may be possible to use mmap / munmap to concatenate chunks,
defragment them, etc.

mmap guarantees that mapping with MAP_FIXED over an already mapped area
will unmap the old mapping.

A working "preview" changeset applied to sysv_shmem.c may be quite
small.

If someone wants to "extend" memory, they may add a new chunk (of
course, to keep the header memory contiguous, the number of chunks is
limited).

What do you think about this?

Regards,
Radek


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 14:16:58
Message-ID: A42AB4B6-851C-491E-A43D-3EC5436F8D00@phlo.org

On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> 1. mmap a large amount of anonymous virtual memory (this will be the maximum size of shared memory).
> ...
> Point 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge chunk this way. In the future it may be possible to use mmap / munmap to concatenate chunks, defragment them, etc.

I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if there's an API for that...

best regards,
Florian Pflug


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 14:39:01
Message-ID: 201106201639.02157.rsmogura@softperience.eu

Florian Pflug <fgp(at)phlo(dot)org> Monday 20 of June 2011 16:16:58
> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > 1. mmap a large amount of anonymous virtual memory (this will be the
> > maximum size of shared memory). ...
> > Point 1 will not eat memory, as memory allocation is delayed, and on
> > 64-bit platforms you may reserve quite a huge chunk this way. In the
> > future it may be possible to use mmap / munmap to concatenate chunks,
> > defragment them, etc.
>
> I think this breaks with strict overcommit settings (i.e.
> vm.overcommit_memory = 2 on Linux). To fix that, you'd need a way to tell
> the kernel (or glibc) to simply reserve a chunk of virtual address space
> for further use. Not sure if there's an API for that...
>
> best regards,
> Florian Pflug

This may be achieved by many other means, like mmap of /dev/null.

Regards,
Radek


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 14:57:40
Message-ID: 82ei2oczyz.fsf@mid.bfk.de

* Florian Pflug:

> I think this breaks with strict overcommit settings
> (i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a
> way to tell the kernel (or glibc) to simply reserve a chunk of virtual
> address space for further use. Not sure if there's an API for that...

mmap with PROT_NONE and subsequent update with mprotect does this on
Linux.

(It's not clear to me what this is trying to solve, though.)

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 15:01:40
Message-ID: 91DB9E5D-29D6-40E8-9C32-FA3EAACE2B3E@phlo.org

On Jun20, 2011, at 16:39 , Radosław Smogura wrote:
> This may be achieved by many other means, like mmap of /dev/null.

Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?

Or at least this is what I always thought glibc does when you malloc()
a large block at once. (This allows it to actually return the memory
to the kernel once you free() it, which isn't possible if the memory
was allocated simply by extending the heap).

You can work around this by mmap()ing an actual file, because then
the kernel knows it can use the file as backing store and thus doesn't
need to reserve actual physical memory. (In a way, this just adds
additional swap space). Doesn't seem very clean though...
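
A sketch of the file-backed variant in C, assuming a caller-supplied
path; map_file_backed is a hypothetical helper name. The file is sized
with ftruncate, so on most filesystems it starts out sparse and disk
blocks are only allocated as pages are written back.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map `size` bytes of the file at `path`, creating it if needed.
 * MAP_SHARED makes the file itself the backing store for the pages.
 */
static void *map_file_backed(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t) 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```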

Even if there's a way to work around a strict overcommit setting, unless
the workaround is a syscall *explicitly* designed for that purpose, I'd
be very careful with using it. You might just as well be exploiting a
bug in the overcommit accounting logic and future kernel versions may
simply choose to fix the bug...

best regards,
Florian Pflug


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 15:05:48
Message-ID: 201106201705.48260.rsmogura@softperience.eu

Florian Pflug <fgp(at)phlo(dot)org> Monday 20 of June 2011 17:01:40
> Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
>
> Or at least this is what I always thought glibc does when you malloc()
> a large block at once. (This allows it to actually return the memory
> to the kernel once you free() it, which isn't possible if the memory
> was allocated simply by extending the heap).
>
> You can work around this by mmap()ing an actual file, because then
> the kernel knows it can use the file as backing store and thus doesn't
> need to reserve actual physical memory. (In a way, this just adds
> additional swap space). Doesn't seem very clean though...
>
> Even if there's a way to work around a strict overcommit setting, unless
> the workaround is a syscall *explicitly* designed for that purpose, I'd
> be very careful with using it. You might just as well be exploiting a
> bug in the overcommit accounting logic and future kernel versions may
> simply choose to fix the bug...
>
> best regards,
> Florian Pflug

I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
about 100GB of memory.

Regards,
Radek


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 15:07:55
Message-ID: 01800D55-D7F2-452D-B49F-3EF8A4BBF566@phlo.org

On Jun20, 2011, at 17:05 , Radosław Smogura wrote:
> I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
> about 100GB of memory.

You need to set vm.overcommit_memory to "2" to see the difference. Did
you do that?

You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory"
or by editing /etc/sysctl.conf and issuing "sysctl -p".
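
To confirm from inside a test program which mode is actually in effect,
the same procfs file can be read back. A small sketch in C;
read_overcommit_mode is a hypothetical helper name.

```c
#include <stdio.h>

/*
 * Returns the current vm.overcommit_memory value (0, 1 or 2) as read
 * from procfs, or -1 if the file cannot be read or parsed, for example
 * on a non-Linux system.
 */
static int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```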

best regards,
Florian Pflug


From: Roger Leigh <rleigh(at)codelibre(dot)net>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POSIX question
Date: 2011-06-20 15:08:46
Message-ID: 20110620150846.GO6333@codelibre.net

On Mon, Jun 20, 2011 at 04:16:58PM +0200, Florian Pflug wrote:
> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > 1. mmap a large amount of anonymous virtual memory (this will be the maximum size of shared memory).
> > ...
> > Point 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge chunk this way. In the future it may be possible to use mmap / munmap to concatenate chunks, defragment them, etc.
>
> I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if there's an API for that...

I run diskless, swapless cluster systems with zero overcommit (i.e.
it's entirely disabled), which means that all operations are
strict success/fail due to allocation being immediate. mmap of a
large amount of anonymous memory would almost certainly fail on
such a setup--you definitely can't assume that a large anonymous
mmap will always succeed, since there is no delayed allocation.

[we do in reality have a small overcommit allowance to permit
efficient fork(2), but it's tiny and (in this context) irrelevant]

Regards,
Roger

--
.''`. Roger Leigh
: :' : Debian GNU/Linux http://people.debian.org/~rleigh/
`. `' Printing on GNU/Linux? http://gutenprint.sourceforge.net/
`- GPG Public Key: 0x25BFB848 Please GPG sign your mail.


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, Florian Pflug <fgp(at)phlo(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 15:09:16
Message-ID: 201106201709.16894.andres@anarazel.de

On Monday, June 20, 2011 17:05:48 Radosław Smogura wrote:
> I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
> about 100GB of memory.
The default setting is to allow overcommit.

Andres


From: Greg Stark <stark(at)mit(dot)edu>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Radosław Smogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Markus Wanner <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 15:11:14
Message-ID: BANLkTi=E_am4oWhHM3oAzHwrO0Fz6SLrnU-YK5wm3Ld9MOGu7g@mail.gmail.com

On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
>
> Or at least this is what I always thought glibc does when you malloc()

It mmaps /dev/zero actually.

--
greg


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, "Markus Wanner" <markus(at)bluegap(dot)ch>
Subject: Re: POSIX question
Date: 2011-06-20 15:30:31
Message-ID: 201106201730.31648.rsmogura@softperience.eu

Florian Pflug <fgp(at)phlo(dot)org> Monday 20 of June 2011 17:07:55
> On Jun20, 2011, at 17:05 , Radosław Smogura wrote:
> > I'm 99% sure. When I was "playing" with mmap I preallocated,
> > probably, about 100GB of memory.
>
> You need to set vm.overcommit_memory to "2" to see the difference. Did
> you do that?
>
> You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory"
> or by editing /etc/sysctl.conf and issuing "sysctl -p".
>
> best regards,
> Florian Pflug
I've just created a 127TB mapping on Linux, the maximum allowed by the
VM. Trying overcommit with 0, 1 and 2.

Regards,
Radek


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: POSIX question
Date: 2011-06-20 15:36:42
Message-ID: 201106201736.42585.andres@anarazel.de

On Monday, June 20, 2011 17:11:14 Greg Stark wrote:
> On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> > Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
> >
> > Or at least this is what I always thought glibc does when you malloc()
>
> It mmaps /dev/zero actually.
As the nitpicking has already started: AFAIR it just passes -1 as the fd
and uses the MAP_ANONYMOUS flag argument ;)

Andres


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POSIX question
Date: 2011-06-26 14:12:53
Message-ID: 4E073E65.8080805@bluegap.ch

Radek,

On 06/20/2011 03:27 PM, Radosław Smogura wrote:
> While playing with mmap I worked out an approach for dealing with
> growing files, so...

Your approach seems to require a SysV alloc (for nattach) as well as
POSIX shmem and/or mmap. Adding requirements for these syscalls
certainly needs to give a good benefit for Postgres, as they presumably
pose portability issues.

> 3a. Lock the header when adding chunks (the first chunk is the header);
> we don't want concurrent chunk allocation.

Sure we don't? There are at least a dozen memory allocators for
multi-threaded applications, all trying to optimize for concurrency.
The programmer of a multi-threaded application doesn't need to care much
about concurrent allocations. He can allocate (and free) quite a lot of
tiny chunks concurrently from shared memory.

Regards

Markus Wanner