Re: Some interesting news about Linux 3.12 OOM

Lists: pgsql-hackers
From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 02:09:18
Message-ID: CAAZKuFYMh_n4BL08rBnghex=7Lt29xk2dF98rXeX0WBWep5pug@mail.gmail.com

I'm not sure how many of you have been tracking this but courtesy of
lwn.net I have learned that it seems that the OOM killer behavior in
Linux 3.12 will be significantly different. And by description, it
sounds like an improvement. I thought some people reading -hackers
might be interested.

Based on the description at lwn, excerpted below, it sounds like the
news might be that systems with overcommit on might return OOM when a
non-outlandish request for memory is made from the kernel.

"""
Johannes Weiner has posted a set of patches aimed at improving this
situation. Following a bunch of cleanup work, these patches make two
fundamental changes to how OOM conditions are handled in the kernel.
The first of those is perhaps the most visible: it causes the kernel
to avoid calling the OOM killer altogether for most memory allocation
failures. In particular, if the allocation is being made in response
to a system call, the kernel will just cause the system call to fail
with an ENOMEM error rather than trying to find a process to kill. That
may cause system call failures to happen more often and in different
contexts than they used to. But, naturally, that will not be a problem
since all user-space code diligently checks the return status of every
system call and responds with well-tested error-handling code when
things go wrong.
"""

Subject to experiment, this may be some good news, as many programs,
libraries, and runtime environments that may run parallel to Postgres
on a machine are pretty lackadaisical about limiting the amount of
virtual memory charged to them, and overcommit off is somewhat
punishing in those situations if one really needed a large hash table
from Postgres or whatever. I've seen some cases here where a good
amount of VM has been reserved and caused apparent memory pressure
that cut throughput short of what ought to be possible.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 14:12:24
Message-ID: CA+TgmoY8hJ98Y-_hG2cEyX5Z5S1LicEx6RVS1ei+f5Ae-qRPkg@mail.gmail.com

On Wed, Sep 18, 2013 at 10:09 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
> I'm not sure how many of you have been tracking this but courtesy of
> lwn.net I have learned that it seems that the OOM killer behavior in
> Linux 3.12 will be significantly different. And by description, it
> sounds like an improvement. I thought some people reading -hackers
> might be interested.
>
> Based on the description at lwn, excerpted below, it sounds like the
> news might be that systems with overcommit on might return OOM when a
> non-outlandish request for memory is made from the kernel.
>
> """
> Johannes Weiner has posted a set of patches aimed at improving this
> situation. Following a bunch of cleanup work, these patches make two
> fundamental changes to how OOM conditions are handled in the kernel.
> The first of those is perhaps the most visible: it causes the kernel
> to avoid calling the OOM killer altogether for most memory allocation
> failures. In particular, if the allocation is being made in response
> to a system call, the kernel will just cause the system call to fail
> with an ENOMEM error rather than trying to find a process to kill. That
> may cause system call failures to happen more often and in different
> contexts than they used to. But, naturally, that will not be a problem
> since all user-space code diligently checks the return status of every
> system call and responds with well-tested error-handling code when
> things go wrong.
> """
>
> Subject to experiment, this may be some good news, as many programs,
> libraries, and runtime environments that may run parallel to Postgres
> on a machine are pretty lackadaisical about limiting the amount of
> virtual memory charged to them, and overcommit off is somewhat
> punishing in those situations if one really needed a large hash table
> from Postgres or whatever. I've seen some cases here where a good
> amount of VM has been reserved and caused apparent memory pressure
> that cut throughput short of what ought to be possible.

Yes, that does sound good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 15:30:41
Message-ID: CAHyXU0xzLmyadd2AUevw3Cg2EArvcc5nSau_iNw0L-dmV4OsbQ@mail.gmail.com

On Thu, Sep 19, 2013 at 9:12 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> But, naturally, that will not be a problem
>> since all user-space code diligently checks the return status of every
>> system call and responds with well-tested error-handling code when
>> things go wrong.

That just short circuited my sarcasm detector.

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 15:49:05
Message-ID: CA+TgmoZSUhv6L5y21cYvY1giqtZKvCp_vF6MOKE2Q0ksGxvAvw@mail.gmail.com

On Thu, Sep 19, 2013 at 11:30 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Sep 19, 2013 at 9:12 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> But, naturally, that will not be a problem
>>> since all user-space code diligently checks the return status of every
>>> system call and responds with well-tested error-handling code when
>>> things go wrong.
>
> That just short circuited my sarcasm detector.

I laughed, too, but the reality is that at least as far as PG is
concerned it's probably a truthful statement, and if it isn't, nobody
here is likely to complain about having to fix it. Yeah, there's a
lot of other code out there not as well written or maintained as PG,
but using SIGKILL as a substitute for ENOMEM because people might not
be checking the return value for malloc() is extremely heavy-handed
nannyism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 16:02:21
Message-ID: 20130919160221.GL8288@awork2.anarazel.de

On 2013-09-19 11:49:05 -0400, Robert Haas wrote:
> On Thu, Sep 19, 2013 at 11:30 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> > On Thu, Sep 19, 2013 at 9:12 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>> But, naturally, that will not be a problem
> >>> since all user-space code diligently checks the return status of every
> >>> system call and responds with well-tested error-handling code when
> >>> things go wrong.
> >
> > That just short circuited my sarcasm detector.
>
> I laughed, too, but the reality is that at least as far as PG is
> concerned it's probably a truthful statement, and if it isn't, nobody
> here is likely to complain about having to fix it. Yeah, there's a
> lot of other code out there not as well written or maintained as PG,
> but using SIGKILL as a substitute for ENOMEM because people might not
> be checking the return value for malloc() is extremely heavy-handed
> nannyism.

The "problem" is that it's not just about malloc() (aka brk() and
mmap()) and friends. It's about many of the other systemcalls. Like
e.g. send() to name one of the more likely ones.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 16:08:54
Message-ID: CA+TgmoZwpOk+PpCqXykWE=gWF8uBhc9aG4eyMwNYE1-c7PEW2A@mail.gmail.com

On Thu, Sep 19, 2013 at 12:02 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> The "problem" is that it's not just about malloc() (aka brk() and
> mmap()) and friends. It's about many of the other systemcalls. Like
> e.g. send() to name one of the more likely ones.

*shrug*

If you're calling send() and not testing for a -1 return value,
you're writing amazingly bad code anyway. And if you ARE testing for
-1, you'll probably do something at least mildly sensible with a
not-specifically-foreseen errno value, like print a message that
includes %m. That's about what we'd probably do and, I have to
imagine, what most people would do.
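
A minimal sketch of that pattern in C. The helper name send_or_report
is hypothetical and this is not PostgreSQL code; it uses strerror(errno)
as a portable stand-in for %m. The point is only that a generic error
path reports an unanticipated ENOMEM as sensibly as any other errno:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (an assumption for illustration only).
 * Sends len bytes on fd, retrying on EINTR and on short sends.
 * Returns 0 on success. On any other failure, including an errno
 * the caller did not specifically foresee such as ENOMEM, it logs
 * the error text and returns -1 with errno preserved.
 */
int send_or_report(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            int saved = errno;  /* fprintf may overwrite errno */
            fprintf(stderr, "could not send data: %s\n", strerror(saved));
            errno = saved;
            return -1;
        }
        p += n;                 /* advance past the bytes already sent */
        len -= (size_t) n;
    }
    return 0;
}
```

A caller only has to check for -1; it does not need a special case for
each errno value the kernel might start returning.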

I'm not saying it won't break anything to return a proper error code;
I'm just saying that sending SIGKILL is worse.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 16:23:07
Message-ID: m2li2szspg.fsf@2ndQuadrant.fr

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I laughed, too, but the reality is that at least as far as PG is
> concerned it's probably a truthful statement, and if it isn't, nobody
> here is likely to complain about having to fix it. Yeah, there's a
> lot of other code out there not as well written or maintained as PG,
> but using SIGKILL as a substitute for ENOMEM because people might not
> be checking the return value for malloc() is extremely heavy-handed
> nannyism.

I've been told on several occasions that this was done for the JVM
and other such programs that want to allocate huge amounts of memory even
if they don't really intend to use it.

Back in the day that amount could well be greater than the actual amount
of physical memory available. So the only way to allow Java applications
on Linux was, as I've been told, to implement OOM. And as the target was
the desktop, well, have it turned on by default.

Now, I liked that story enough to never actually try and check about it,
so if someone knows for real why the Linux kernel appears so stupid in its
choice of implementing OOM and turning it on by default…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 16:27:09
Message-ID: 20130919162709.GC15812@awork2.anarazel.de

On 2013-09-19 18:23:07 +0200, Dimitri Fontaine wrote:
> I've been told on several occasions that this was done for the JVM
> and other such programs that want to allocate huge amounts of memory even
> if they don't really intend to use it.

That's not really related - what you describe is memory overcommitting
(which has lots of uses besides JVMs). That's not removed by the changes
referenced upthread.
What has changed is how to react to situations where memory has been
overcommitted but is now actually needed.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-19 16:29:38
Message-ID: m2eh8kzsel.fsf@2ndQuadrant.fr

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> What has changed is how to react to situations where memory has been
> overcommitted but is now actually needed.

Sure. You either have a failure at malloc() or at usage. Overcommit is all
about never failing at malloc(), but then you have to deal with OOM
conditions in creative ways, like with the OOM killer.
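
Which of the two failure points applies depends on the
vm.overcommit_memory sysctl. A small sketch in C that reads the mode
and describes it; the function names are hypothetical, and the only
real interface assumed is the /proc/sys/vm/overcommit_memory file with
its documented values 0, 1 and 2:

```c
#include <stdio.h>

/* Map a vm.overcommit_memory value to where allocation failures show up. */
const char *describe_overcommit(int mode)
{
    switch (mode) {
    case 0:
        return "heuristic: only obviously excessive requests fail up front";
    case 1:
        return "always overcommit: requests succeed, failure comes at usage";
    case 2:
        return "strict accounting: requests over the commit limit fail up front";
    default:
        return "unknown mode";
    }
}

/* Read the mode from a procfs-style file; returns -1 if it cannot be read. */
int read_overcommit_mode(const char *path)
{
    FILE *f = fopen(path, "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;              /* file did not start with an integer */
    fclose(f);
    return mode;
}
```

On Linux, describe_overcommit(read_overcommit_mode("/proc/sys/vm/overcommit_memory"))
reports the current setting.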

Anyways,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-24 17:07:34
Message-ID: 5241C6D6.5040709@agliodbs.com

All,

I've sent kernel.org a message that we're keen on seeing these changes
get committed.

BTW, in the future if anyone sees kernel.org contemplating a patch which
helps or hurts Postgres, don't hesitate to speak up to them. They don't
get nearly enough feedback from DB developers.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-24 17:11:18
Message-ID: 5241C7B6.9050906@agliodbs.com

All,

I've sent kernel.org a message that we're keen on seeing these changes
become committed.

BTW, in the future if anyone sees kernel.org contemplating a patch which
helps or hurts Postgres, don't hesitate to speak up to them. They don't
get nearly enough feedback from DB developers.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-24 23:15:29
Message-ID: CAAZKuFZPMkxDRwDsUUFOaP0cbeCMeD9mbEkTx2bt8gsX-Oietw@mail.gmail.com

On Sep 24, 2013 10:12 AM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
>
> All,
>
> I've sent kernel.org a message that we're keen on seeing these changes
> become committed.

I thought it was merged already in 3.12. There are a few related
patches, but here's one:

commit 519e52473ebe9db5cdef44670d5a97f1fd53d721
Author: Johannes Weiner <hannes(at)cmpxchg(dot)org>
Date: Thu Sep 12 15:13:42 2013 -0700

mm: memcg: enable memcg OOM killer only for user faults

System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really the
only option available.

Signed-off-by: Johannes Weiner <hannes(at)cmpxchg(dot)org>
Acked-by: Michal Hocko <mhocko(at)suse(dot)cz>
Cc: David Rientjes <rientjes(at)google(dot)com>
Cc: KAMEZAWA Hiroyuki <kamezawa(dot)hiroyu(at)jp(dot)fujitsu(dot)com>
Cc: azurIt <azurit(at)pobox(dot)sk>
Cc: KOSAKI Motohiro <kosaki(dot)motohiro(at)jp(dot)fujitsu(dot)com>
Signed-off-by: Andrew Morton <akpm(at)linux-foundation(dot)org>
Signed-off-by: Linus Torvalds <torvalds(at)linux-foundation(dot)org>

$ git tag --contains 519e52473ebe9db5cdef44670d5a97f1fd53d721
v3.12-rc1
v3.12-rc2

Searching for recent work by Johannes Weiner shows the pertinent stuff
more exhaustively.

> BTW, in the future if anyone sees kernel.org contemplating a patch which
> helps or hurts Postgres, don't hesitate to speak up to them. They don't
> get nearly enough feedback from DB developers.

I don't hesitate; most of the time I simply don't know.


From: Greg Stark <stark(at)mit(dot)edu>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-25 15:00:55
Message-ID: CAM-w4HN=Z-hBEz8GcvQ1F-MWLnMCyRGMSrJBbNGGO4PsQo9Hew@mail.gmail.com

On Wed, Sep 25, 2013 at 12:15 AM, Daniel Farina <daniel(at)heroku(dot)com> wrote:

> Enable the memcg OOM killer only for user faults, where it's really the
> only option available.
>

Is this really a big deal? I would expect most faults to be user faults.

It's certainly a big deal that we need to ensure we can handle ENOMEM from
syscalls and library functions we weren't expecting to return it. But I
don't expect it to actually reduce the OOM killing sprees by much.

--
greg


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some interesting news about Linux 3.12 OOM
Date: 2013-09-27 07:07:31
Message-ID: CAAZKuFauTjSOS+xQKDfW47yp2-iThJfi5mLxDWc8UYydJpB4mw@mail.gmail.com

On Wed, Sep 25, 2013 at 8:00 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>
> On Wed, Sep 25, 2013 at 12:15 AM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
>>
>> Enable the memcg OOM killer only for user faults, where it's really the
>> only option available.
>
>
> Is this really a big deal? I would expect most faults to be user faults.
>
> It's certainly a big deal that we need to ensure we can handle ENOMEM from
> syscalls and library functions we weren't expecting to return it. But I
> don't expect it to actually reduce the OOM killing sprees by much.

Hmm, I see what you mean. I have been reading through the mechanism:
I got too excited about 'allocations by system calls', because I
thought that might mean brk and friends, except that's not much of an
allocation at all, just reservation. I think.

There is some interesting stuff coming in along with these patches in
bringing the user-space memcg OOM handlers up to snuff that may make
it profitable to issue SIGTERM to backends when a safety margin is
crossed (too bad the error messages will be confusing in that case).
I was rather hoping that a regular ENOMEM could be injected by this
mechanism the next time a syscall is touched (unknown), but I'm not
confident whether this is made easier or not. One
could imagine the kernel injecting such a fault when the amount of
memory being consumed starts to look hairy, but I surmise part of the
impetus for userspace handling of that is to avoid getting into that
particular heuristics game.
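
A rough sketch in C of the kind of user-space safety-margin check
described above. Everything here is an assumption for illustration: the
function names are hypothetical, the 10% margin in the comment is
arbitrary, and the usage and limit numbers are taken as plain arguments
(with cgroup v1 memcg they could plausibly come from
memory.usage_in_bytes and memory.limit_in_bytes). It is not how
Postgres or the kernel patches do anything:

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical policy: true once usage has reached
 * limit * (1 - margin), e.g. margin = 0.10 trips at 90% of the limit.
 * Nonsensical inputs never trip the check.
 */
bool over_safety_margin(unsigned long long usage,
                        unsigned long long limit,
                        double margin)
{
    if (limit == 0 || margin < 0.0 || margin >= 1.0)
        return false;
    return (double) usage >= (double) limit * (1.0 - margin);
}

/*
 * Ask one process to exit cleanly if the margin has been crossed.
 * Returns 0 if no signal was needed, 1 if SIGTERM was sent,
 * and -1 if kill() failed (errno is set by kill()).
 */
int maybe_terminate(pid_t pid,
                    unsigned long long usage,
                    unsigned long long limit,
                    double margin)
{
    if (!over_safety_margin(usage, limit, margin))
        return 0;
    return kill(pid, SIGTERM) == 0 ? 1 : -1;
}
```

Choosing which backend to signal, and how often to sample usage, is the
heuristics game mentioned above and is deliberately left out.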

Anyway, I did do some extensive study of cgroups and memcg's
implementation in particular and found it not really practical for
Postgres use unless one was happy with lots and lots of database
restarts, and this work still gives me some hope to try again, even if
smaller modifications still seem necessary.