NUMA packaging and patch

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: NUMA packaging and patch
Date: 2014-06-08 22:45:01
Message-ID: 1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com
Lists: pgsql-hackers

I ran into a situation where a machine with 4 NUMA memory nodes and
40 cores had performance problems due to NUMA.  The problems were
worst right after they rebooted the OS and warmed the cache by
running a script of queries to read all tables.  These were all run
on a single connection.  As it turned out, the size of the database
was just over one-quarter of the size of RAM, and with default NUMA
policies both the OS cache for the database and the PostgreSQL
shared memory allocation were placed on a single NUMA segment, so
access to the CPU package managing that segment became a
bottleneck.  On top of that, processes which happened to run on the
CPU package which had all the cached data had to allocate memory
for local use on more distant memory because there was none left in
the more local memory.

Through normal operations, things eventually tended to shift around
and get better (after several hours of heavy use with substandard
performance).  I ran some benchmarks and found that even in
long-running tests, spreading these allocations among the memory
segments showed about a 2% benefit in a read-only load.  The
biggest difference I saw in a long-running read-write load was
about a 20% hit for unbalanced allocations, but I only saw that
once.  I talked to someone at PGCon who managed to engineer much
worse performance hits for an unbalanced load, although the
circumstances were fairly artificial.  Still, fixing this seems
like something worth doing if further benchmarks confirm benefits
at this level.

By default, the OS cache and buffers are allocated in the memory
node with the shortest "distance" from the CPU a process is running
on.  This is determined by the "cpuset" associated with the
process which reads or writes the disk page.  Typically a NUMA
machine starts with a single cpuset with a policy specifying this
behavior.  Fixing this aspect of things seems like an issue for
packagers, although we should probably document it for those
running from their own source builds.

To set an alternate policy for PostgreSQL, you first need to find
or create the location for cpuset specification, which uses a
filesystem in a way similar to the /proc directory.  On a machine
with more than one memory node, the appropriate filesystem is
probably already mounted, although different distributions use
different filesystem names and mount locations.  I will illustrate
the process on my Ubuntu machine.  Even though it has only one
memory node (and so, this makes no difference), I have it handy at
the moment to confirm the commands as I put them into the email.

# Sysadmin must create the root cpuset if not already done.  (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.

sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset

# Sysadmin must create a cpuset for postgres and configure
# resources.  This will normally be all cores and all RAM.  This is
# where we specify that this cpuset will spread pages among its
# memory nodes.

sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"

# Sysadmin must grant permissions to the desired setting(s).
# This could be by user or group.

sudo chown postgres /dev/cpuset/postgres/tasks

# The pid of postmaster or an ancestor process must be written to
# the tasks "file" of the cpuset.  This can be a shell from which
# pg_ctl is run, at least for bash shells.  It could also be
# written by the postmaster itself, essentially as an extra pid
# file.  Possible snippet from a service script:

echo $$ >/dev/cpuset/postgres/tasks
pg_ctl start ...
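
# To sanity-check the setup, you can read back the flag and confirm
# the postmaster landed in the cpuset.  (A sketch, assuming $PGDATA
# points at the data directory; the first line of postmaster.pid is
# the postmaster's pid, and the tasks "file" lists one pid per line.)

cat /dev/cpuset/postgres/memory_spread_page   # expect: 1
grep -x "$(head -1 "$PGDATA"/postmaster.pid)" /dev/cpuset/postgres/tasks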

Where the OS cache is larger than shared_buffers, the above is
probably more important than the attached patch, which causes the
main shared memory segment to be spread among all available memory
nodes.  This patch only compiles in the relevant code if configure
is run using the --with-libnuma option, in which case a dependency
on the numa library is created.  It is v3 to avoid confusion with
earlier versions I have shared with a few people off-list.  (The
only difference from v2 is fixing bitrot.)

I'll add it to the next CF.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: numa-interleave-shared-buffers-v3.diff (text/x-diff, 5.6 KB)

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-09 14:49:03
Message-ID: CAHyXU0x-t9x46baGqoV4Bm=PAgzkWg2-xHng73kX7v1YvDdkxQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> I ran into a situation where a machine with 4 NUMA memory nodes and
> 40 cores had performance problems due to NUMA. The problems were
> worst right after they rebooted the OS and warmed the cache by
> running a script of queries to read all tables. These were all run
> on a single connection. As it turned out, the size of the database
> was just over one-quarter of the size of RAM, and with default NUMA
> policies both the OS cache for the database and the PostgreSQL
> shared memory allocation were placed on a single NUMA segment, so
> access to the CPU package managing that segment became a
> bottleneck. On top of that, processes which happened to run on the
> CPU package which had all the cached data had to allocate memory
> for local use on more distant memory because there was none left in
> the more local memory.
>
> Through normal operations, things eventually tended to shift around
> and get better (after several hours of heavy use with substandard
> performance). I ran some benchmarks and found that even in
> long-running tests, spreading these allocations among the memory
> segments showed about a 2% benefit in a read-only load. The
> biggest difference I saw in a long-running read-write load was
> about a 20% hit for unbalanced allocations, but I only saw that
> once. I talked to someone at PGCon who managed to engineer much
> worse performance hits for an unbalanced load, although the
> circumstances were fairly artificial. Still, fixing this seems
> like something worth doing if further benchmarks confirm benefits
> at this level.
>
> By default, the OS cache and buffers are allocated in the memory
> node with the shortest "distance" from the CPU a process is running
> on. This is determined by the "cpuset" associated with the
> process which reads or writes the disk page. Typically a NUMA
> machine starts with a single cpuset with a policy specifying this
> behavior. Fixing this aspect of things seems like an issue for
> packagers, although we should probably document it for those
> running from their own source builds.
>
> To set an alternate policy for PostgreSQL, you first need to find
> or create the location for cpuset specification, which uses a
> filesystem in a way similar to the /proc directory. On a machine
> with more than one memory node, the appropriate filesystem is
> probably already mounted, although different distributions use
> different filesystem names and mount locations. I will illustrate
> the process on my Ubuntu machine. Even though it has only one
> memory node (and so, this makes no difference), I have it handy at
> the moment to confirm the commands as I put them into the email.
>
> # Sysadmin must create the root cpuset if not already done. (On a
> # system with NUMA memory, this will probably already be mounted.)
> # Location and options can vary by distro.
>
> sudo mkdir /dev/cpuset
> sudo mount -t cpuset none /dev/cpuset
>
> # Sysadmin must create a cpuset for postgres and configure
> # resources. This will normally be all cores and all RAM. This is
> # where we specify that this cpuset will spread pages among its
> # memory nodes.
>
> sudo mkdir /dev/cpuset/postgres
> sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
> sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
> sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"
>
> # Sysadmin must grant permissions to the desired setting(s).
> # This could be by user or group.
>
> sudo chown postgres /dev/cpuset/postgres/tasks
>
> # The pid of postmaster or an ancestor process must be written to
> # the tasks "file" of the cpuset. This can be a shell from which
> # pg_ctl is run, at least for bash shells. It could also be
> # written by the postmaster itself, essentially as an extra pid
> # file. Possible snippet from a service script:
>
> echo $$ >/dev/cpuset/postgres/tasks
> pg_ctl start ...
>
> Where the OS cache is larger than shared_buffers, the above is
> probably more important than the attached patch, which causes the
> main shared memory segment to be spread among all available memory
> nodes. This patch only compiles in the relevant code if configure
> is run using the --with-libnuma option, in which case a dependency
> on the numa library is created. It is v3 to avoid confusion with
> earlier versions I have shared with a few people off-list. (The
> only difference from v2 is fixing bitrot.)
>
> I'll add it to the next CF.

Hm, your patch seems to boil down to interleave_memory(start, size,
numa_all_nodes_ptr) inside PGSharedMemoryCreate(). I've read your
email a couple of times and am a little hazy around a couple of
points, in particular: "the above is probably more important than the
attached patch". So I have a couple of questions:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
to instruct operators to disable zone_reclaim. Will your changes
invalidate any of that advice?

*) is there any downside to enabling --with-libnuma if you have
support? Do you expect packagers will enable it generally? Why not
just always build it in (if configure allows it) and rely on a GUC if
there is some kind of tradeoff (and if there is one, what kinds of
things are you looking for to manage it)?

*) The bash script above, what problem does the 'alternate policy' solve?

*) What kinds of improvements (even if in very general terms) will we
see from better numa management? Are there further optimizations
possible?

merlin


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-09 15:59:03
Message-ID: 1402329543.78687.YahooMailNeo@web122304.mail.ne1.yahoo.com
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:

> Hm, your patch seems to boil down to
>   interleave_memory(start, size, numa_all_nodes_ptr)
> inside PGSharedMemoryCreate().

That's the functional part -- the rest is about not breaking the
builds for environments which are not NUMA-aware.

> I've read your email a couple of times and am a little hazy
> around a couple of points, in particular: "the above is probably
> more important than the attached patch".  So I have a couple of
> questions:
>
> *) There is a lot of advice floating around (for example here:
> http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
> to instruct operators to disable zone_reclaim.  Will your changes
> invalidate any of that advice?

I expect that it will make the need for that far less acute,
although it is probably still best to disable zone_reclaim (based
on the documented conditions under which disabling it makes sense).
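
For anyone following along, checking and disabling it looks about
like this (persisting the setting in /etc/sysctl.conf is the usual
approach):

cat /proc/sys/vm/zone_reclaim_mode     # non-zero = reclaim local node first
sudo sysctl -w vm.zone_reclaim_mode=0  # disable for the running kernel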

> *) is there any downside to enabling --with-libnuma if you have
> support?

Not that I can see.  There are two additional system calls on
postmaster start-up.  I don't expect the time those take to be
significant.

> Do you expect packagers will enable it generally?

I suspect so.

> Why not just always build it in (if configure allows it) and rely
> on a GUC if there is some kind of tradeoff (and if there is one,
> what kinds of things are you looking for to manage it)?

If a build is done on a machine with the NUMA library, and the
executable is deployed on a machine without it, the postmaster will
get an error on the missing library.  I talked about this briefly
with Tom in Ottawa, and he thought that it would be up to packagers
to create a dependency on the library if they build PostgreSQL
using the --with-libnuma option.  The reason to require the option
is so that a build is not created which won't run on target
machines if a packager does nothing to deal with NUMA.
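
For what it's worth, whether a given build carries the runtime
dependency is easy to check (a sketch; pg_config reports the bin
directory of the installation on your PATH):

ldd "$(pg_config --bindir)"/postgres | grep libnuma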

> *) The bash script above, what problem does the 'alternate
> policy' solve?

By default, all OS buffers and cache are placed in the memory node
closest to the process whose read or write first causes each page
to be used.  For something like the cp command, that
probably makes sense.  For something like PostgreSQL it can lead to
unbalanced placement of shared resources (like pages in shared
tables and indexes).

> *) What kinds of improvements (even if in very general terms)
> will we see from better numa management?  Are there further
> optimizations possible?

When I spread both OS cache and PostgreSQL shared memory, I got
about 2% better performance overall for a read-only load on a 4
node system which started with everything on one node.  I used
pgbench and picked a scale which put the database size at about 25%
of machine memory before I initialized the database, so that one
memory node was 100% filled with minimal "spill" to the other
nodes.  The run times between the two cases had very minimal
overlap.  The balanced memory usage had more consistent results;
the unbalanced load had more variable performance timings, with a
rare run beating all the balanced times.

I didn't spend as much time with read/write benchmarks but those
seemed overall worse for the unbalanced load, and one outlier on the
bad side was about 20% below the (again, pretty tightly clustered)
times for the balanced load.

These tests were designed to try to create a pretty bad case for
the unbalanced load in a default cpuset configuration and just an
unlucky sizing of the working set relative to a memory node size.
At PGCon I had a discussion over lunch with someone who saw far
worse performance from unbalanced memory, but he carefully
engineered a really bad case by using one cpuset to force all data
into one node, and then another cpuset to force PostgreSQL to run
only on cores from which access to that node was relatively slow.
If I remember correctly, he saw about 20% of the throughput that way
versus using the same cores with balanced memory usage.  He
conceded that this was a pretty artificial case, and you would
have to be *trying* to hurt performance to set things up that way,
but he wanted to establish a "worst case" so that he had a hard
bounding of what the maximum possible benefit from balancing load
might be.

There is definitely a need for more benchmarks and benchmarks on
more environments, but my preliminary tests all looked favorable to
the combination of this patch and the cpuset changes.  I would have
posted this months ago if I had found enough time to do more
benchmarks and put together a nice presentation of the results, but
I figured it was a good idea to put this in front of people even
with only preliminary results, so that if others were interested in
doing so they could see what results they got in their
environments or with workloads I had not considered.

I will note that given the wide differences I saw between run times
with the unbalanced memory usage, there must be some variable that
matters which I was not properly controlling.  I still haven't
figured out what that was.  It might be something as simple as a
particular process (like the checkpoint or bgwriter process?)
landing on the fully-allocated memory node versus landing somewhere
else.

I will also note that if the buffers and cache are populated by
small OLTP queries running on a variety of cores, the data can be
spread just by happenstance, and in that case this patch should not
be expected to make any difference at all.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-09 16:09:59
Message-ID: 20140609160959.GD8406@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:
> > *) There is a lot of advice floating around (for example here:
> > http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
> > to instruct operators to disable zone_reclaim.  Will your changes
> > invalidate any of that advice?
>
> I expect that it will make the need for that far less acute,
> although it is probably still best to disable zone_reclaim (based
> on the documented conditions under which disabling it makes sense).

I think it'll still be important unless you're running an OLTP workload
(i.e. minimal per backend allocations) and your entire workload fits
into shared buffers. What zone_reclaim > 0 essentially does is to never
allocate memory from remote nodes. I.e. it will throw away all numa node
local OS cache to satisfy a memory allocation (including
pagefaults).
I honestly wouldn't expect this to make a huge difference *wrt*
zone_reclaim_mode.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-09 17:00:08
Message-ID: 1402333208.50953.YahooMailNeo@web122302.mail.ne1.yahoo.com
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:
>>> *) There is a lot of advice floating around (for example here:
>>> http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
>>> to instruct operators to disable zone_reclaim.  Will your changes
>>> invalidate any of that advice?
>>
>> I expect that it will make the need for that far less acute,
>> although it is probably still best to disable zone_reclaim (based
>> on the documented conditions under which disabling it makes sense).
>
> I think it'll still be important unless you're running an OLTP workload
> (i.e. minimal per backend allocations) and your entire workload fits
> into shared buffers. What zone_reclaim > 0 essentially does is to never
> allocate memory from remote nodes. I.e. it will throw away all numa node
> local OS cache to satisfy a memory allocation (including
> pagefaults).

I don't think that cpuset spreading of OS buffers and cache, and
the patch to spread shared memory, will make too much difference
unless the working set is fully cached.  Where I have seen the
biggest problems is when the active set > one memory node and <
total machine RAM.  I would agree that unless this patch is
providing benefit for such a fully-cached load, it won't make any
difference regarding the need for zone_reclaim_mode.  Where the
data is heavily cached, zone_reclaim > 0 might discard some cached
pages to allow, say, a RAM sort to be done in faster memory (for
the current process's core), so it might be a wash or even make
zone_reclaim > 0 a win.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-10 14:54:14
Message-ID: CA+TgmoYaUh0EYUUba7MMcPu7ZwyqG_fRDa6J=PMgyBf76dU7nw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jun 9, 2014 at 1:00 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:
>>>> *) There is a lot of advice floating around (for example here:
>>>> http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
>>>> to instruct operators to disable zone_reclaim. Will your changes
>>>> invalidate any of that advice?
>>>
>>> I expect that it will make the need for that far less acute,
>>> although it is probably still best to disable zone_reclaim (based
>>> on the documented conditions under which disabling it makes sense).
>>
>> I think it'll still be important unless you're running an OLTP workload
>> (i.e. minimal per backend allocations) and your entire workload fits
>> into shared buffers. What zone_reclaim > 0 essentially does is to never
>> allocate memory from remote nodes. I.e. it will throw away all numa node
>> local OS cache to satisfy a memory allocation (including
>> pagefaults).
>
> I don't think that cpuset spreading of OS buffers and cache, and
> the patch to spread shared memory, will make too much difference
> unless the working set is fully cached. Where I have seen the
> biggest problems is when the active set > one memory node and <
> total machine RAM.

But that's precisely the scenario where vm.zone_reclaim_mode != 0 is a
disaster. You'll end up throwing away the cached pages and rereading
the data from disk, even though the memory *could* have been kept all
in cache.

> I would agree that unless this patch is
> providing benefit for such a fully-cached load, it won't make any
> difference regarding the need for zone_reclaim_mode. Where the
> data is heavily cached, zone_reclaim > 0 might discard some cached
> pages to allow, say, a RAM sort to be done in faster memory (for
> the current process's core), so it might be a wash or even make
> zone_reclaim > 0 a win.

I will believe that when, and only when, I see benchmarks convincingly
demonstrating it. Setting zone_reclaim_mode can only be a win if the
performance benefit from using faster memory is greater than the
performance cost of any rereading-from-disk that happens. IME, that's
a highly unusual situation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-10 17:07:23
Message-ID: 53973B4B.10108@agliodbs.com
Lists: pgsql-hackers

On 06/08/2014 03:45 PM, Kevin Grittner wrote:
> By default, the OS cache and buffers are allocated in the memory
> node with the shortest "distance" from the CPU a process is running
> on.

Note that this will stop being the default in future Linux kernels.
However, we'll have to deal with the old ones for some time to come.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-10 17:34:28
Message-ID: 1402421668.47534.YahooMailNeo@web122302.mail.ne1.yahoo.com
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 06/08/2014 03:45 PM, Kevin Grittner wrote:
>> By default, the OS cache and buffers are allocated in the memory
>> node with the shortest "distance" from the CPU a process is
>> running on.
>
> Note that this will stop being the default in future Linux kernels.
> However, we'll have to deal with the old ones for some time to come.

I was not aware of that.  Thanks.  Do you have a URL handy?

In any event, that is the part of the problem which I think falls
into the realm of packagers and/or sysadmins; a patch for that
doesn't seem sensible, given how cpusets are implemented.  I did
figure we would want to add some documentation around it, though.
Do you agree that is worthwhile?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-26 14:18:37
Message-ID: CADyhKSXs+oUetngSbeiM0tVSRy=QeCaSNBQBDbM=SFQTDg+Zog@mail.gmail.com
Lists: pgsql-hackers

Hello,

Let me comment on this patch.

It applies to the head of the master branch, builds, and passes
the regression tests successfully.
What this patch tries to do is quite simple and obvious.
It suggests to the operating system that physical pages be
distributed across every NUMA node on allocation.

One thing that concerns me is that it may conflict with the
NUMA-balancing feature supported in recent Linux kernels, which
migrates physical pages according to the location of the tasks
that reference those pages across NUMA zones.
# I'm not sure whether this applies to shared memory regions.
# Please correct me if I misunderstood, but it looks to me as
# if physical pages in shared memory are also moved.
http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf
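
If I understand correctly, on kernels which have this feature you
can check whether it is active via sysctl (the knob is
kernel.numa_balancing, Linux 3.8+):

cat /proc/sys/kernel/numa_balancing     # 1 = automatic balancing enabled
sudo sysctl -w kernel.numa_balancing=0  # disable it, e.g. for testing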

The interleave policy should probably work well for an OLTP
workload.  But how about an OLAP workload, if physical pages are
migrated transparently by the operating system to the local node?
In the OLAP case, less concurrency is required, but a query runs
complicated logic (usually including full scans) on a particular
CPU.

Doesn't it make sense to have a GUC to control the NUMA policy?
In some cases it makes sense to allocate physical memory
according to the operating system's choice.

Thanks,

2014-06-11 2:34 GMT+09:00 Kevin Grittner <kgrittn(at)ymail(dot)com>:
> Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 06/08/2014 03:45 PM, Kevin Grittner wrote:
>>> By default, the OS cache and buffers are allocated in the memory
>>> node with the shortest "distance" from the CPU a process is
>>> running on.
>>
>> Note that this will stop being the default in future Linux kernels.
>> However, we'll have to deal with the old ones for some time to come.
>
> I was not aware of that. Thanks. Do you have a URL handy?
>
> In any event, that is the part of the problem which I think falls
> into the realm of packagers and/or sysadmins; a patch for that
> doesn't seem sensible, given how cpusets are implemented. I did
> figure we would want to add some documentation around it, though.
> Do you agree that is worthwhile?
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Kevin Grittner <kgrittn(at)ymail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-26 15:19:59
Message-ID: CAGTBQpZC9rLMQ1fnNnTyA-v9TC_=Wu9VW3-i6PCj7aGru-08Jg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 26, 2014 at 11:18 AM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
> One thing I concern is, it may conflict with numa-balancing
> features that is supported in the recent Linux kernel; that
> migrates physical pages according to the location of tasks
> which references the page beyond the numa zone.
> # I'm not sure whether it is applied on shared memory region.
> # Please correct me if I misunderstood. But it looks to me
> # physical page in shared memory is also moved.
> http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf

Sadly, it excludes the OS cache explicitly (when it mentions libc.so),
which is one of the hottest sources of memory bandwidth consumption in
a database.


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-26 16:14:39
Message-ID: 1403799279.67192.YahooMailNeo@web122302.mail.ne1.yahoo.com
Lists: pgsql-hackers

Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:

> Sadly, it excludes the OS cache explicitly (when it mentions libc.so),
> which is one of the hottest sources of memory bandwidth consumption in
> a database.

Agreed.  On the bright side, the packagers and/or sysadmins can fix this
without any changes to the PostgreSQL code, by creating a custom cpuset
and using it during launch of the postmaster.  I went through that
exercise in my original email.  This patch complements that by
preventing one CPU from managing all of PostgreSQL shared memory, and
thus becoming a bottleneck.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Christoph Berg <cb(at)df7cb(dot)de>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-07-01 09:01:04
Message-ID: 20140701090104.GA15590@msg.df7cb.de
Lists: pgsql-hackers

Re: Kevin Grittner 2014-06-09 <1402267501(dot)41111(dot)YahooMailNeo(at)web122304(dot)mail(dot)ne1(dot)yahoo(dot)com>
> @@ -536,6 +539,24 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port,
>  	 */
>  	}
>  
> +#ifdef USE_LIBNUMA
> +	/*
> +	 * If this is not a private segment and we are using libnuma, make the
> +	 * large memory segment interleaved.
> +	 */
> +	if (!makePrivate && numa_available())
> +	{
> +		void	   *start;
> +
> +		if (AnonymousShmem == NULL)
> +			start = memAddress;
> +		else
> +			start = AnonymousShmem;
> +
> +		numa_interleave_memory(start, size, numa_all_nodes_ptr);
> +	}
> +#endif

How much difference would it make if numactl --interleave=all was used
instead of using numa_interleave_memory() on the shared memory
segments? I guess that would make backend-local memory also
interleaved, but it would avoid having a dependency on libnuma in the
packages.

The numactl manpage even has this example:

    numactl --interleave=all bigdatabase arguments
            Run big database with its memory interleaved on all CPUs.

It is probably better to have native support in the postmaster, though
this could be mentioned as an alternative in the documentation.
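
For PostgreSQL that would presumably be something like the
following (the data directory path is a placeholder; the policy is
inherited by every process the postmaster forks):

numactl --interleave=all pg_ctl -D /path/to/data start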

Christoph
--
cb(at)df7cb(dot)de | http://www.df7cb.de/


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Christoph Berg <cb(at)df7cb(dot)de>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-07-01 10:57:52
Message-ID: 20140701105752.GC26930@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:
> Re: Kevin Grittner 2014-06-09 <1402267501(dot)41111(dot)YahooMailNeo(at)web122304(dot)mail(dot)ne1(dot)yahoo(dot)com>
> > @@ -536,6 +539,24 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port,
> >  	 */
> >  	}
> >  
> > +#ifdef USE_LIBNUMA
> > +	/*
> > +	 * If this is not a private segment and we are using libnuma, make the
> > +	 * large memory segment interleaved.
> > +	 */
> > +	if (!makePrivate && numa_available())
> > +	{
> > +		void	   *start;
> > +
> > +		if (AnonymousShmem == NULL)
> > +			start = memAddress;
> > +		else
> > +			start = AnonymousShmem;
> > +
> > +		numa_interleave_memory(start, size, numa_all_nodes_ptr);
> > +	}
> > +#endif
>
> How much difference would it make if numactl --interleave=all was used
> instead of using numa_interleave_memory() on the shared memory
> segments? I guess that would make backend-local memory also
> interleaved, but it would avoid having a dependency on libnuma in the
> packages.

I tested this a while ago, and it's rather painful if you have an OLAP
workload with lots of backend private memory.

> The numactl manpage even has this example:
>
> numactl --interleave=all bigdatabase arguments Run big
> database with its memory interleaved on all CPUs.
>
> It is probably better to have native support in the postmaster, though
> this could be mentioned as an alternative in the documentation.

I wonder if we shouldn't backpatch such a notice.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Christoph Berg <cb(at)df7cb(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-07-01 11:18:12
Message-ID: 1404213492.98740.YahooMailNeo@web122306.mail.ne1.yahoo.com
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:

>> How much difference would it make if numactl --interleave=all
>> was used instead of using numa_interleave_memory() on the shared
>> memory segments? I guess that would make backend-local memory
>> also interleaved, but it would avoid having a dependency on
>> libnuma in the packages.
>
> I tested this a while ago, and it's rather painful if you have
> an OLAP workload with lots of backend private memory.

I'm not surprised; I would expect it to generally have a negative
effect, which would be most pronounced with an OLAP workload.

>> The numactl manpage even has this example:
>>
>>     numactl --interleave=all bigdatabase arguments Run big
>>     database with its memory interleaved on all CPUs.
>>
>> It is probably better to have native support in the postmaster,
>> though this could be mentioned as an alternative in the
>> documentation.
>
> I wonder if we shouldn't backpatch such a notice.

I would want to see some evidence that it was useful first.  In
most of my tests the benefit of interleaving just the OS cache and
PostgreSQL shared_buffers was about 2%.  That could easily be
erased if work_mem allocations and other process-local memory were
not allocated close to the process which was using it.

I expect that the main benefit of this proposed patch isn't the 2%
typical benefit I was seeing, but that it will be insurance against
occasional, much larger hits.  I haven't had much luck making these
worst case episodes reproducible, though.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Christoph Berg <cb(at)df7cb(dot)de>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-07-01 12:02:10
Message-ID: 20140701120210.GC15590@msg.df7cb.de
Lists: pgsql-hackers

Re: Kevin Grittner 2014-07-01 <1404213492(dot)98740(dot)YahooMailNeo(at)web122306(dot)mail(dot)ne1(dot)yahoo(dot)com>
> Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:
>
> >> How much difference would it make if numactl --interleave=all
> >> was used instead of using numa_interleave_memory() on the shared
> >> memory segments? I guess that would make backend-local memory
> >> also interleaved, but it would avoid having a dependency on
> >> libnuma in the packages.
> >
> > I tested this a while ago, and it's rather painful if you have
> > an OLAP workload with lots of backend private memory.
>
> I'm not surprised; I would expect it to generally have a negative
> effect, which would be most pronounced with an OLAP workload.

Ok, then +1 on having this in core, even if it buys us a dependency on
something that isn't in the usual base system after OS install.

> > I wonder if we shouldn't backpatch such a notice.
>
> I would want to see some evidence that it was useful first.  In
> most of my tests the benefit of interleaving just the OS cache and
> PostgreSQL shared_buffers was about 2%.  That could easily be
> erased if work_mem allocations and other process-local memory were
> not allocated close to the process which was using it.
>
> I expect that the main benefit of this proposed patch isn't the 2%
> typical benefit I was seeing, but that it will be insurance against
> occasional, much larger hits.  I haven't had much luck making these
> worst case episodes reproducible, though.

Afaict, the numactl notice will only be useful as a postscriptum to
the --with-libnuma docs, with the caveats mentioned. Or we backpatch
(something like) the full docs of the feature, with a note that it's
only 9.5+. (Or the full feature gets backpatched...)

Christoph
--
cb(at)df7cb(dot)de | http://www.df7cb.de/