NUMA packaging and patch

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: NUMA packaging and patch
Date: 2014-06-08 22:45:01
Message-ID: 1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com
Lists: pgsql-hackers

I ran into a situation where a machine with 4 NUMA memory nodes and
40 cores had performance problems due to NUMA.  The problems were
worst right after they rebooted the OS and warmed the cache by
running a script of queries to read all tables.  These were all run
on a single connection.  As it turned out, the size of the database
was just over one-quarter of the size of RAM, and with default NUMA
policies both the OS cache for the database and the PostgreSQL
shared memory allocation were placed on a single NUMA segment, so
access to the CPU package managing that segment became a
bottleneck.  On top of that, processes which happened to run on the
CPU package which had all the cached data had to allocate memory
for local use on more distant memory because there was none left in
the more local memory.

Through normal operations, things eventually tended to shift around
and get better (after several hours of heavy use with substandard
performance).  I ran some benchmarks and found that even in
long-running tests, spreading these allocations among the memory
segments showed about a 2% benefit in a read-only load.  The
biggest difference I saw in a long-running read-write load was
about a 20% hit for unbalanced allocations, but I only saw that
once.  I talked to someone at PGCon who managed to engineer much
worse performance hits for an unbalanced load, although the
circumstances were fairly artificial.  Still, fixing this seems
like something worth doing if further benchmarks confirm benefits
at this level.

By default, the OS cache and buffers are allocated in the memory
node with the shortest "distance" from the CPU a process is running
on.  This is determined by the "cpuset" associated with the
process which reads or writes the disk page.  Typically a NUMA
machine starts with a single cpuset with a policy specifying this
behavior.  Fixing this aspect of things seems like an issue for
packagers, although we should probably document it for those
running from their own source builds.
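Before changing any policy, it is worth confirming how many memory
nodes the machine actually has.  A quick check (the numactl package
is assumed to be installed for the first command; the sysfs path is
standard on Linux):

```shell
# Show the NUMA topology: node count, per-node memory sizes, and the
# inter-node "distance" matrix.  Requires the numactl package.
numactl --hardware

# Without numactl, the node directories under sysfs tell the same
# story: one "nodeN" entry per memory node.
ls -d /sys/devices/system/node/node*
```

A machine that shows only node0 is not NUMA, and none of the tuning
below will change its behavior.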

To set an alternate policy for PostgreSQL, you first need to find
or create the location for cpuset specification, which uses a
filesystem in a way similar to the /proc directory.  On a machine
with more than one memory node, the appropriate filesystem is
probably already mounted, although different distributions use
different filesystem names and mount locations.  I will illustrate
the process on my Ubuntu machine.  Even though it has only one
memory node (so the spreading makes no practical difference there),
it is the machine I have handy for confirming the commands as I put
them into this email.

# Sysadmin must create the root cpuset if not already done.  (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.

sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset

# Sysadmin must create a cpuset for postgres and configure
# resources.  This will normally be all cores and all RAM.  This is
# where we specify that this cpuset will spread pages among its
# memory nodes.

sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"

# Sysadmin must grant permissions to the desired setting(s).
# This could be by user or group.

sudo chown postgres /dev/cpuset/postgres/tasks

# The pid of postmaster or an ancestor process must be written to
# the tasks "file" of the cpuset.  This can be a shell from which
# pg_ctl is run, at least for bash shells.  It could also be
# written by the postmaster itself, essentially as an extra pid
# file.  Possible snippet from a service script:

echo $$ >/dev/cpuset/postgres/tasks
pg_ctl start ...
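Before relying on the cpuset, it is worth verifying that the settings
took effect.  A sketch (the data directory path is a placeholder):

```shell
# The page-spreading flag should read back as 1 if the echo above
# succeeded.
cat /dev/cpuset/postgres/memory_spread_page

# The postmaster's pid -- the first line of postmaster.pid -- should
# appear in the cpuset's tasks file, along with its children.
grep -x "$(head -1 /path/to/data/postmaster.pid)" \
    /dev/cpuset/postgres/tasks
```

If the grep finds nothing, the postmaster was started from a shell
that was not in the cpuset, and its cache pages will follow the
default (node-local) policy.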

Where the OS cache is larger than shared_buffers, the above is
probably more important than the attached patch, which causes the
main shared memory segment to be spread among all available memory
nodes.  This patch only compiles in the relevant code if configure
is run using the --with-libnuma option, in which case a dependency
on the numa library is created.  It is v3 to avoid confusion with
earlier versions I have shared with a few people off-list.  (The
only difference from v2 is fixing bitrot.)
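For anyone who wants interleaved shared memory without rebuilding
with the patch, a rough approximation is to start the postmaster
under numactl's interleave policy.  This is coarser than the patch --
it interleaves everything the process allocates, not just the main
shared memory segment -- but it needs no rebuild.  A sketch, with a
placeholder data directory:

```shell
# Interleave all of the postmaster's allocations, including the main
# shared memory segment, round-robin across every NUMA node.
numactl --interleave=all pg_ctl start -D /path/to/data
```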

I'll add it to the next CF.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
numa-interleave-shared-buffers-v3.diff text/x-diff 5.6 KB
