Re: NUMA packaging and patch

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NUMA packaging and patch
Date: 2014-06-09 14:49:03
Message-ID: CAHyXU0x-t9x46baGqoV4Bm=PAgzkWg2-xHng73kX7v1YvDdkxQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> I ran into a situation where a machine with 4 NUMA memory nodes and
> 40 cores had performance problems due to NUMA. The problems were
> worst right after they rebooted the OS and warmed the cache by
> running a script of queries to read all tables. These were all run
> on a single connection. As it turned out, the size of the database
> was just over one-quarter of the size of RAM, and with default NUMA
> policies both the OS cache for the database and the PostgreSQL
> shared memory allocation were placed on a single NUMA segment, so
> access to the CPU package managing that segment became a
> bottleneck. On top of that, processes which happened to run on the
> CPU package which had all the cached data had to allocate memory
> for local use on more distant memory because there was none left in
> the more local memory.
>
> Through normal operations, things eventually tended to shift around
> and get better (after several hours of heavy use with substandard
> performance). I ran some benchmarks and found that even in
> long-running tests, spreading these allocations among the memory
> segments showed about a 2% benefit in a read-only load. The
> biggest difference I saw in a long-running read-write load was
> about a 20% hit for unbalanced allocations, but I only saw that
> once. I talked to someone at PGCon who managed to engineer much
> worse performance hits for an unbalanced load, although the
> circumstances were fairly artificial. Still, fixing this seems
> like something worth doing if further benchmarks confirm benefits
> at this level.
>
> By default, the OS cache and buffers are allocated in the memory
> node with the shortest "distance" from the CPU a process is running
> on. This is determined by the "cpuset" associated with the
> process which reads or writes the disk page. Typically a NUMA
> machine starts with a single cpuset with a policy specifying this
> behavior. Fixing this aspect of things seems like an issue for
> packagers, although we should probably document it for those
> running from their own source builds.
>
> To set an alternate policy for PostgreSQL, you first need to find
> or create the location for cpuset specification, which uses a
> filesystem in a way similar to the /proc directory. On a machine
> with more than one memory node, the appropriate filesystem is
> probably already mounted, although different distributions use
> different filesystem names and mount locations. I will illustrate
> the process on my Ubuntu machine. Even though it has only one
> memory node (and so, this makes no difference), I have it handy at
> the moment to confirm the commands as I put them into the email.
>
> # Sysadmin must create the root cpuset if not already done. (On a
> # system with NUMA memory, this will probably already be mounted.)
> # Location and options can vary by distro.
>
> sudo mkdir /dev/cpuset
> sudo mount -t cpuset none /dev/cpuset
>
> # Sysadmin must create a cpuset for postgres and configure
> # resources. This will normally be all cores and all RAM. This is
> # where we specify that this cpuset will spread pages among its
> # memory nodes.
>
> sudo mkdir /dev/cpuset/postgres
> sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
> sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
> sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"
>
> # Sysadmin must grant permissions to the desired setting(s).
> # This could be by user or group.
>
> sudo chown postgres /dev/cpuset/postgres/tasks
>
> # The pid of postmaster or an ancestor process must be written to
> # the tasks "file" of the cpuset. This can be a shell from which
> # pg_ctl is run, at least for bash shells. It could also be
> # written by the postmaster itself, essentially as an extra pid
> # file. Possible snippet from a service script:
>
> echo $$ >/dev/cpuset/postgres/tasks
> pg_ctl start ...
>
> Where the OS cache is larger than shared_buffers, the above is
> probably more important than the attached patch, which causes the
> main shared memory segment to be spread among all available memory
> nodes. This patch only compiles in the relevant code if configure
> is run using the --with-libnuma option, in which case a dependency
> on the numa library is created. It is v3 to avoid confusion with
> earlier versions I have shared with a few people off-list. (The
> only difference from v2 is fixing bitrot.)
>
> I'll add it to the next CF.
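
The cpuset steps quoted above could be collected into one helper, roughly
as sketched below. This is only an illustration: the function name
setup_pg_cpuset and its arguments are hypothetical, and the cpuset root is
a parameter so the function can be dry-run against a scratch directory
before it is pointed at the real mount. On systems where the cpuset
hierarchy is mounted through cgroups, the control files may carry a
"cpuset." prefix instead, so adjust the names to the local layout.

```shell
#!/bin/sh
# Sketch only: the function name and argument order are hypothetical.
# Usage: setup_pg_cpuset ROOT CPUS MEMS
#   ROOT  mount point of the cpuset filesystem (e.g. /dev/cpuset)
#   CPUS  CPU list for the cpuset (e.g. 0-3)
#   MEMS  memory node list for the cpuset (e.g. 0)
setup_pg_cpuset() {
    root=$1 cpus=$2 mems=$3
    dir="$root/postgres"

    mkdir -p "$dir" || return 1

    # Give the cpuset its CPUs and memory nodes.
    echo "$cpus" >"$dir/cpus" || return 1
    echo "$mems" >"$dir/mems" || return 1

    # Ask the kernel to spread page-cache pages across the cpuset's
    # memory nodes instead of filling the local node first.
    echo 1 >"$dir/memory_spread_page" || return 1
}
```

On the real mount this would be run as root, for example
"setup_pg_cpuset /dev/cpuset 0-3 0", followed by the chown of the tasks
file and the "echo $$" step from the quoted service-script snippet.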

Hm, your patch seems to boil down to numa_interleave_memory(start,
size, numa_all_nodes_ptr) inside PGSharedMemoryCreate(). I've read
your email a couple of times and am still hazy on a couple of points,
in particular: "the above is probably more important than the attached
patch". So I have a few questions:
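
For comparison, an interleave policy can also be requested from outside
the server with numactl. This is not what the patch does: the policy set
this way applies to every allocation made by the started process and its
children, not only to the main shared memory segment. The wrapper name
run_interleaved is hypothetical, and the sketch falls back to the default
policy when numactl is missing or the policy cannot be set.

```shell
#!/bin/sh
# Sketch only: run a command under "numactl --interleave=all" when that
# is possible, otherwise run it unchanged. The function name is
# hypothetical.
run_interleaved() {
    # Probe first: numactl may be absent, or setting a memory policy
    # may not be permitted in this environment.
    if command -v numactl >/dev/null 2>&1 &&
       numactl --interleave=all true >/dev/null 2>&1; then
        numactl --interleave=all "$@"
    else
        echo "interleave policy unavailable; using the default policy" >&2
        "$@"
    fi
}
```

A service script could then start the server with something like
"run_interleaved pg_ctl start ...".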

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
telling operators to disable zone_reclaim_mode. Will your changes
invalidate any of that advice?

*) Is there any downside to enabling --with-libnuma if the platform
supports it? Do you expect packagers will enable it generally? Why not
just always build it in (if configure allows it) and rely on a GUC if
there is some kind of tradeoff (and if there is one, what kinds of
things are you looking for to manage it)?

*) What problem does the 'alternate policy' in the bash script above solve?

*) What kinds of improvements (even if in very general terms) will we
see from better NUMA management? Are there further optimizations
possible?
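
On the zone_reclaim question, the current setting can be read from
/proc/sys/vm/zone_reclaim_mode on Linux. A minimal sketch of such a check
follows; the function name zone_reclaim_status and its optional file
argument are hypothetical (the argument exists only so the check can be
tried against a test file).

```shell
#!/bin/sh
# Sketch only: report whether zone reclaim is enabled.
# Usage: zone_reclaim_status [FILE]
#   FILE defaults to /proc/sys/vm/zone_reclaim_mode
zone_reclaim_status() {
    file=${1:-/proc/sys/vm/zone_reclaim_mode}

    # The file may be absent, e.g. on kernels built without NUMA support.
    if [ ! -r "$file" ]; then
        echo "unavailable"
        return 0
    fi

    mode=$(cat "$file")
    if [ "$mode" = "0" ]; then
        echo "disabled"
    else
        echo "enabled ($mode)"
    fi
}
```

If it reports "enabled", the advice referred to above amounts to
"sysctl -w vm.zone_reclaim_mode=0" as root, plus a matching entry in
/etc/sysctl.conf to make it persistent.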

merlin
