Re: optimal hardware for postgres?

Lists: pgsql-general
From: "Guy Rouillier" <guyr(at)masergy(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Re: optimal hardware for postgres?
Date: 2005-04-26 05:34:02
Message-ID: CC1CF380F4D70844B01D45982E671B239E87BD@mtxexch01.add0.masergy.com

William Yu wrote:

> On other note -- if you are thinking about SMP Opteron, you may
> actually get better performance from 1x275 (Dual Core 2.2ghz) versus
> 2x248 (2.2ghz). Full duals have twice the bandwidth but without good
> NUMA support, memory has to be interleaved between CPUs.

I thought the 2.6 kernel had good NUMA support? (I realize I just made
an assumption that the original poster was running PG on Linux 2.6.)

--
Guy Rouillier


From: William Yu <wyu(at)talisys(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: optimal hardware for postgres?
Date: 2005-04-26 08:32:22
Message-ID: d4kuas$l1q$1@news.hub.org
Lists: pgsql-general

Linux 2.6 does have NUMA support. But whether it's actually a win for
Postgres is debatable due to Postgres's architecture.

First, let's look at how NUMA makes things run faster in a 2x
Opteron system. The idea is that processes running on CPU0 can
access memory attached to that CPU a lot faster than memory attached to
CPU1. So on a NUMA-aware system, a process running on CPU0 can request
that all its memory be located in memory bank 0.

Now let's look at how Postgres manages memory: 1 big shared memory block
and an even bigger OS cache pool. Assuming Postgres decides to
follow the NUMA model -- the main process requests a big 150MB chunk of
shared memory on CPU0. All good and dandy at this point. So let's start
making connections to Postgres. Now where do you put these processes? Do
you continue to pile them onto CPU0, where memory latency is
really low? Or do you use the idle CPU but accept the +40ns latency?

Continue on and think about the cache. One of the design philosophies of
Postgres is that the OS manages the memory. So you've got 7GB of memory
left over -- 3GB is on CPU0 and 4GB on CPU1. When you happen to hit
stuff cached on CPU0, you get extremely fast queries. When you hit stuff
on CPU1, your queries are much slower (although still way faster than
disk).
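
The arithmetic behind that split can be sketched as a toy
back-of-the-envelope model. The 60ns local / 100ns remote figures below
are made-up placeholders (chosen only to match the +40ns remote penalty
mentioned above), not measured Opteron numbers:

```python
# Toy model of the cache-hit latencies described above. The latency
# values are illustrative assumptions, not real measurements.

LOCAL_NS = 60    # assumed latency to memory on the process's own CPU
REMOTE_NS = 100  # assumed latency across the link to the other CPU

def expected_latency(local_gb, remote_gb):
    """Average hit latency if hits are spread evenly over the cache."""
    total = local_gb + remote_gb
    return (local_gb * LOCAL_NS + remote_gb * REMOTE_NS) / total

# NUMA-aware layout from the example: 3GB on CPU0 (local), 4GB on CPU1.
numa = expected_latency(3, 4)

# Interleaved layout: every access has a 50/50 chance of being remote.
interleaved = (LOCAL_NS + REMOTE_NS) / 2

print(f"NUMA split : {numa:.1f} ns average")        # fast or slow, uneven
print(f"Interleaved: {interleaved:.1f} ns average")  # uniform middle
```

Under these assumed numbers the two averages land close together; the
real difference is the variance -- the NUMA split gives some queries the
fast path and some the slow one, while interleaving puts everything in
the middle, which is exactly the trade-off discussed below.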

If your dataset is small and your multi-user load is low (e.g. you have
8GB of RAM but you only need to use 4GB to do all your stuff), Linux can
probably put all your processes onto the "fast" CPU so NUMA could be a
win. But otherwise, what you'd probably end up with is uneven performance
-- sometimes very fast and sometimes very slow. Would you prefer this
kind of behavior over interleaved memory access where everything
performs in the middle? Dunno. Would definitely depend on the server
usage pattern.

Guy Rouillier wrote:
> William Yu wrote:
>
>
>>On other note -- if you are thinking about SMP Opteron, you may
>>actually get better performance from 1x275 (Dual Core 2.2ghz) versus
>>2x248 (2.2ghz). Full duals have twice the bandwidth but without good
>>NUMA support, memory has to be interleaved between CPUs.
>
>
> I thought the 2.6 kernel had good NUMA support? (I realize I just made
> an assumption that the original poster was running PG on Linux 2.6.)
>


From: Marco Colombo <pgsql(at)esiway(dot)net>
To: William Yu <wyu(at)talisys(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: optimal hardware for postgres?
Date: 2005-04-26 10:34:13
Message-ID: 1114511653.12081.42.camel@Frodo.esi
Lists: pgsql-general

On Tue, 2005-04-26 at 01:32 -0700, William Yu wrote:
> Linux 2.6 does have NUMA support. But whether it's actually a win for
> Postgres is debatable due to Postgres's architecture.
>
> First, let's look at how NUMA makes things run faster in a 2x
> Opteron system. The idea is that processes running on CPU0 can
> access memory attached to that CPU a lot faster than memory attached to
> CPU1. So on a NUMA-aware system, a process running on CPU0 can request
> that all its memory be located in memory bank 0.
[...]

This is only part of the truth. You should compare it with real SMP
solutions. The idea is that CPU0 can access directly attached memory
faster than it would on an SMP system, given comparable technology,
of course. So NUMA has a fast path and a slow path, while SMP has only
one uniform, medium path. The whole point is where that SMP path lies.

If it's close to the fast (local) path in NUMA, then NUMA won't pay off
(performance-wise) unless the application is NUMA-aware _and_
NUMA-friendly (which depends on how the application is written, assuming
the underlying problem _can_ be solved in a NUMA-friendly way).

If the SMP path is close to the slow (remote) path in NUMA (for example,
because both have to keep the caches coherent and obey memory barriers
and locks), then NUMA has little to lose for NUMA-unaware or
NUMA-unfriendly applications (worst case), and a lot to gain when some
NUMA-aware optimizations kick in.
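
That break-even argument can be put in numbers with a tiny sketch. All
three latencies here are hypothetical placeholders (the point is the
formula, not the values): the average NUMA latency is f*fast + (1-f)*slow
for a local-access fraction f, and setting that equal to the uniform SMP
latency gives the minimum f for NUMA to win.

```python
# Sketch of the break-even point between NUMA and uniform SMP.
# All latency values are hypothetical, for illustration only.

FAST_NS = 60   # NUMA local (fast path), assumed
SLOW_NS = 100  # NUMA remote (slow path), assumed

def breakeven_local_fraction(smp_ns, fast_ns=FAST_NS, slow_ns=SLOW_NS):
    """Minimum fraction of local accesses for NUMA to beat uniform SMP.

    Solves f*fast + (1-f)*slow = smp for f.
    """
    return (slow_ns - smp_ns) / (slow_ns - fast_ns)

# SMP path near NUMA's fast path (say 70ns): the application must keep
# 75% of its accesses local before NUMA pays off.
print(breakeven_local_fraction(70))   # 0.75

# SMP path near NUMA's slow path (say 95ns) -- caches kept coherent,
# barriers and locks obeyed -- only 12.5% local accesses are needed,
# so NUMA has little to lose even for NUMA-unaware applications.
print(breakeven_local_fraction(95))   # 0.125
```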

I've read some articles referring to the name SUMO (sufficiently
uniform memory organization) that AMD uses to describe their NUMA
design, which seems to imply that their worst case is "sufficiently"
close to the usual SMP timing.

There are other interesting issues in SMP scaling on the software side.
Scaling to N > 8 CPUs might force partitioning at the software level
anyway, in order to reduce the number of locks, both as software objects
(reduce software complexity) and as hardware events (reduce time spent
in useless synchronizations). See:

http://www.bitmover.com/llnl/smp.pdf

This also affects ccNUMA, of course; I'm not saying NUMA avoids this in
any way. But it's a price _both_ have to pay, moving their numbers
towards the worst case anyway (which makes the worst case not so much
worse than the average).

.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo(at)ESI(dot)it