Re: lazy vxid locks, v1

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: lazy vxid locks, v1
Date: 2011-06-13 11:55:28
Message-ID: 4DF5FAB0.8060605@kaltenbrunner.cc
Lists: pgsql-hackers

On 06/12/2011 11:39 PM, Robert Haas wrote:
> Here is a patch that applies over the "reducing the overhead of
> frequent table locks" (fastlock-v3) patch and allows heavyweight VXID
> locks to spring into existence only when someone wants to wait on
> them. I believe there is a large benefit to be had from this
> optimization, because the combination of these two patches virtually
> eliminates lock manager traffic on "pgbench -S" workloads. However,
> there are several flies in the ointment.
>
> 1. It's a bit of a kludge. I leave it to readers of the patch to
> determine exactly what about this patch they think is kludgey, but
> it's likely not the empty set. I suspect that MyProc->fpLWLock needs
> to be renamed to something a bit more generic if we're going to use it
> like this, but I don't immediately know what to call it. Also, the
> mechanism whereby we take SInvalWriteLock to work out the mapping from
> BackendId to PGPROC * is not exactly awesome. I don't think it
> matters from a performance point of view, because operations that need
> VXID locks are sufficiently rare that the additional lwlock traffic
> won't matter a bit. However, we could avoid this altogether if we
> rejiggered the mechanism for allocating PGPROCs and backend IDs.
> Right now, we allocate PGPROCs off of linked lists, except for
> auxiliary procs which allocate them by scanning a three-element array
> for an empty slot. Then, when the PGPROC subscribes to sinval, the
> sinval mechanism allocates a backend ID by scanning for the lowest
> unused backend ID in the ProcState array. If we changed the logic for
> allocating PGPROCs to mimic what the sinval queue currently does, then
> the backend ID could be defined as the offset into the PGPROC array.
> Translating between a backend ID and a PGPROC * now becomes a matter
> of pointer arithmetic. Not sure if this is worth doing.
>
> 2. Bad things happen with large numbers of connections. This patch
> increases peak performance, but as you increase the number of
> concurrent connections beyond the number of CPU cores, performance
> drops off faster with the patch than without it. For example, on the
> 32-core loaner from Nate Boley, using 80 pgbench -S clients, unpatched
> HEAD runs at ~36K TPS; with fastlock, it jumps up to about ~99K TPS;
> with this patch also applied, it drops down to about ~64K TPS, despite
> the fact that nearly all the contention on the lock manager locks has
> been eliminated. On Stefan Kaltenbrunner's 40-core box, he was
> actually able to see performance drop down below unpatched HEAD with
> this applied! This is immensely counterintuitive. What is going on?
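
On point 1: if PGPROCs were handed out as slots of a single shared
array and the backend ID were simply defined as the slot offset (plus
one, since backend IDs start at 1), the BackendId -> PGPROC * mapping
really does reduce to pointer arithmetic and the SInvalWriteLock dance
goes away. A rough, self-contained sketch - the names here are made up
for illustration, not taken from the patch or the tree:

#include <stdio.h>

typedef int BackendId;

typedef struct PGPROC
{
    int         pid;            /* stand-in for the real contents */
} PGPROC;

#define MAX_BACKENDS 8

/* hypothetical: all PGPROCs live in one array, slot i holds backend ID i+1 */
static PGPROC ProcSlots[MAX_BACKENDS];

static PGPROC *
ProcFromBackendId(BackendId backendId)
{
    /* backend IDs are 1-based, so slot 0 holds backend ID 1 */
    return &ProcSlots[backendId - 1];
}

int
main(void)
{
    BackendId   id = 3;

    ProcSlots[id - 1].pid = 12345;
    printf("backend %d -> pid %d\n", id, ProcFromBackendId(id)->pid);
    return 0;
}

Whether that is worth the rejiggering is a separate question, given how
rarely anything actually waits on a VXID lock.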

Just to add actual new numbers to the discussion (pgbench -n -S -T 120
-c X -j X) on that particular 40-core/80-thread box:

unpatched:

c1: tps = 7808.098053 (including connections establishing)
c4: tps = 29941.444359 (including connections establishing)
c8: tps = 58930.293850 (including connections establishing)
c16: tps = 106911.385826 (including connections establishing)
c24: tps = 117401.654430 (including connections establishing)
c32: tps = 110659.627803 (including connections establishing)
c40: tps = 107689.945323 (including connections establishing)
c64: tps = 104835.182183 (including connections establishing)
c80: tps = 101885.549081 (including connections establishing)
c160: tps = 92373.395791 (including connections establishing)
c200: tps = 90614.141246 (including connections establishing)

fast locks:

c1: tps = 7710.824723 (including connections establishing)
c4: tps = 29653.578364 (including connections establishing)
c8: tps = 58827.195578 (including connections establishing)
c16: tps = 112814.382204 (including connections establishing)
c24: tps = 154559.012960 (including connections establishing)
c32: tps = 189281.391250 (including connections establishing)
c40: tps = 215807.263233 (including connections establishing)
c64: tps = 180644.527322 (including connections establishing)
c80: tps = 118266.615543 (including connections establishing)
c160: tps = 68957.999922 (including connections establishing)
c200: tps = 68803.801091 (including connections establishing)

fast locks + lazy vxid:

c1: tps = 7828.644389 (including connections establishing)
c4: tps = 30520.558169 (including connections establishing)
c8: tps = 60207.396385 (including connections establishing)
c16: tps = 117923.775435 (including connections establishing)
c24: tps = 158775.317590 (including connections establishing)
c32: tps = 195768.530589 (including connections establishing)
c40: tps = 223308.779212 (including connections establishing)
c64: tps = 152848.742883 (including connections establishing)
c80: tps = 65738.046558 (including connections establishing)
c160: tps = 57075.304457 (including connections establishing)
c200: tps = 59107.675182 (including connections establishing)

So my reading of that is that we currently "only" scale well to ~12
physical cores. The fast locks patch gets us pretty nicely past that
point, to a total speedup of a bit better than 2x, but it degrades
fairly quickly after that, and at 2x the number of threads in the box
we are only able to get 2/3 of the throughput of unpatched -HEAD(!).

With the lazy vxid patch on top, the curve looks even more interesting:
we scale to an even higher peak, but we degrade even worse, and at c80
(which equals the number of threads in the box) we are already down to
the tps that unpatched -HEAD would give at ~10 cores.

Another thing worth noting is that with the patches we have MUCH less
idle CPU - which is good in the cases where we actually get a benefit
(as in higher throughput) - but the extreme case now is fast locks +
lazy vxid, which manages to get us to less than 8% idle at c160 BUT
only 57000 tps, while unpatched -HEAD is 75% idle and doing 92000 tps.
Said otherwise, we need almost 4x the computing resources to get only
2/3 of the performance (so ~7x WORSE on a CPU/tps scale).
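
To spell that arithmetic out (using the c160 numbers and the idle
percentages quoted above; a back-of-the-envelope check only):

#include <stdio.h>

int
main(void)
{
    double      busy_lazy = 1.0 - 0.08; /* fast locks + lazy vxid, c160 */
    double      busy_head = 1.0 - 0.75; /* unpatched -HEAD, c160 */
    double      tps_lazy = 57075.0;
    double      tps_head = 92373.0;

    printf("CPU burned: %.1fx, throughput: %.2fx, CPU per tps: %.1fx worse\n",
           busy_lazy / busy_head,                       /* ~3.7x the CPU */
           tps_lazy / tps_head,                         /* ~0.62x the tps */
           (busy_lazy / busy_head) / (tps_lazy / tps_head));
    return 0;
}

With those exact figures it comes out to roughly 6x; given how
approximate the idle percentages are, anywhere in the ~6-7x range is
the fair reading.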

All those tests were done with pgbench running on the same box - which
has a noticeable impact on the results, because pgbench uses roughly 1
core's worth of CPU per 8 cores of backends tested - though I don't
think it changes the results in a way that would show the performance
behaviour in a different light.

Stefan
