Spinlocks, yet again: analysis and proposed patches

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-11 21:59:49
Message-ID: 10277.1126475989@sss.pgh.pa.us
Lists: pgsql-hackers

The test case I just posted shows that our spinlock code, which we had
thought largely done, is once again becoming a performance bottleneck.
It's time to resurrect some of the ideas we kicked around in early
2002, and didn't pursue because we decided spinlocks weren't our major
performance problem.

I started investigating this by trying to understand why the "futex
spinlock" patch had helped the Red Hat customer I mentioned, when it
hadn't done anything much in last year's experiments. It turns out
that this customer was using an 8-way Opteron, and the futex patch
helps a lot more on that architecture than others. Some random
experimentation eventually turned up part of the reason: the x86 TAS
assembly code we are using is pessimal for x86_64. Specifically,
s_lock.h uses this for both architectures:

/* Use a non-locking test before asserting the bus lock */
__asm__ __volatile__(
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
"1: \n"

and it turns out that deleting the cmpb and jne makes things go
significantly faster on Opterons. So the "non-locking test" is
not a win at all on x86_64. As best I can tell from testing,
removing it wins because it causes the rate of spinlock delays
to drop significantly. Why this should be is not completely
obvious. I speculate that the Opterons are capable of pulling
in a copy of the cache line that contains the spinlock and then
running 100 iterations of the cmpb test without ever noticing
that the processor holding the lock has now released it. Without
the cmpb, they are forced by the locking xchgb test to actually
look at the real state of the spinlock each time.

I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles. Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.
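
For concreteness, a TAS without the non-locking test reduces to just the
locked exchange; the sketch below is illustrative only (the slock_t typedef
and operand constraints are written from memory, not copied from the
attached patch):

    typedef unsigned char slock_t;

    static __inline__ int
    tas(volatile slock_t *lock)
    {
        register slock_t _res = 1;

        /* No cmpb/jne pre-test: always assert the bus lock. */
        __asm__ __volatile__(
            "   lock            \n"
            "   xchgb   %0,%1   \n"
            : "+q"(_res), "+m"(*lock)
            :
            : "memory", "cc");
        return (int) _res;
    }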

The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does. However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.

I looked into several other methods of doing the spinlock delay
instead. I think all of these were suggested at one point or
another in our earlier discussions of spinlocks:

1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled. This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it. It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil. Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor. In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)

2. Reduce the length of the select() delays in s_lock. The current code
delays in quanta of 10msec, but on recent Linuxen (and I think other
platforms too) the scheduling quantum is 1msec, so we can get the
processor back faster if we ask for a 1msec delay. I tried this and it
is definitely a win on Linux; it doesn't seem to hurt anything on older
systems either, they just round the delay up to 10msec like before.
So I think we should do this, even though it's only a partial answer.

3. Modify the spin loop to do a little more work between TAS() tests.
In the existing code there are only about half a dozen instructions
total in the normal spin loop, most of them very fast, with the result
that the spinning processor does its best to monopolize the system bus
with locked probe instructions. This is obviously not good, as it'll
interfere with the ability of the spinlock holder to complete its work
and release the lock. (The bulk of the spinlock uses are for LWLocks,
and with the current data structures the LWLock's own state is usually
going to be in the same cache line as the spinlock proper, so that
cache grabs for the spinlock hurt the LWLock state updating as well
as the eventual spinlock release.) I modified the code to use an
integer modulo operation as part of the test for SPINS_PER_DELAY;
on most processors this should keep the processor's attention and
keep it away from the system bus for several times longer than an
average instruction. (It's relevant here that the "rep;nop" used
to slow the spin loop on recent Intel processors appears to be
completely ignored by Opterons. Basically we need a non-hardware-
specific trick to slow the loop.)

4. Modify SPINS_PER_DELAY. Looping more times instead of entering the
kernel is a win for SMP boxes: whenever you give up and call the kernel
to delay, it's likely that the spinlock holder will have released the
lock before you even get all the way into the kernel, let alone get
rescheduled again. I found that pushing SPINS_PER_DELAY above 200
caused the select delay rate to fall to almost nil in this test case.
However, that would be counterproductive for uniprocessors where the
optimal strategy is normally to give up the processor immediately.
We've talked about making the spin count user configurable, but I never
cared for that answer. Instead, I've worked up some code to
automatically adjust the spin count depending on whether we seem to be
in a uniprocessor or multiprocessor. The idea is to decrease the spin
count (down to a lower limit) whenever s_lock has to delay in order to
get a lock, and to increase the count (up to a limit) whenever we see
the lock come free before we've had to delay. The former condition
suggests that we are in a uniprocessor and the latter suggests we are in
a multiprocessor. I recall having complained that this idea would
probably be unstable, but in testing it seems to do pretty nicely at
converging to the minimum spin count on a 1-CPU machine and the maximum
limit on an SMP box. It could do with more testing though to see if it
works the same for other people.
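
To make the shape of #2, #3, and #4 concrete, here is a rough sketch of such
a delay loop. This is not the attached spin-delay patch; the constants,
names, and the way the pieces are combined are illustrative only, and a GCC
atomic builtin stands in for the real TAS macro from s_lock.h:

    #include <sys/select.h>
    #include <stdlib.h>

    typedef unsigned char slock_t;
    /* Stand-in for the s_lock.h TAS macro; nonzero means the lock was held. */
    #define TAS(lock)   __sync_lock_test_and_set((lock), 1)

    #define MIN_SPINS_PER_DELAY  10
    #define MAX_SPINS_PER_DELAY  1000
    #define NUM_DELAYS           1000
    #define DELAY_MSEC           1          /* point #2: 1 msec, not 10 */

    static int spins_per_delay = 100;       /* adapted at runtime (point #4) */

    static void
    millisleep(int msec)
    {
        struct timeval tv;

        tv.tv_sec = msec / 1000;
        tv.tv_usec = (msec % 1000) * 1000;
        (void) select(0, NULL, NULL, NULL, &tv);
    }

    void
    s_lock_sketch(volatile slock_t *lock)
    {
        int     spins = 0;
        int     delays = 0;

        while (TAS(lock))
        {
            /*
             * Point #3: the integer modulo keeps the processor busy between
             * locked probes instead of hammering the bus with back-to-back
             * xchgb's.
             */
            if ((++spins % spins_per_delay) == 0)
            {
                if (++delays > NUM_DELAYS)
                    abort();                /* "stuck spinlock" in real code */

                millisleep(DELAY_MSEC);     /* point #2 */

                /* Point #4: having to sleep suggests a uniprocessor. */
                if (spins_per_delay > MIN_SPINS_PER_DELAY)
                    spins_per_delay--;
            }
        }

        /* Point #4: getting the lock without sleeping suggests SMP. */
        if (delays == 0 && spins_per_delay < MAX_SPINS_PER_DELAY)
            spins_per_delay++;
    }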

My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from

N, runtime: 1 36s 2 61s 4 105s 8 198s

to

N, runtime: 1 35s 2 48s 4 58s 8 105s

At 4 queries, CPU utilization is 100% and the context swap rate nearly
nil, which seems to mean we've gained another (temporary no doubt)
victory over the spinlock problem.

I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4. The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips. (If so, we'd just make it conditional to x86_64.)

Comments and testing invited.

regards, tom lane

Attachment            Content-Type   Size
slock-no-cmpb.patch   text/plain     637 bytes
spin-delay.patch      text/plain     12.9 KB

From: Kurt Roeckx <kurt(at)roeckx(dot)be>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-11 22:23:46
Message-ID: 20050911222346.GA27800@roeckx.be
Lists: pgsql-hackers

On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
>
> I kinda suspect that the cmpb test is a no-op or loss on all
> Intelish processors: it can only be a win if you expect a lot
> of contention for the spin lock, but in percentage terms we still
> have a very low conflict rate, so in most executions of the TAS
> macro, the cmpb is just wasted cycles. Bottom line: we definitely
> don't want it for x86_64, and maybe not at all, but we need more
> research to decide the latter.

I think an important question is whether this is for x86_64 in
general, or Opteron-specific. It could be that it's not the same
on Intel's EM64Ts.

One reason this might behave differently for Opterons is that
they are cc-NUMA machines instead of "normal" SMP.

Something else to consider is the OS you're using. I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.

Kurt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kurt Roeckx <kurt(at)roeckx(dot)be>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 00:54:44
Message-ID: 11083.1126486484@sss.pgh.pa.us
Lists: pgsql-hackers

Kurt Roeckx <kurt(at)roeckx(dot)be> writes:
> On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
>> I kinda suspect that the cmpb test is a no-op or loss on all
>> Intelish processors:

> I think an important question is wether this is for x86_64 in
> general, of opteron specific. It could be that it's not the same
> on Intel's EM64Ts.

Good point --- anyone have one to try?

> Something else to consider is the OS you're using. I've been
> told that Linux isn't that good in NUMA and FreeBSD might be
> better.

It's hard to see how the OS could affect behavior at the level of
processor cache operations --- unless they did something truly
spectacularly stupid, like mark main memory non-cacheable.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kurt Roeckx <kurt(at)roeckx(dot)be>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 01:15:55
Message-ID: 20050912011555.GE6026@ns.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Kurt Roeckx <kurt(at)roeckx(dot)be> writes:
> > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> >> I kinda suspect that the cmpb test is a no-op or loss on all
> >> Intelish processors:
>
> > I think an important question is wether this is for x86_64 in
> > general, of opteron specific. It could be that it's not the same
> > on Intel's EM64Ts.
>
> Good point --- anyone have one to try?

I've got one I can test on. I need to upgrade the kernel and some other
things on it though (it's running 2.6.8 atm, and an older
Debian/unstable which I should probably bring up to current).

I'll work on it starting now and post results once I get some.

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 02:39:46
Message-ID: 20050912023946.GF6026@ns.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> My proposal therefore is to do #2, #3, and #4, and to modify the TAS
> assembly code at least on x86_64. Together, these changes bring
> my test case on a 4-way Opteron from
>
> N, runtime: 1 36s 2 61s 4 105s 8 198s

em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:

N, runtime: 1 31s 2 47s 4 86s 8 159s

> to
>
> N, runtime: 1 35s 2 48s 4 58s 8 105s

N, runtime: 1 32s 2 53s 4 90s 8 169s

CPU utilization is definitely higher when running with the patch though.
Hope this helps, happy to do additional testing if necessary.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 03:16:24
Message-ID: 12223.1126494984@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:

> N, runtime: 1 31s 2 47s 4 86s 8 159s

> N, runtime: 1 32s 2 53s 4 90s 8 169s

Er, which (or both) of the two patches did you apply here?

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 03:17:51
Message-ID: 20050912031751.GG6026@ns.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
>
> > N, runtime: 1 31s 2 47s 4 86s 8 159s
>
> > N, runtime: 1 32s 2 53s 4 90s 8 169s
>
> Er, which (or both) of the two patches did you apply here?

Applied both, sorry that wasn't clear.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 03:20:41
Message-ID: 12266.1126495241@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> Er, which (or both) of the two patches did you apply here?

> Applied both, sorry that wasn't clear.

Thanks. If you've got the time, could you try the two patches
separately and see what you get?

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 03:37:06
Message-ID: 20050912033706.GH6026@ns.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> >> Er, which (or both) of the two patches did you apply here?
>
> > Applied both, sorry that wasn't clear.
>
> Thanks. If you've got the time, could you try the two patches
> separately and see what you get?

Sure.

CVS Head:

N, runtime: 1 31s 2 47s 4 86s 8 159s

With just slock-no-cmpb.patch:

N, runtime: 1 32s 2 39s 4 82s 8 167s

With just spin-delay.patch

N, runtime: 1 32s 2 52s 4 94s 8 164s

With both:

N, runtime: 1 32s 2 53s 4 90s 8 169s

Hope that helps,

Stephen


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kurt Roeckx <kurt(at)roeckx(dot)be>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 03:40:11
Message-ID: 87ll232btw.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> > Something else to consider is the OS you're using. I've been
> > told that Linux isn't that good in NUMA and FreeBSD might be
> > better.
>
> It's hard to see how the OS could affect behavior at the level of
> processor cache operations --- unless they did something truly
> spectacularly stupid, like mark main memory non-cacheable.

Well it could schedule processes on processors in ways that force less than
optimal memory usage patterns.

But maybe you should tell the Altix folk with their 32-processor 384Gb NUMA
machines what you've "been told" about Linux not being that good in NUMA.
Really, these kinds of cargo-cult anecdotes are pretty pointless.

--
greg


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 03:58:17
Message-ID: 20050912035817.GI6026@ns.snowman.net
Lists: pgsql-hackers

* Stephen Frost (sfrost(at)snowman(dot)net) wrote:
> > Thanks. If you've got the time, could you try the two patches
> > separately and see what you get?
>
> Sure.
[...]

Just to be clear- this was from a completely default 'make install'
using the Debian configure options from 8.0.3 (which aren't that
particularly interesting really- nls, integer-datetimes, debug,
disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld,
enable-thread-safety). If it'd be useful for the test I can adjust
whichever parameters are appropriate (work_mem, cache_size, etc).

There's absolutely nothing else going on except for these tests, a few
shells, top, etc, on the box.

Enjoy,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 04:12:46
Message-ID: 12656.1126498366@sss.pgh.pa.us
Lists: pgsql-hackers

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> Some changes are based on tests and heuristics, so can we make them into the
> configure script so the choice could be made automatically?

It's a bit premature to propose that, when we don't yet know if the
suggested changes are a win or loss under any conditions, let alone
how to automatically tell the difference. Right now we are in the
find-out-the-facts mode ...

regards, tom lane


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 05:01:39
Message-ID: 43250BB3.8020705@paradise.net.nz
Lists: pgsql-hackers

Tom Lane wrote:

> I attach two proposed patches: the first removes the cmpb/jne from
> the x86 TAS assembly code, and the second one does the s_lock changes
> enumerated as points #2, #3, #4. The first one in particular needs
> more testing to see if it hurts performance on any non-Opteron x86
> chips. (If so, we'd just make it conditional to x86_64.)
>

2x PIII 1G 2G Freebsd 6.0Beta4

8.1beta1 (2005-08-28):

N runtime: 1 85s 2 139s 4 220s

8.1beta1 (2005-08-28) + patch 1 (s_lock.h only)

N runtime: 1 89s 2 137s 4 229s

8.1beta1 (2005-08-28) + patch 2

N runtime: 1 84s 2 108s 4 214s

Observe the interesting little speed improvement for patch 2 with 2
processes (seems to be repeatable).

Let me know if you want to see vmstat output for any of these.

regards

Mark


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 06:30:44
Message-ID: dg2sp7$1qv2$1@news.hub.org
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> My proposal therefore is to do #2, #3, and #4, and to modify the TAS
> assembly code at least on x86_64. Together, these changes bring
> my test case on a 4-way Opteron from
>

Some changes are based on tests and heuristics, so can we make them into the
configure script so the choice could be made automatically?

Regards,
Qingqing


From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 18:37:54
Message-ID: 022a01c5b7c9$170a07d0$0f01a8c0@zaphod
Lists: pgsql-hackers

Tom Lane wrote:
> Comments and testing invited.

I have tested the patches on a dual Xeon 2.4 GHz w/ HT (no EM64T).
(Configured with
"CFLAGS='-O2 -mcpu=pentium4 -march=pentium4' --enable-cassert".)
The results were pretty stable (within about .2 seconds). I would not trust the
numbers for N=2: Linux (at least 2.4) is not good at avoiding scheduling two
running processes on the two HTs of the same core. Those values also
had the most variance (> 1s). All other measurements were quite stable over
several runs.

CVS tip from 2005-09-12 ~16:00
1: 57s 2: 82s 4: 124s 8: 237s

with only slock-no-cmpb.patch applied
1: 55s 2: 79s 4: 119s 8: 229s

with only spin-delay.patch applied
1: 56s 2: 79s 4: 124s 8: 235s

with both patches applied
1: 55s 2: 78s 4: 124s 8: 235s

compare to 7.4.8 on the same machine ;-)
1: 92s 2: 235s 4: 474s 8: did not try ...

It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch
does not really help much on this machine. That seems to match Stephen
Frost's results with EM64T, if I read them correctly.

The cs rate is about 150 on CVS tip without patches and below 100 with the
patches (all three cases).
With 7.4.8 it's 230000-280000 with N>1. 8.1 is clearly the winner here. Great
work, Tom.

I hope some more data helps.

Best Regards,
Michael Paesold


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Michael Paesold" <mpaesold(at)gmx(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 18:48:27
Message-ID: 23617.1126550907@sss.pgh.pa.us
Lists: pgsql-hackers

"Michael Paesold" <mpaesold(at)gmx(dot)at> writes:
> It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch
> does not really help much on this machine. That seems to match Stephen
> Frost's results with EM64T, if I read them correctly.

Yeah, it's interesting that you both see slock-no-cmpb as saving some
cycles and the second patch as giving them back. I wonder whether the
integer-modulo-to-slow-the-loop trick is counterproductive on your
machines. You both seem to be using hardware that will recognize
rep;nop and maybe that's all that's needed.
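
(For reference, the rep;nop hint amounts to something like the sketch below
in the per-iteration delay -- illustrative only, not the literal s_lock.h
macro:

    static __inline__ void
    spin_delay(void)
    {
        /* "rep; nop" == PAUSE: tells the CPU this is a spin-wait loop */
        __asm__ __volatile__(" rep; nop \n");
    }
)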

I probably should have broken down the spindelay patch into multiple
components. But it's only a small change --- could you try simplifying
the patched line

if ((--spins % MAX_SPINS_PER_DELAY) == 0)

to

if (--spins == 0)

and see how the patch does that way?

regards, tom lane


From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Stephen Frost" <sfrost(at)snowman(dot)net>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 19:12:50
Message-ID: 023701c5b7cd$f8be31c0$0f01a8c0@zaphod
Lists: pgsql-hackers

Tom Lane wrote:

> I probably should have broken down the spindelay patch into multiple
> components. But it's only a small change --- could you try simplifying
> the patched line
>
> if ((--spins % MAX_SPINS_PER_DELAY) == 0)
>
> to
>
> if (--spins == 0)
>
> and see how the patch does that way?

I'll do tomorrow morning (CEST, i.e. in about 11 hours).
Best Regards,
Michael Paesold


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Michael Paesold" <mpaesold(at)gmx(dot)at>, pgsql-hackers(at)postgresql(dot)org, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-12 23:22:38
Message-ID: 6509.1126567358@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> ... and see how the patch does that way?

BTW, please do look at "vmstat 1" while running the test case
corresponding to your number of processors. It's hard to tell from the
runtime alone whether the patch is fully accomplishing its goal of
reducing wasted cycles. If you see user CPU percent go to 100 and
context swaps drop to a low value, though, you know it's working.

regards, tom lane


From: Marko Kreen <marko(at)l-t(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 07:31:19
Message-ID: 20050913073119.GA29529@l-t.ee
Lists: pgsql-hackers

On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> The second reason that the futex patch is helping is that when
> a spinlock delay does occur, it allows the delaying process to be
> awoken almost immediately, rather than delaying 10 msec or more
> as the existing code does. However, given that we are only expecting
> the spinlock to be held for a couple dozen instructions, using the
> kernel futex mechanism is huge overkill --- the in-kernel overhead
> to manage the futex state is almost certainly several orders of
> magnitude more than the delay we actually want.

Why do you think so? AFAIK in the uncontended case there will be no
kernel access, only an atomic inc/dec. In the contended case you'll
want a task switch anyway, so the futex management should not
matter. Also, this mechanism is specifically optimized for
inter-process locking; I don't think you can get a more efficient
mechanism out of the side effects of generic syscalls.
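
The usual futex-based mutex pattern being referred to here looks roughly
like the sketch below (essentially the scheme from Drepper's "Futexes Are
Tricky" paper; GCC __sync builtins stand in for the atomic ops, and this is
not the futex patch that was actually benchmarked):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* lock word: 0 = free, 1 = locked, 2 = locked with waiters */

    static void
    futex_lock(volatile int *f)
    {
        int c = __sync_val_compare_and_swap(f, 0, 1);

        if (c == 0)
            return;                     /* uncontended: no kernel entry */

        if (c != 2)
            c = __sync_lock_test_and_set(f, 2);     /* atomic exchange */
        while (c != 0)
        {
            /* sleep in the kernel until the lock word changes from 2 */
            syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
            c = __sync_lock_test_and_set(f, 2);
        }
    }

    static void
    futex_unlock(volatile int *f)
    {
        if (__sync_fetch_and_sub(f, 1) != 1)
        {
            /* there may be waiters: release, then wake one of them */
            *f = 0;
            syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
        }
    }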

If you don't want Linux-specific locking in core code, then
it's another matter...

> 1. Use sched_yield() if available: it does just what we want,
> ie, yield the processor without forcing any useless time delay
> before we can be rescheduled. This doesn't exist everywhere
> but it exists in recent Linuxen, so I tried it. It made for a
> beautiful improvement in my test case numbers: CPU utilization
> went to 100% and the context swap rate to almost nil. Unfortunately,
> I also saw fairly frequent "stuck spinlock" panics when running
> more queries than there were processors --- this despite increasing
> NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
> anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
> (I speculate that it's set up to only yield the processor to other
> processes already affiliated to that processor. In any case, it
> is definitely capable of getting through 10000 yields without
> running the guy who's holding the spinlock.)

This is the intended behaviour of sched_yield.

http://lwn.net/Articles/31462/
http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2

--
marko


From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: "Michael Paesold" <mpaesold(at)gmx(dot)at>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Stephen Frost" <sfrost(at)snowman(dot)net>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 10:37:31
Message-ID: 016101c5b84f$261e3c10$0f01a8c0@zaphod
Lists: pgsql-hackers

I wrote:
> I'll do tomorrow morning (CEST, i.e. in about 11 hours).

These are the tests with the change:
> if ((--spins % MAX_SPINS_PER_DELAY) == 0)
>
> to
>
> if (--spins == 0)

I have called the resulting patch (spin-delay + this change) spin-delay-2.

again with only slock-no-cmpb applied
1: 55s 4: 119s (cpu ~97%)

with only spin-delay-2 applied
1: 56s 4: 125s (cpu ~99.5%)

with slock-no-cmpb and spin-delay-2 applied
1: 56s 4: 125s (cpu ~100%)

(Note that the CPU averages are not calculated but estimated by looking at
vmstat.)

Yesterdays results:
> CVS tip from 2005-09-12 ~16:00
> 1: 57s 2: 82s 4: 124s 8: 237s
>
> with only slock-no-cmpb.patch applied
> 1: 55s 2: 79s 4: 119s 8: 229s
>
> with only spin-delay.patch applied
> 1: 56s 2: 79s 4: 124s 8: 235s
>
> with both patches applied
> 1: 55s 2: 78s 4: 124s 8: 235s

To have other data, I have retested the patches on a single-CPU Intel P4
3 GHz w/ HT (i.e. 2 virtual CPUs), no EM64T. Comparing to the 2.4 GHz dual-Xeon
results, it's clear that this is in reality only one CPU. While the runtime
for N=1 is better than on the other system, for N=4 it's already worse. The
situation with the patches is quite different, though. Unfortunately.

CVS tip from 2005-09-12:
1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%)

only slock-no-cmpb:
1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%)
(doesn't help this time)

only spin-delay:
1: 36s 2: 86s (cpu =100%) 4: 157s (cpu =100%)
(bad runtime for N=2 (repeatable), cpu not doing real work here?)

slock-no-cmpb and spin-delay:
1: 36s 2: 106s (cpu =100%) 4: 192s (cpu =100%)
(it gets worse)

only spin-delay-2:
1: 36s 2: 85s (cpu =100%) 4: 160s (cpu =100%)
(quite the same as spin-delay)

slock-no-cmpb and spin-delay-2:
1: 36s 2: 109s (cpu =100%) 4: 198s (cpu =100%)
(worse again)

The CS rate was low (20-50) for all tests, increasing for N>2, which is to be
expected. For this system I see no case for applying these patches.

Is there a portable way to detect the CPU we are running on? Do you have any
other ideas for how to implement the delays?

Best Regards,
Michael Paesold


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Marko Kreen <marko(at)l-t(dot)ee>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 13:47:46
Message-ID: 87hdcp13lp.fsf@stark.xeocode.com
Lists: pgsql-hackers

Marko Kreen <marko(at)l-t(dot)ee> writes:

> > (I speculate that it's set up to only yield the processor to other
> > processes already affiliated to that processor. In any case, it
> > is definitely capable of getting through 10000 yields without
> > running the guy who's holding the spinlock.)

Maybe it should try sched_yield once and then use select after that?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <marko(at)l-t(dot)ee>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 14:10:13
Message-ID: 22767.1126620613@sss.pgh.pa.us
Lists: pgsql-hackers

Marko Kreen <marko(at)l-t(dot)ee> writes:
> On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
>> However, given that we are only expecting
>> the spinlock to be held for a couple dozen instructions, using the
>> kernel futex mechanism is huge overkill --- the in-kernel overhead
>> to manage the futex state is almost certainly several orders of
>> magnitude more than the delay we actually want.

> Why do you think so? AFAIK on uncontented case there will be no
> kernel access, only atomic inc/dec.

In the uncontended case, we never even enter s_lock() and so the entire
mechanism of yielding is irrelevant. The problem that's being exposed
by these test cases is that on multiprocessors, you can see a
significant rate of spinlock contention (order of 100 events/second,
which is still a tiny fraction of the number of TAS calls) and our
existing mechanism for dealing with contention is just not efficient
enough.

> On contented case you'll want task switch anyway, so the futex
> managing should not matter.

No, we DON'T want a task switch. That's the entire point: in a
multiprocessor, it's a good bet that the spinlock is held by a task
running on another processor, and doing a task switch will take orders
of magnitude longer than just spinning until the lock is released.
You should yield only after spinning long enough to make it a strong
probability that the spinlock is held by a process that's lost the
CPU and needs to be rescheduled.

> If you don't want Linux-specific locking in core code, then
> it's another matter...

Well, it's true, we don't particularly want a one-platform solution,
but if it did what we wanted we might hold our noses and use it anyway.

(I think, BTW, that using futexes at the spinlock level is misguided;
what would be interesting would be to see if we could substitute for
both LWLock and spinlock logic with one futex-based module.)

>> I also saw fairly frequent "stuck spinlock" panics when running
>> more queries than there were processors --- this despite increasing
>> NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
>> anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
>> (I speculate that it's set up to only yield the processor to other
>> processes already affiliated to that processor. In any case, it
>> is definitely capable of getting through 10000 yields without
>> running the guy who's holding the spinlock.)

> This is intended behaviour of sched_yield.

> http://lwn.net/Articles/31462/
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2

No; that page still says specifically "So a process calling
sched_yield() now must wait until all other runnable processes in the
system have used up their time slices before it will get the processor
again." I can prove that that is NOT what happens, at least not on
a multi-CPU Opteron with current FC4 kernel. However, if the newer
kernels penalize a process calling sched_yield as heavily as this page
claims, then it's not what we want anyway ...

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Marko Kreen <marko(at)l-t(dot)ee>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 14:24:17
Message-ID: 87br2x11wu.fsf@stark.xeocode.com
Lists: pgsql-hackers

Marko Kreen <marko(at)l-t(dot)ee> writes:

> On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> > The second reason that the futex patch is helping is that when
> > a spinlock delay does occur, it allows the delaying process to be
> > awoken almost immediately, rather than delaying 10 msec or more
> > as the existing code does. However, given that we are only expecting
> > the spinlock to be held for a couple dozen instructions, using the
> > kernel futex mechanism is huge overkill --- the in-kernel overhead
> > to manage the futex state is almost certainly several orders of
> > magnitude more than the delay we actually want.
>
> Why do you think so? AFAIK on uncontented case there will be no
> kernel access, only atomic inc/dec. On contented case you'll
> want task switch anyway, so the futex managing should not
> matter. Also this mechanism is specifically optimized for
> inter-process locking, I don't think you can get more efficient
> mechanism from side-effects from generic syscalls.
>
> If you don't want Linux-specific locking in core code, then
> it's another matter...

How does the futex using code compare with the new patches on this new
benchmark test?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 14:39:38
Message-ID: 23028.1126622378@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Marko Kreen <marko(at)l-t(dot)ee> writes:
> (I speculate that it's set up to only yield the processor to other
> processes already affiliated to that processor. In any case, it
> is definitely capable of getting through 10000 yields without
> running the guy who's holding the spinlock.)

> Maybe it should try sched_yield once and then use select after that?

I tried that, actually, but it didn't seem to offer any particular
benefit. (Maybe it would have helped more on older Linux kernels before
they changed sched_yield?)

I'm feeling even more disenchanted with sched_yield now that Marko
pointed out that the behavior was changed recently. Here we have a
construct that is not portable cross-platform, does not act as
documented in its man page, and the kernel guys feel free to whack its
behavior around in major ways without documenting that either. It seems
to be a crap-shoot whether it will be useful on any particular machine
or not. At least with the select() code we can be reasonably confident
we know what will happen.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 15:37:24
Message-ID: 8764t50yiz.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> > On contented case you'll want task switch anyway, so the futex
> > managing should not matter.
>
> No, we DON'T want a task switch. That's the entire point: in a
> multiprocessor, it's a good bet that the spinlock is held by a task
> running on another processor, and doing a task switch will take orders
> of magnitude longer than just spinning until the lock is released.
> You should yield only after spinning long enough to make it a strong
> probability that the spinlock is held by a process that's lost the
> CPU and needs to be rescheduled.

Does the futex code make any attempt to record the CPU of the process grabbing
the lock? Clearly it wouldn't be a guarantee of anything but if it's only used
for short-lived spinlocks while acquiring longer lived locks then maybe?

> No; that page still says specifically "So a process calling
> sched_yield() now must wait until all other runnable processes in the
> system have used up their time slices before it will get the processor
> again." I can prove that that is NOT what happens, at least not on
> a multi-CPU Opteron with current FC4 kernel. However, if the newer
> kernels penalize a process calling sched_yield as heavily as this page
> claims, then it's not what we want anyway ...

Well it would be no worse than select or any other random i/o syscall.

It seems to me what you've found is an outright bug in the linux scheduler.
Perhaps posting it to linux-kernel would be worthwhile.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 16:11:03
Message-ID: 200509130911.03677.josh@agliodbs.com
Lists: pgsql-hackers

Tom, All:

> It seems to me what you've found is an outright bug in the linux scheduler.
> Perhaps posting it to linux-kernel would be worthwhile.

For people using this on Linux 2.6, which scheduler are you using? Deadline
is the recommended one for databases, and does offer significant (+5-8%)
benefits in some heavy-contention environments, at least in OSDL tests.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Douglas McNaught <doug(at)mcnaught(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 16:21:45
Message-ID: m2wtll9bvq.fsf@Douglas-McNaughts-Powerbook.local
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:

> Tom, All:
>
>> It seems to me what you've found is an outright bug in the linux scheduler.
>> Perhaps posting it to linux-kernel would be worthwhile.
>
> For people using this on Linux 2.6, which scheduler are you using? Deadline
> is the recommended one for databases, and does offer significant (+5-8%)
> benefits in some heavy-contention environments, at least in OSDL tests.

I thought 'deadline' was an I/O scheduler, not a CPU scheduler?

-Doug


From: Douglas McNaught <doug(at)mcnaught(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 16:24:04
Message-ID: m2slw99brv.fsf@Douglas-McNaughts-Powerbook.local
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> No; that page still says specifically "So a process calling
>> sched_yield() now must wait until all other runnable processes in the
>> system have used up their time slices before it will get the processor
>> again." I can prove that that is NOT what happens, at least not on
>> a multi-CPU Opteron with current FC4 kernel. However, if the newer
>> kernels penalize a process calling sched_yield as heavily as this page
>> claims, then it's not what we want anyway ...
>
> Well it would be no worse than select or any other random i/o syscall.
>
> It seems to me what you've found is an outright bug in the linux scheduler.
> Perhaps posting it to linux-kernel would be worthwhile.

People have complained on l-k several times about the 2.6
sched_yield() behavior; the response has basically been "if you rely
on any particular sched_yield() behavior for synchronization, your app
is broken--it's not a synchronization primitive."

-Doug


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Douglas McNaught <doug(at)mcnaught(dot)org>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 17:11:12
Message-ID: 87zmqgzydr.fsf@stark.xeocode.com
Lists: pgsql-hackers

Douglas McNaught <doug(at)mcnaught(dot)org> writes:

> > It seems to me what you've found is an outright bug in the linux scheduler.
> > Perhaps posting it to linux-kernel would be worthwhile.
>
> People have complained on l-k several times about the 2.6
> sched_yield() behavior; the response has basically been "if you rely
> on any particular sched_yield() behavior for synchronization, your app
> is broken--it's not a synchronization primitive."

They're not talking about this. They're talking about applications that spin
on sched_yield() and expect it to reduce cpu load as if the process were
calling sleep().

What Tom found was that some processes are never scheduled when sched_yield is
called. There's no reason that should be happening.

--
greg


From: Douglas McNaught <doug(at)mcnaught(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 17:21:32
Message-ID: m2mzmganoj.fsf@Douglas-McNaughts-Powerbook.local
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:

> What Tom found was that some processes are never scheduled when sched_yield is
> called. There's no reason that should be happening.

Yeah, that would probably be a bug...

-Doug


From: Mark Wong <markw(at)osdl(dot)org>
To: Douglas McNaught <doug(at)mcnaught(dot)org>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 17:32:42
Message-ID: 200509131733.j8DHX1Bo003723@smtp.osdl.org
Lists: pgsql-hackers

On Tue, 13 Sep 2005 12:21:45 -0400
Douglas McNaught <doug(at)mcnaught(dot)org> wrote:

> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>
> > Tom, All:
> >
> >> It seems to me what you've found is an outright bug in the linux scheduler.
> >> Perhaps posting it to linux-kernel would be worthwhile.
> >
> > For people using this on Linux 2.6, which scheduler are you using? Deadline
> > is the recommended one for databases, and does offer significant (+5-8%)
> > benefits in some heavy-contention environments, at least in OSDL tests.
>
> I thought 'deadline' was an I/O scheduler, not a CPU scheduler?

Yes, that's correct. That's an i/o elevator algorithm.

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Douglas McNaught <doug(at)mcnaught(dot)org>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 17:38:08
Message-ID: 25267.1126633088@sss.pgh.pa.us
Lists: pgsql-hackers

Douglas McNaught <doug(at)mcnaught(dot)org> writes:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
>> What Tom found was that some processes are never scheduled when sched_yield is
>> called. There's no reason that should be happening.

> Yeah, that would probably be a bug...

I suspect the kernel hackers might consider it deliberate in a
NUMA-aware kernel.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 18:34:31
Message-ID: 20050913183431.GK6026@ns.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> I'm feeling even more disenchanted with sched_yield now that Marko
> pointed out that the behavior was changed recently. Here we have a

To be fair, I'm not entirely sure 'recently' is quite the right word.
It sounds like it changed during the 2.5 development cycle, which was
between major kernel releases, and it looks like it was around Dec 2003.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Michael Paesold" <mpaesold(at)gmx(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org, "Stephen Frost" <sfrost(at)snowman(dot)net>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 19:40:48
Message-ID: 26291.1126640448@sss.pgh.pa.us
Lists: pgsql-hackers

"Michael Paesold" <mpaesold(at)gmx(dot)at> writes:
> To have other data, I have retested the patches on a single-cpu Intel P4
> 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4 dual-Xeon
> results it's clear that this is in reality only one cpu. While the runtime
> for N=1 is better than the other system, for N=4 it's already worse. The
> situation with the patches is quite different, though. Unfortunatly.

> CVS tip from 2005-09-12:
> 1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%)

> only slock-no-cmpb:
> 1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%)
> (doesn't help this time)

Hm. This is the first configuration we've seen in which slock-no-cmpb
was a loss. Could you double-check that result?

> Is there a portable way to detect the CPU we are running on? Do you have any
> other idea how to implement the delays?

I can't see any reasonable way to do runtime switching of the cmpb test
--- whatever logic we put in to control it would cost as much or more
than the cmpb anyway :-(. I think that has to be a compile-time choice.
From my perspective it'd be acceptable to remove the cmpb only for
x86_64, since only there does it seem to be a really significant win.
On the other hand it seems that removing the cmpb is a net win on most
x86 setups too, so maybe we should just do it and accept that there are
some cases where it's not perfect. Looking back at the original
discussion, I see that the cmpb was added with little discussion and no
specific performance testing, so it shouldn't be presumed to have a lot
of weight behind it:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00352.php

The other thing that's going on in the patch is dynamic adjustment of
spins_per_delay, and that could certainly be made much smarter than it
is now. For instance, it'd cost almost nothing to make the lower and
upper bounds for spins_per_delay be variables instead of constants.
We could set the bounds during postmaster startup based on some criteria
yet to be determined.
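
Something as simple as this sketch would do (names here are hypothetical,
not taken from the patch):

    /* Bounds become variables instead of #defines. */
    static int  min_spins_per_delay = 10;
    static int  max_spins_per_delay = 1000;

    /*
     * Hypothetical hook: called once during postmaster startup, using
     * whatever criteria we settle on (e.g. detected number of CPUs).
     */
    void
    set_spins_per_delay_bounds(int min_spins, int max_spins)
    {
        min_spins_per_delay = min_spins;
        max_spins_per_delay = max_spins;
    }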

regards, tom lane


From: Marko Kreen <marko(at)l-t(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 20:46:11
Message-ID: 20050913204611.GA24295@l-t.ee
Lists: pgsql-hackers

On Tue, Sep 13, 2005 at 10:10:13AM -0400, Tom Lane wrote:
> Marko Kreen <marko(at)l-t(dot)ee> writes:
> > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> >> However, given that we are only expecting
> >> the spinlock to be held for a couple dozen instructions, using the
> >> kernel futex mechanism is huge overkill --- the in-kernel overhead
> >> to manage the futex state is almost certainly several orders of
> >> magnitude more than the delay we actually want.
>
> > Why do you think so? AFAIK on uncontented case there will be no
> > kernel access, only atomic inc/dec.
>
> In the uncontended case, we never even enter s_lock() and so the entire
> mechanism of yielding is irrelevant. The problem that's being exposed
> by these test cases is that on multiprocessors, you can see a
> significant rate of spinlock contention (order of 100 events/second,
> which is still a tiny fraction of the number of TAS calls) and our
> existing mechanism for dealing with contention is just not efficient
> enough.
>
> > On contented case you'll want task switch anyway, so the futex
> > managing should not matter.
>
> No, we DON'T want a task switch. That's the entire point: in a
> multiprocessor, it's a good bet that the spinlock is held by a task
> running on another processor, and doing a task switch will take orders
> of magnitude longer than just spinning until the lock is released.
> You should yield only after spinning long enough to make it a strong
> probability that the spinlock is held by a process that's lost the
> CPU and needs to be rescheduled.


Hmm. I guess this could be separated into 2 cases:

1. Light load - neither the lock owner nor the lock requester gets
rescheduled while busy (owner in critical section, waiter
spinning).
2. Big load - either or both of them gets rescheduled while busy
(the waiter is rescheduled by the OS, or voluntarily by e.g. calling select()).

So my impression is that currently you optimize for #1 at the
expense of #2, while with futexes you'd optimize for #2 at
the expense of #1. Additionally, I'm pretty convinced that
futexes give you the most efficient implementation for #2, as the kernel
knows which processes are waiting on a particular lock, so it can
make the best scheduling decisions.

> > If you don't want Linux-specific locking in core code, then
> > it's another matter...
>
> Well, it's true, we don't particularly want a one-platform solution,
> but if it did what we wanted we might hold our noses and use it anyway.
>
> (I think, BTW, that using futexes at the spinlock level is misguided;
> what would be interesting would be to see if we could substitute for
> both LWLock and spinlock logic with one futex-based module.)

Use pthreads ;)

> >> I also saw fairly frequent "stuck spinlock" panics when running
> >> more queries than there were processors --- this despite increasing
> >> NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
> >> anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
> >> (I speculate that it's set up to only yield the processor to other
> >> processes already affiliated to that processor. In any case, it
> >> is definitely capable of getting through 10000 yields without
> >> running the guy who's holding the spinlock.)
>
> > This is intended behaviour of sched_yield.
>
> > http://lwn.net/Articles/31462/
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2
>
> No; that page still says specifically "So a process calling
> sched_yield() now must wait until all other runnable processes in the
> system have used up their time slices before it will get the processor
> again." I can prove that that is NOT what happens, at least not on
> a multi-CPU Opteron with current FC4 kernel. However, if the newer
> kernels penalize a process calling sched_yield as heavily as this page
> claims, then it's not what we want anyway ...

My fault. When I saw that there was a problem with sched_yield, I
said "I bet this is because of a behaviour change" and only
skimmed the exact details. But the point that sched_yield is
not meant for such usage still stands.

About the quick return from yielding, the comment on sys_sched_yield() says:

* sys_sched_yield - yield the current processor to other threads.
*
* this function yields the current CPU by moving the calling thread
* to the expired array. If there are no other threads running on this
* CPU then this function will return.

So there just is nothing else to schedule on that CPU.

--
marko


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <marko(at)l-t(dot)ee>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 21:18:23
Message-ID: 26803.1126646303@sss.pgh.pa.us
Lists: pgsql-hackers

Marko Kreen <marko(at)l-t(dot)ee> writes:
> Hmm. I guess this could be separated into 2 cases:

> 1. Light load - both lock owner and lock requester wont get
> scheduled while busy (owner in critical section, waiter
> spinning.)
> 2. Big load - either or both of them gets scheduled while busy.
> (waiter is scheduled by OS or voluntarily by eg. calling select())

Don't forget that the coding rules for our spinlocks say that you
mustn't hold any such lock for more than a couple dozen instructions,
and certainly any kernel call while holding the lock is Right Out.
There is *no* case where the holder of a spinlock is going to
voluntarily give up the CPU. The design intention was that the
odds of losing the CPU while holding a spinlock would be negligibly
small, simply because we don't hold it very long.

> About fast yielding, comment on sys_sched_yield() says:
> * sys_sched_yield - yield the current processor to other threads.
> *
> * this function yields the current CPU by moving the calling thread
> * to the expired array. If there are no other threads running on this
> * CPU then this function will return.

Mph. So that's pretty much exactly what I suspected...

I just had a thought: it seems that the reason we are seeing a
significant issue here is that on SMP machines, the cost of trading
exclusively-owned cache lines back and forth between processors is
so high that the TAS instructions (specifically the xchgb, in the x86
cases) represent a significant fraction of backend execution time all
by themselves. (We know this is the case due to oprofile results,
see discussions from last April.) What that means is that there's a
fair chance of a process losing its timeslice immediately after the
xchgb. Which is precisely the scenario we do not want, if the process
successfully acquired the spinlock by means of the xchgb.

We could ameliorate this if there were a way to acquire ownership of the
cache line without necessarily winning the spinlock. I'm imagining
that we insert a "dummy" locked instruction just ahead of the xchgb,
which touches the spinlock in such a way as to not change its state.
(xchgb won't do for this, but maybe one of the other lockable
instructions will.) We do the xchgb just after this one. The idea is
that if we don't own the cache line, the first instruction causes it to
be faulted into the processor's cache, and if our timeslice expires
while that is happening, we lose the processor without having acquired
the spinlock. This assumes that once we've got the cache line, the
xchgb that actually does the work can get executed with not much
extra time spent and only low probability of someone else stealing the
cache line back first.

The fact that cmpb isn't helping proves that getting the cache line in a
read-only fashion does *not* do enough to protect the xchgb in this way.
But maybe another locking instruction would. Comments?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <marko(at)l-t(dot)ee>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 22:02:52
Message-ID: 27015.1126648972@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> We could ameliorate this if there were a way to acquire ownership of the
> cache line without necessarily winning the spinlock.

Another thought came to mind: maybe the current data layout for LWLocks
is bad. Right now, the spinlock that protects each LWLock data struct
is itself part of the struct, and since the structs aren't large (circa
20 bytes), the whole thing is usually all in the same cache line. This
had seemed like a good idea at the time, on the theory that once you'd
obtained the spinlock you'd also have pulled in the LWLock contents.
But that's only optimal under an assumption of low contention. If
there's high contention for the spinlock, then another processor
spinning on the lock will be continuously taking away ownership of the
cache line and thus slowing down the guy who's got the lock and is
trying to examine/update the LWLock contents.

Maybe it'd be better to allocate the spinlocks off by themselves.
Then, spinning processors would not affect the processor that's updating
the LWLock; only when it finishes doing that and needs to clear the
spinlock will it have to contend with the spinners for the cache line
containing the spinlock.

This would add an instruction or so to LWLockAcquire and LWLockRelease,
and would be of no benefit on uniprocessors, but it might be worth doing
for multiprocessors. Another patch to test ...
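
For concreteness, here is a rough sketch of the two layouts being compared.
The field names only approximate the real lwlock.c struct, and slock_t and
PGPROC are stubbed out so the fragment stands alone:

#include <stdbool.h>

/* Illustrative stand-ins, not the real PostgreSQL declarations */
typedef unsigned char slock_t;
struct PGPROC;                          /* wait-queue entries */

/* Current layout: the spinlock lives inside the LWLock, so spinning
 * waiters and the lock holder contend for the same cache line. */
typedef struct LWLockEmbedded
{
    slock_t     mutex;                  /* protects the fields below */
    bool        releaseOK;
    char        exclusive;
    int         shared;
    struct PGPROC *head;
    struct PGPROC *tail;
} LWLockEmbedded;

/* Proposed layout: each LWLock merely points at its spinlock, which is
 * allocated in a separate array, so spinning on the mutex no longer
 * bounces the cache line holding the LWLock state itself. */
typedef struct LWLockSeparated
{
    slock_t    *mutex;                  /* points into a separate spinlock array */
    bool        releaseOK;
    char        exclusive;
    int         shared;
    struct PGPROC *head;
    struct PGPROC *tail;
} LWLockSeparated;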

I'm starting to think that we might have to succumb to having a compile
option "optimize for multiprocessor" or "optimize for single processor".
It's pretty hard to see how we'd alter a data structure decision like
this on the fly.

regards, tom lane


From: "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 23:32:02
Message-ID: 20050913233155.GJ5161@cs.wisc.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 13 Sep 2005 Tom Lane wrote :
> I wrote:
> > We could ameliorate this if there were a way to acquire ownership of the
> > cache line without necessarily winning the spinlock.
>
> Another thought came to mind: maybe the current data layout for LWLocks
> is bad. Right now, the spinlock that protects each LWLock data struct
> is itself part of the struct, and since the structs aren't large (circa
> 20 bytes), the whole thing is usually all in the same cache line. This
> had seemed like a good idea at the time, on the theory that once you'd
> obtained the spinlock you'd also have pulled in the LWLock contents.
> But that's only optimal under an assumption of low contention. If
> there's high contention for the spinlock, then another processor
> spinning on the lock will be continuously taking away ownership of the
> cache line and thus slowing down the guy who's got the lock and is
> trying to examine/update the LWLock contents.

If this were the case, perhaps fetching the spin lock with read-only
permission first should have helped. Modern processors have store buffers
so that a store miss doesn't slow down the processor. Therefore, if
processor A holds the spin lock and is examining and updating the LWLock
on the same cache line, then as long as processor B doesn't try to write
that cache line, processor A won't be slowed down. What would happen is
that the multiple writes coalesce in the store buffer and get flushed to
memory when processor A regains write permission.

I still think your first scenario is possible: that just when processor B
finally gets the lock, its time slice runs out. This sounds to me like it
is likely to cause a chain of context switches.

> Maybe it'd be better to allocate the spinlocks off by themselves.
> Then, spinning processors would not affect the processor that's updating
> the LWLock; only when it finishes doing that and needs to clear the
> spinlock will it have to contend with the spinners for the cache line
> containing the spinlock.
>
> This would add an instruction or so to LWLockAcquire and LWLockRelease,
> and would be of no benefit on uniprocessors, but it might be worth doing
> for multiprocessors. Another patch to test ...
>
> I'm starting to think that we might have to succumb to having a compile
> option "optimize for multiprocessor" or "optimize for single processor".
> It's pretty hard to see how we'd alter a data structure decision like
> this on the fly.
>
> regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 23:49:01
Message-ID: 20050913234901.GN6026@ns.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> I'm starting to think that we might have to succumb to having a compile
> option "optimize for multiprocessor" or "optimize for single processor".
> It's pretty hard to see how we'd alter a data structure decision like
> this on the fly.

I'd really hate to see this happen. In this case I don't think the
change you're proposing would have much of an impact on a uniprocessor
machine. Having separate compile-time options for uniprocessor and
multiprocessor would open the gates for potentially other changes which
*would* have a more serious impact on one or the other when compiled for
the opposite. I think this would be a serious problem for binary
distributions and correspondingly their users.

Happy to test these changes, btw.

Also, I'm redoing my tests from the other patches with all the various
combinations; I'm a bit concerned that the recompiles I did (which were
just 'make', not 'make clean', 'make') for them previously didn't
actually recompile everything necessary. Sorry if the results differ.
Hope to have that all done this evening.

Thanks,

Stephen


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 00:05:59
Message-ID: Pine.LNX.4.58.0509140956450.2780@linuxworld.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 13 Sep 2005, Stephen Frost wrote:

> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> > I'm starting to think that we might have to succumb to having a compile
> > option "optimize for multiprocessor" or "optimize for single processor".
> > It's pretty hard to see how we'd alter a data structure decision like
> > this on the fly.
>
> I'd really hate to see this happen. In this case I don't think the
> change you're proposing would have much of an impact on a uniprocessor
> machine. Having separate compile-time options for uniprocessor and
> multiprocessor would open the gates for potentially other changes which
> *would* have a more serious impact on one or the other when compiled for
> the opposite. I think this would be a serious problem for binary
> distributions and correspondingly their users.

It does make it painful for distribution/package maintainers but I think
the potential benefits for single/multi-CPU architectures are high. It
means that our lock intrinsic on uniprocessors can just be a lock/delay
loop without any spinning -- which is a complete waste of time if you only
have one CPU.

Doing this at runtime involves some pretty significant uglification of
low-level code, I think.

Gavin


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 00:14:47
Message-ID: 20050914001447.GO6026@ns.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Gavin Sherry (swm(at)linuxworld(dot)com(dot)au) wrote:
> It does make it painful for distribution/package maintainers but I think
> the potential benefits for single/multi-CPU architectures are high. It
> means that our lock intrinsic on uniprocessors can just be a lock/delay
> loop without any spinning -- which is a complete waste of time if you only
> have one CPU.
>
> Doing this at runtime involves some pretty significant uglification of
> low-level code, I think.

I suspect distributors would go for the multi-cpu setup (especially if
a uniprocessor build is *broken* for multiprocessor) and then in a
lot of cases you end up not actually getting any benefit. I'm afraid
you'd also end up having to tell a lot of people who complain to
recompile, who will then complain back to their distributors, etc.

I suppose another option would be to provide separate packages... Could
this be done as a shared library so it's more 'plug-and-play' to switch
between the two? I dunno, just trying to think about how to deal with
this without making it suck terribly bad for me (as a Debian user and
maintainer).

Certainly makes me wish there was a way to do it which made the kernel
handle the uniprocessor/multiprocessor question and did locking as
appropriate for those. Ah, well.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 02:42:59
Message-ID: 28419.1126665779@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu> writes:
> ...If this were the case, perhaps fetching the spin lock with read-only
> permission first should have helped.

But the cmpb instruction in the 8.0 version of TAS would have done that,
and I think we've already established that the cmpb is a loss on most
machines (except maybe single-physical-CPU Xeons). I suggested in my
other message that it might help to grab write permission on the cache
line before actually trying to acquire the spin lock --- but I don't see
how getting read-only permission first will help.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 02:54:59
Message-ID: 28489.1126666499@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> I suspect distributors would go for the multi-cpu setup (especially if
> a uniprocessor build is *broken* for multiprocessor) and then in a
> lot of cases you end up not actually getting any benefit. I'm afraid
> you'd also end up having to tell a lot of people who complain to
> recompile, who will then complain back to their distributors, etc.

Yeah. Being in charge of Red Hat's packaging of PG, I feel that pain as
keenly as anybody ... and I *know* RH will not be interested in shipping
two different packages. If we go this way, the RH distributions will
use the --optimize-multi switch, because that's where the money is.

The bottom line here is that we will have to make some compromises:
if we want one-size-fits-all code, it will not be optimal for every
single architecture. If we don't do one-size-fits-all, then we will
pay for it in various other ways.

regards, tom lane


From: "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 02:55:38
Message-ID: 20050914025533.GK5161@cs.wisc.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 13 Sep 2005 Tom Lane wrote :
> "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu> writes:
> > ...If this were the case, perhaps fetching the spin lock with read-only
> > permission first should have helped.
>
> But the cmpb instruction in the 8.0 version of TAS would have done that,
> and I think we've already established that the cmpb is a loss on most
> machines (except maybe single-physical-CPU Xeons). I suggested in my
> other message that it might help to grab write permission on the cache
> line before actually trying to acquire the spin lock --- but I don't see
> how getting read-only permission first will help.

Yes, I agree. What I was trying to say was that if the second scenario
you hypothesized were true --- i.e., that fetching a writeable line before
actually trying to acquire the spin lock would cause another processor to
slow down its execution inside the critical section --- then fetching
read-only lines would have helped. As you said, however, the experimental
results show that fetching read-only lines didn't help, which led me to
wonder whether the second scenario you described was really happening.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 03:05:26
Message-ID: 28568.1126667126@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu> writes:
> ... As you said, however, the experimental results
> show that fetching read-only lines didn't help, which led me to wonder
> whether the second scenario you described was really happening.

I don't know --- we haven't tried it. I do intend to work up some
patches for these ideas, and then we can test them.

Or, if you happen to be wide awake at the moment, feel free to take a
shot at it ... I'm out of steam and won't get to it before tomorrow...

regards, tom lane


From: "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 03:22:44
Message-ID: 20050914032243.GM5161@cs.wisc.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 13 Sep 2005 Tom Lane wrote :
> "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu> writes:
> > ... As you said, however, the experimental results
> > show that fetching read-only lines didn't help, which led me to wonder
> > whether the second scenario you described was really happening.
>
> I don't know --- we haven't tried it. I do intend to work up some
> patches for these ideas, and then we can test them.

I see. I look forward to it.

> Or, if you happen to be wide awake at the moment, feel free to take a
> shot at it ... I'm out of steam and won't get to it before tomorrow...

I am a newbie here. Please don't let my comments discourage you from
doing anything. I very much enjoyed following your discussion and analysis
of this problem.


From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 07:15:54
Message-ID: 017501c5b8fc$25c4efe0$0f01a8c0@zaphod
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> But the cmpb instruction in the 8.0 version of TAS would have done that,
> and I think we've already established that the cmpb is a loss on most
> machines (except maybe single-physical-CPU Xeons).

Note that this was a regular Pentium 4 system, not a Xeon.

Best Regards,
Michael Paesold


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 07:30:05
Message-ID: 87d5ncm7ia.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:

> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> > I'm starting to think that we might have to succumb to having a compile
> > option "optimize for multiprocessor" or "optimize for single processor".
> > It's pretty hard to see how we'd alter a data structure decision like
> > this on the fly.
>
> I'd really hate to see this happen.

I suspect you're thinking too hard about this. It's not like some LWLocks
would need SMP spinlocks and others uniprocessor locks. And it's not like it's
going to change dynamically.

Wouldn't it just be a couple:

if (multiprocessor) {
    complex version
} else {
    simple version
}

Ugly to be sure but it's not going to spread to other areas of the source base
and you know the if statement isn't going to be growing any more arms.

It's no uglier than doing the same thing with an #ifdef anyways.
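
A minimal sketch of that shape, assuming a flag set once at startup and the
usual tas() primitive; the names here are illustrative, not anything in the
tree:

#include <stdbool.h>
#include <unistd.h>

typedef unsigned char slock_t;          /* stand-in for the real slock_t */
extern int tas(volatile slock_t *lock); /* nonzero if the lock was already held */

static bool IsMultiProcessor;           /* hypothetical flag, detected at startup */

static void
spin_acquire(volatile slock_t *lock)
{
    while (tas(lock))
    {
        if (IsMultiProcessor)
            continue;                   /* SMP: keep spinning (real code would back off) */
        usleep(1000);                   /* uniprocessor: spinning cannot help, just sleep */
    }
}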

If you really really want to have two versions and you think using function
pointers would impose too big a performance penalty you could play linker
games. But that way lies madness.

--
greg


From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Stephen Frost" <sfrost(at)snowman(dot)net>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 07:41:46
Message-ID: 017b01c5b8ff$c3280bc0$0f01a8c0@zaphod
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> "Michael Paesold" <mpaesold(at)gmx(dot)at> writes:
>> To have other data, I have retested the patches on a single-cpu Intel P4
>> 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4
>> dual-Xeon
>> results it's clear that this is in reality only one cpu. While the
>> runtime
>> for N=1 is better than the other system, for N=4 it's already worse. The
>> situation with the patches is quite different, though. Unfortunatly.
>
>> CVS tip from 2005-09-12:
>> 1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%)
>
>> only slock-no-cmpb:
>> 1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%)
>> (doesn't help this time)
>
> Hm. This is the first configuration we've seen in which slock-no-cmpb
> was a loss. Could you double-check that result?

The first tests were compiled with
CFLAGS='-O2 -mcpu=pentium4 -march=pentium4'. I had redone the tests with
just CFLAGS='-O2' yesterday. The difference was just about a second, but the
result from the patch was the same. The results for N=4 and N=8 show the
effect more clearly.

configure: CFLAGS='-O2' --enable-casserts
On RHEL 4.1, gcc (GCC) 3.4.3 20050227 (Red Hat 3.4.3-22.1)

CVS tip from 2005-09-12:
1: 37s 2: 78s 4: 159s 8: 324s

only slock-no-cmpb:
1: 37s 2: 82s (5%) 4: 178s (12%) 8: 362s (12%)

configure: --enable-casserts

(Btw. I have always done "make clean ; make ; make install" between tests)

Best Regards,
Michael Paesold

> I can't see any reasonable way to do runtime switching of the cmpb test
> --- whatever logic we put in to control it would cost as much or more
> than the cmpb anyway :-(. I think that has to be a compile-time choice.
> From my perspective it'd be acceptable to remove the cmpb only for
> x86_64, since only there does it seem to be a really significant win.
> On the other hand it seems that removing the cmpb is a net win on most
> x86 setups too, so maybe we should just do it and accept that there are
> some cases where it's not perfect.

How many test cases do we have so far?
The summary of effects without the cmpb instruction seems to be:

8-way Opteron: better
Dual/HT Xeon w/o EM64T: better
Dual/HT EM64T: better for N<=cpus, worse for N>cpus (Stephen's)
HT P4 w/o EM64T: worse (stronger for N>cpus)

Have I missed other reports that tested the slock-no-cmpb.patch alone?
Two of the systems with positive effects are x86_64; one is an older
high-end Intel x86 chip. The negative effect is on a low-cost Pentium 4 with
only hyperthreading. According to the mentioned thread's title, this was an
optimization for hyperthreading, not regular multi-CPUs.

We could use more data, especially on newer and high-end systems. Could
some of you test the slock-no-cmpb.patch? You'll need an otherwise idle
system to get repeatable results.

http://archives.postgresql.org/pgsql-hackers/2005-09/msg00565.php
http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php

I have re-attached the relevant files from Tom's posts because in the
archive it's not clear anymore what should go into which file. See
instructions in the first messages above.

The patch applies to CVS tip with
patch -p1 < slock-no-cmpb.patch

Best Regards,
Michael Paesold

Attachment Content-Type Size
slock-no-cmpb.patch application/octet-stream 869 bytes
test_setup.sql application/octet-stream 1.1 KB
test_run_small.sql application/octet-stream 205 bytes
startn.sh application/octet-stream 145 bytes

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 08:25:09
Message-ID: 20050914082504.GA29066@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 13, 2005 at 08:14:47PM -0400, Stephen Frost wrote:
> I suppose another option would be to provide seperate packages... Could
> this be done as a shared library so it's more 'plug-and-play' to switch
> between the two? I dunno, just trying to think about how to deal with
> this without making it suck terribly bad for me (as a Debian user and
> maintainer).

Note that the Linux kernel has played with moving spinlock code out of
line. Because there are so many call sites, the memory saved by moving
the code out of line turned out to be an overall benefit. An
unconditional call to a function can't be that expensive, surely.

However, it would *have* to be in the same compile unit --- no shared
libraries. ELF imposes some overhead for calls between compile units,
and using function pointers won't save you (see Ulrich Drepper's paper
on it).

However, to make it flexible you would need a pointer to a function
pointer. Fortunately these variables won't change often, so the function
pointer and the function itself should be in the cache if used often
enough.

Finally, the kernel solves the problem by saying, if you compile for
uniprocessor, optimize the code away, since a lot of the issues don't
apply. My main concern is how you even detect the number of processors
in a portable way...
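
One possibility, sketched under the assumption that sysconf() with
_SC_NPROCESSORS_ONLN is available (it is on Linux and Solaris, but it is a
common extension rather than strict POSIX):

#include <stdbool.h>
#include <unistd.h>

/* Illustrative only: report whether more than one CPU is online.
 * A real implementation would need per-platform fallbacks. */
static bool
detect_multiprocessor(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    long        ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    return ncpus > 1;
#else
    return true;                        /* assume SMP when we cannot tell */
#endif
}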

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 16:19:35
Message-ID: 20050914161935.GR6026@ns.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom, et al.,

Updated, with full recompiles between everything and the new
modification:

N, runtime:
Tip: 1 31s 2 37s 4 86s 8 159s
no-cmpb: 1 32s 2 43s 4 83s 8 168s
spin: 1 32s 2 51s 4 84s 8 160s
spin+mod: 1 32s 2 51s 4 89s 8 158s
spin+no-cmpb: 1 32s 2 51s 4 87s 8 163s
spin+mod+no-cmpb: 1 32s 2 50s 4 86s 8 161s

Unfortunately, the results don't seem to be terribly consistent between
runs anyway:

Run 2:
Tip: 1 32s 2 43s 4 87s 8 160s
no-cmpb: 1 31s 2 47s 4 83s 8 167s
spin: 1 32s 2 52s 4 88s 8 154s
spin+no-cmpb: 1 32s 2 51s 4 102s 8 166s
spin+mod: 1 32s 2 53s 4 85s 8 154s
spin+mod+no-cmpb: 1 32s 2 51s 4 91s 8 161s

Hope it helps,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <marko(at)l-t(dot)ee>
Cc: pgsql-hackers(at)postgresql(dot)org, "Michael Paesold" <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 16:58:21
Message-ID: 5740.1126717101@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> We could ameliorate this if there were a way to acquire ownership of the
> cache line without necessarily winning the spinlock. I'm imagining
> that we insert a "dummy" locked instruction just ahead of the xchgb,
> which touches the spinlock in such a way as to not change its state.

I tried this, using this tas code:

static __inline__ int
tas(volatile slock_t *lock)
{
register slock_t _res = 1;
register slock_t _dummy = 0;

/* Use a locking test before trying to take the spinlock */
/* xchg implies a LOCK prefix, so no need to say LOCK for it */
__asm__ __volatile__(
" lock \n"
" xaddb %2,%1 \n"
" xchgb %0,%1 \n"
: "+q"(_res), "+m"(*lock), "+q"(_dummy)
:
: "memory", "cc");
return (int) _res;
}

At least on Opteron, it's a loser. The previous best results (with
slock-no-cmpb and spin-delay patches) were
1 31s 2 42s 4 51s 8 100s
and with this instead of slock-no-cmpb,
1 33s 2 45s 4 55s 8 104s

The xadd may indeed be helping in terms of protecting the xchg from
end-of-timeslice --- the rate of select() delays is really tiny, one
every few seconds, which is better than I saw before. But the extra
cost of the extra locked operation isn't getting repaid overall.
Feel free to try it on other hardware, but it doesn't look promising.

BTW, I also determined that on that 4-way Opteron box, the integer
modulo idea doesn't make any difference --- that is, spin-delay and
what Michael called spin-delay-2 are the same speed. I think I had
tried the modulo before adding the variable spin delay, and it did
help in that configuration; but most likely, it was just helping by
stretching out the amount of time spent looping before entering the
kernel. So we can drop that idea too.
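
For reference, a rough sketch of the kind of variable spin delay being
discussed; the constants and structure here are illustrative, not the
actual spin-delay patch:

#include <sys/select.h>

typedef unsigned char slock_t;          /* stand-in for the real slock_t */
extern int tas(volatile slock_t *lock); /* nonzero if the lock was already held */

#define SPINS_PER_DELAY   100
#define MIN_DELAY_USEC   1000
#define MAX_DELAY_USEC 100000

/* Spin a bounded number of times, then sleep for an increasing interval
 * rather than a fixed 10 msec, so a waiter that keeps losing backs off. */
static void
s_lock_sketch(volatile slock_t *lock)
{
    int         spins = 0;
    int         delay_usec = MIN_DELAY_USEC;

    while (tas(lock))
    {
        if (++spins >= SPINS_PER_DELAY)
        {
            struct timeval delay;

            delay.tv_sec = 0;
            delay.tv_usec = delay_usec;
            (void) select(0, NULL, NULL, NULL, &delay);

            if (delay_usec < MAX_DELAY_USEC)
                delay_usec *= 2;
            spins = 0;
        }
    }
}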

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <marko(at)l-t(dot)ee>
Cc: pgsql-hackers(at)postgresql(dot)org, "Michael Paesold" <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 17:32:43
Message-ID: 5962.1126719163@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Another thought came to mind: maybe the current data layout for LWLocks
> is bad. Right now, the spinlock that protects each LWLock data struct
> is itself part of the struct, and since the structs aren't large (circa
> 20 bytes), the whole thing is usually all in the same cache line. ...
> Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd. I made up a patch to do this (attached) and found
that it pretty much sucks. Still the 4-way Opteron, previous best
(slock-no-cmpb and spin-delay-2):
1 31s 2 42s 4 51s 8 100s
with lwlock-separate added:
1 31s 2 52s 4 106s 8 213s

What I had expected to see was a penalty in the single-thread case (due
to instructions added to the LWLock functions), but there isn't any.
I did not expect to see a factor-of-2 penalty for four threads.

I guess what this means is that there's no real problem with losing the
cache line while manipulating the LWLock, which is what the patch was
intended to prevent. Instead, we're paying for swapping two cache lines
(the spinlock and the LWLock) across processors instead of just one line.
But that should at worst be a 2x inflation of the time previously spent
in LWLockAcquire/Release, which is surely not yet all of the application
;-). Why the heck is this so bad? Should we expect that apparently
minor changes in shared data structures might be costing equivalently
huge penalties in SMP performance elsewhere?

Unfortunately I don't have root on the Opteron and can't run oprofile.
But I'd really like to see some oprofile stats from these two cases
so we can figure out what in the world is going on here. Can anyone
help?

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 8.5 KB

From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: "Marko Kreen" <marko(at)l-t(dot)ee>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-14 20:56:22
Message-ID: 002c01c5b96e$c3c95560$0f01a8c0@zaphod
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>I wrote:
>> We could ameliorate this if there were a way to acquire ownership of the
>> cache line without necessarily winning the spinlock. I'm imagining
>> that we insert a "dummy" locked instruction just ahead of the xchgb,
>> which touches the spinlock in such a way as to not change its state.
>
> I tried this, using this tas code:
...

I have tried it on the P4 w/ HT.

CVS tip 1: 37s 2: 78s 4: 159s 8: 324s
only slock-no-cmpb 1: 37s 2: 82s 4: 178s 8: 362s
only dummy-locking 1: 44s 2: 96s 4: 197s ...

I guess there is no reason to try any more.

Best Regards,
Michael Paesold


From: "Michael Paesold" <mpaesold(at)gmx(dot)at>
To: "Marko Kreen" <marko(at)l-t(dot)ee>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-15 21:05:19
Message-ID: 014801c5ba39$2e96bca0$0f01a8c0@zaphod
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent. Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-). Why the heck is this so bad? Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?
>
> Unfortunately I don't have root on the Opteron and can't run oprofile.
> But I'd really like to see some oprofile stats from these two cases
> so we can figure out what in the world is going on here. Can anyone
> help?

I will try the patch here and see if it gives the same result. If so I could
try to run with oprofile if you can give me a quick start.

Best Regards,
Michael Paesold


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, "Michael Paesold" <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-15 21:49:03
Message-ID: 16824.1126820943@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent. Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-). Why the heck is this so bad? Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?

I did some oprofile work and found that the cost seems to be because
(1) there's an increase in spinlock contention (more time spent in
s_lock), and (2) there's more time spent in LWLockAcquire/Release.
I'm not entirely clear about the reason for (1), but (2) apparently
is because of the extra cache line swapping, as posited above. In
the oprofile trace it's clear that the extra cost comes exactly in the
statements that touch fields of the shared LWLock, which are the places
where you might have to wait to acquire a cache line.

I thought for a bit that the problem might come from having chosen to
put a pointer to the spinlock into each LWLock; fetching that pointer
implies an additional access to the contended cache line. However,
changing the data structure to a simple linear array of spinlocks
didn't help. So that whole idea seems to have died a painful death.

One other interesting result is that the data structure change neither
helped nor hurt on an EM64T machine with 2 physical (4 logical)
processors. This is also x86_64, but evidently the caching behavior
is totally different from Opterons.

One thing that did seem to help a little bit was padding the LWLocks
to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring
the array starts on a 32-byte boundary. This ensures that we won't have
any LWLocks crossing cache lines --- contended access to such an LWLock
would probably incur the sort of large penalty seen above, because you'd
be trading two cache lines back and forth not one. It seems that the
important locks are not split that way in CVS tip, because the gain
wasn't much, but I wonder whether some effect like this might explain
some of the unexplainable performance changes we've noticed in the past
(eg, in the dbt2 results). A seemingly unrelated small change in the
size of other data structures in shared memory might move things around
enough to make a performance-critical lock cross a cache line boundary.

On regular x86, the LWLock struct is by chance exactly 16 bytes, so
there's no alignment issue. But 64-bit platforms and platforms where
slock_t is bigger than char are exposed to this risk.
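
A minimal sketch of the padding arrangement being described --- union
padding to a power-of-two size is the general technique; the exact
committed code may differ:

typedef unsigned char slock_t;          /* stand-in for the real slock_t */

typedef struct LWLock
{
    slock_t     mutex;
    /* ... remaining LWLock fields ... */
} LWLock;

#define LWLOCK_PADDED_SIZE 32           /* must be >= sizeof(LWLock) */

/* Pad each array element to a fixed power-of-two size so that no element
 * straddles a cache-line boundary, provided the array itself starts on an
 * LWLOCK_PADDED_SIZE boundary. */
typedef union LWLockPadded
{
    LWLock      lock;
    char        pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;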

I'm going to go ahead and make that change, since it doesn't seem likely
to have any downside. It might be interesting to look into forcing
proper alignment of the shared buffer headers as well.

regards, tom lane


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-15 23:56:01
Message-ID: Pine.LNX.4.58.0509160952580.22114@linuxworld.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 15 Sep 2005, Tom Lane wrote:

> One thing that did seem to help a little bit was padding the LWLocks
> to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring
> the array starts on a 32-byte boundary. This ensures that we won't have
> any LWLocks crossing cache lines --- contended access to such an LWLock
> would probably incur the sort of large penalty seen above, because you'd
> be trading two cache lines back and forth not one. It seems that the
> important locks are not split that way in CVS tip, because the gain
> wasn't much, but I wonder whether some effect like this might explain
> some of the unexplainable performance changes we've noticed in the past
> (eg, in the dbt2 results). A seemingly unrelated small change in the
> size of other data structures in shared memory might move things around
> enough to make a performance-critical lock cross a cache line boundary.

What about padding the LWLock to 64 bytes on these architectures. Both P4
and Opteron have 64 byte cache lines, IIRC. This would ensure that a
cacheline doesn't hold two LWLocks.

Gavin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 00:05:53
Message-ID: 18351.1126829153@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> What about padding the LWLock to 64 bytes on these architectures. Both P4
> and Opteron have 64 byte cache lines, IIRC. This would ensure that a
> cacheline doesn't hold two LWLocks.

I tried that first, actually, but it was a net loss. I guess enlarging
the array that much wastes too much cache space.

regards, tom lane


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 00:53:40
Message-ID: Pine.LNX.4.58.0509161049110.22373@linuxworld.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 15 Sep 2005, Tom Lane wrote:

> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> > What about padding the LWLock to 64 bytes on these architectures. Both P4
> > and Opteron have 64 byte cache lines, IIRC. This would ensure that a
> > cacheline doesn't hold two LWLocks.
>
> I tried that first, actually, but it was a net loss. I guess enlarging
> the array that much wastes too much cache space.

Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and
the cmpb/jump removed I get:

[swm(at)backup pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done

real 0m54.362s
user 0m0.003s
sys 0m0.009s

real 1m9.788s
user 0m0.011s
sys 0m0.013s

real 2m8.870s
user 0m0.016s
sys 0m0.028s
[swm(at)backup pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done

real 0m55.544s
user 0m0.006s
sys 0m0.007s

real 1m9.313s
user 0m0.007s
sys 0m0.018s

real 2m1.769s
user 0m0.017s
sys 0m0.027s

This compares to the following, which is unpadded but has cmpb/jump
removed but is otherwise vanilla:

1: 55s 2: 111s 4: 207s

The decrease is small, but it's there.

Gavin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: gmaxwell(at)gmail(dot)com
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 00:55:28
Message-ID: 18653.1126832128@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Maxwell <gmaxwell(at)gmail(dot)com> writes:
> might be useful to align the structure so it always crosses two lines
> and measure the performance difference.. the delta could be basically
> attributed to the cache line bouncing since even one additional bounce
> would overwhelm the other performance effects from the changed
> alignment.

Good idea. I goosed the struct declaration and setup code to arrange
that the BufMappingLock's spinlock and the rest of its data were in
different cache lines instead of the same one. The results (still
on Red Hat's 4-way Opteron):

previous best code (slock-no-cmpb and spin-delay-2):
1 31s 2 42s 4 51s 8 100s
with LWLock padded to 32 bytes and correctly aligned:
1 31s 2 41s 4 51s 8 97s
with LWLocks 32 bytes, but deliberately misaligned:
1 30s 2 50s 4 102s 8 200s

There is no other reason than having to touch multiple cache lines for
the second and third cases to be different: the array indexing code
should be exactly the same.

These last numbers are pretty close to what I got from the
separated-spinlock patch:
1 31s 2 52s 4 106s 8 213s
So it seems there's no doubt that it's the doubled cache traffic that
was causing most of the problem there.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 02:40:38
Message-ID: 25941.1126838438@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and
> the cmpb/jump removed I get:
> [ 1 55s 2 69s 4 128s ]
> This compares to the following, which is unpadded but has cmpb/jump
> removed but is otherwise vanilla:
> 1: 55s 2: 111s 4: 207s

Hmm, that's pretty significant. I tried it on a Xeon EM64T (thanks to
Stephen Frost for providing access), also 2 phys 4 log, and get:

Yesterday's CVS tip:
1 32s 2 46s 4 88s 8 168s
plus no-cmpb and spindelay2:
1 32s 2 48s 4 100s 8 177s
plus just-committed code to pad LWLock to 32:
1 33s 2 50s 4 98s 8 179s
alter to pad to 64:
1 33s 2 38s 4 108s 8 180s

I don't know what to make of the 2-process time going down while
4-process goes up; that seems just weird. But both numbers are
repeatable.

regards, tom lane


From: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 03:48:09
Message-ID: e692861c05091520486eb9307c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 9/15/05, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Yesterday's CVS tip:
> 1 32s 2 46s 4 88s 8 168s
> plus no-cmpb and spindelay2:
> 1 32s 2 48s 4 100s 8 177s
> plus just-committed code to pad LWLock to 32:
> 1 33s 2 50s 4 98s 8 179s
> alter to pad to 64:
> 1 33s 2 38s 4 108s 8 180s
>
> I don't know what to make of the 2-process time going down while
> 4-process goes up; that seems just weird. But both numbers are
> repeatable.

It is odd.

In the two process case there is, assuming random behavior, a 1/2
chance that you've already got the right line, but in the 4 process
case only a 1/4 chance (since we're on a 4 way box). This would
explain why we don't see as much cost in the intentionally misaligned
case. You'd expect a similar pattern of improvement with the
64-byte alignment (some in the two-process case, but more in the
4-process case), but here we see more improvement in the two-process case.

If I had to guess I might say that the 64-byte alignment is removing
much of the unneeded line bouncing in the two-process case but is
at the same time creating more risk of bouncing caused by aliasing.
Since two processes have a 1/2 chance, the aliasing isn't a problem, so
the change is a win; but in the four-process case it's no longer a win,
because with aliasing there is still a lot of fighting over the cache
lines even if you pack well, and the decrease in packing makes odd
aliasing somewhat more likely. This might also explain why the
misaligned case performed so poorly in the 4-process case, since the
misalignment didn't just increase the cost 2x, it also increased the
likelihood of a bogus bounce due to aliasing.

If this is the case, then it may be possible through very careful
memory alignment to make sure that no two high contention locks that
are likely to be contended at once share the same line (through either
aliasing or through being directly within the same line).

Then again I could be completely wrong, my understanding of
multiprocessor cache coherency is very limited, and I have no clue how
cache aliasing fits into it... So the above is just uninformed
conjecture.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: gmaxwell(at)gmail(dot)com
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 04:18:17
Message-ID: 26652.1126844297@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Maxwell <gmaxwell(at)gmail(dot)com> writes:
> If I had to guess I might say that the 64byte alignment is removing
> much of the unneeded line bouncing in the the two process case but is
> at the same time creating more risk of bouncing caused by aliasing.

It's an idea ... not sure if it's right or not.

One thing I have noticed both on Xeon HT (2-physical-4-logical) and
on Opteron (4 real processors) is that the 2-process times are
significantly more variable than the 1, 4, or 8-process times: repeating
the measurements shows variance around the 10% level for 2 processes
but only around 1% for the others.

What I think this means is that the kernel is scheduling the 2 processes
onto 2 processors chosen-at-random, without awareness of whether those
two processors are on the same chip (in the Xeon case) or have closer
NUMA affinity (in the Opteron case). The other test cases are all
symmetric and there's no way for the kernel to blow it too badly, but
in the 2-process case it will be very visible whether the kernel
understands the hardware or not. And the impression I have (which
might be out of date) is that current Linux kernel releases are not
very bright about these things.

How does that tie into the point at hand about different results for
64-byte vs 32-byte LWLocks? Not sure, but I feel it's related somehow.

Anyway, it'd be interesting to get some test results for these things
on SMP machines running non-Linux kernels. Also, we ought to be more
rigorous about reporting exactly which kernel we are using for any
particular test.

regards, tom lane


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: gmaxwell(at)gmail(dot)com, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 07:05:46
Message-ID: Pine.LNX.4.58.0509161441020.23762@linuxworld.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I thought I'd throw SPARC into the equation (SPARC IIIi, in a dual SunFire
v250):

vanilla HEAD from ~1 week ago:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done

real 1m49.037s
user 0m0.008s
sys 0m0.016s

real 2m3.482s
user 0m0.014s
sys 0m0.026s

real 3m54.215s
user 0m0.028s
sys 0m0.046s

With the spin-delay patch:

real 1m50.791s
user 0m0.008s
sys 0m0.016s

real 2m0.553s
user 0m0.015s
sys 0m0.026s

real 3m58.185s
user 0m0.027s
sys 0m0.048s

Padding the LWLock to 64 bytes:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done

real 1m49.594s
user 0m0.008s
sys 0m0.016s

real 1m59.141s
user 0m0.015s
sys 0m0.026s

real 3m56.236s
user 0m0.027s
sys 0m0.047s

Padded to 128 bytes (I'm not sure about the cache line size):

real 1m50.498s
user 0m0.008s
sys 0m0.016s

real 2m0.169s
user 0m0.015s
sys 0m0.026s

real 3m56.891s
user 0m0.027s
sys 0m0.048s

----

These details seem suspicious to me when compared to those we're seeing on
x86 platforms. The reason is that padding the LWLock out to 64 or 128
bytes makes no difference (maybe the cache line is bigger?). The good
thing is, the difference between 1 and 2 instances of the test is small --
precisely what we want.

Thanks,

Gavin


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Marko Kreen" <marko(at)l-t(dot)ee>, <pgsql-hackers(at)postgresql(dot)org>, "Michael Paesold" <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 09:00:50
Message-ID: KGEFLMPJFBNNLNOOOPLGMEPOCIAA.simon@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Tom Lane wrote
> I'm going to go ahead and make that change, since it doesn't
> seem likely
> to have any downside. It might be interesting to look into forcing
> proper alignment of the shared buffer headers as well.

Just catching up on your mails - all of that sounds good so far.

Everything mentioned so far talks about spinlocks in the general sense,
rather than with respect to particular locks.

The different lock types we have are held for different amounts of time
and are also subject to differing amounts of contention. I think it
would be useful to put in a capability to tune each class of lock
according to the possibility for delay and contention on it.

Long delay, low contention, e.g. CheckpointLock => waiter should sleep
immediately
Medium delay, high contention, e.g. WALWriteLock => waiter spins if at
the head of the queue, else sleeps immediately
Short delay, high contention, e.g. BufMappingLock => waiter spins and
retries forever, because the lock is coming soon
Short delay, low contention, sometimes long waits at end of VACUUM,
e.g. FreeSpaceLock => waiter spins, then eventually sleeps

I'm not sure whether you'll agree with my characterisation of these
locks, but the main thing I'm trying to get across is that
one-size-fits-all isn't the optimal approach. I'm not saying, either,
that we must end up with individual characterisations for each lock type,
but a few key ones could be catered for differently.
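
A rough sketch of how such a per-class characterisation might be expressed;
the names are illustrative, not an existing API:

/* Hypothetical wait policies matching the four classes above */
typedef enum LockWaitPolicy
{
    WAIT_SLEEP_IMMEDIATELY,     /* long hold, low contention (CheckpointLock) */
    WAIT_SPIN_IF_QUEUE_HEAD,    /* medium hold, high contention (WALWriteLock) */
    WAIT_SPIN_FOREVER,          /* short hold, high contention (BufMappingLock) */
    WAIT_SPIN_THEN_SLEEP        /* short hold, occasional long waits (FreeSpaceLock) */
} LockWaitPolicy;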

Best Regards, Simon Riggs


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, gmaxwell(at)gmail(dot)com, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Marko Kreen <marko(at)l-t(dot)ee>, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 16:40:25
Message-ID: 200509160940.26337.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

> What I think this means is that the kernel is scheduling the 2 processes
> onto 2 processors chosen-at-random, without awareness of whether those
> two processors are on the same chip (in the Xeon case) or have closer
> NUMA affinity (in the Opteron case).

That would be consistent with my experience with HT, and the reason why many
software vendors recommend disabling it. Not sure about NUMA.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, gmaxwell(at)gmail(dot)com, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Marko Kreen <marko(at)l-t(dot)ee>, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 18:35:03
Message-ID: 87br2ssvxk.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:

> Tom,
>
> > What I think this means is that the kernel is scheduling the 2 processes
> > onto 2 processors chosen-at-random, without awareness of whether those
> > two processors are on the same chip (in the Xeon case) or have closer
> > NUMA affinity (in the Opteron case).
>
> That would be consistent with my experience with HT, and the reason why many
> software vendors recommend disabling it. Not sure about NUMA.

What version of Linux was this test with? What you describe was certainly well
known for Linux 2.4 and earlier. It was well on its way to becoming
established dogma, cargo-cult style.

However I was under the impression that 2.6 had moved beyond that problem.
It would be very interesting to know if 2.6 still suffers from this.

Also, those who recommend disabling HT on this basis should realize it didn't
penalize the >2 process case the same way. You may be getting a benefit only
in the cases where your system isn't heavily loaded. Ie, only when it doesn't
really matter.

--
greg


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, gmaxwell(at)gmail(dot)com, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Marko Kreen <marko(at)l-t(dot)ee>, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-16 19:36:29
Message-ID: 20050916193629.GZ6026@ns.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Greg Stark (gsstark(at)mit(dot)edu) wrote:
> However I was under the impression that 2.6 had moved beyond that problem.
> It would be very interesting to know if 2.6 still suffers from this.

The tests on the em64t at my place were using 2.6.12. I had thought 2.6
was better about this too, but I don't have another explanation for it.
CONFIG_SMT was enabled, etc..

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, gmaxwell(at)gmail(dot)com, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Marko Kreen <marko(at)l-t(dot)ee>, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-17 04:31:41
Message-ID: 4672.1126931501@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Greg Stark (gsstark(at)mit(dot)edu) wrote:
>> However I was under the impression that 2.6 had moved beyond that problem.
>> It would be very interesting to know if 2.6 still suffers from this.

> The tests on the em64t at my place were using 2.6.12. I had thought 2.6
> was better about this too, but I don't have another explanation for it.

The 4-way Opteron I've been using at Red Hat is running
2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular
has been working hard in this area, and I thought that their recent
kernels included NUMA fixes that weren't yet accepted upstream (at least
not in the stable kernel branches). But it seems there's still a ways
to go yet.

It'd be real interesting to see comparable numbers from some non-Linux
kernels, particularly commercial systems like Solaris.

regards, tom lane


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, gmaxwell(at)gmail(dot)com, Marko Kreen <marko(at)l-t(dot)ee>, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-17 04:41:12
Message-ID: Pine.LNX.4.58.0509171440520.32213@linuxworld.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 17 Sep 2005, Tom Lane wrote:

> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > * Greg Stark (gsstark(at)mit(dot)edu) wrote:
> >> However I was under the impression that 2.6 had moved beyond that problem.
> >> It would be very interesting to know if 2.6 still suffers from this.
>
> > The tests on the em64t at my place were using 2.6.12. I had thought 2.6
> > was better about this too, but I don't have another explanation for it.
>
> The 4-way Opteron I've been using at Red Hat is running
> 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular
> has been working hard in this area, and I thought that their recent
> kernels included NUMA fixes that weren't yet accepted upstream (at least
> not in the stable kernel branches). But it seems there's still a ways
> to go yet.
>
> It'd be real interesting to see comparable numbers from some non-Linux
> kernels, particularly commercial systems like Solaris.

Did you see the Solaris results I posted?

Gavin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, gmaxwell(at)gmail(dot)com, Marko Kreen <marko(at)l-t(dot)ee>, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-17 05:40:28
Message-ID: 5220.1126935628@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> On Sat, 17 Sep 2005, Tom Lane wrote:
>> It'd be real interesting to see comparable numbers from some non-Linux
>> kernels, particularly commercial systems like Solaris.

> Did you see the Solaris results I posted?

Are you speaking of
http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php
?

That doesn't seem directly relevant to the point, because it's for a
2-CPU machine; so there's no way to run a test case that uses more than
one but less than all the processors. In either the "one" or "all"
cases, performance ought to be pretty stable regardless of whether the
kernel understands about any processor asymmetries that may exist in
the hardware. Not to mention that I don't know of any asymmetries in
a dual SPARC anyway. We really need to test this on comparable
hardware, which I guess means we need Solaris/x86 on something with
hyperthreading or known NUMA asymmetry.

regards, tom lane


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Stephen Frost <sfrost(at)snowman(dot)net>, Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, gmaxwell(at)gmail(dot)com, Marko Kreen <marko(at)l-t(dot)ee>, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-18 16:33:54
Message-ID: 20050918163354.GW7630@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote:
> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> > On Sat, 17 Sep 2005, Tom Lane wrote:
> >> It'd be real interesting to see comparable numbers from some non-Linux
> >> kernels, particularly commercial systems like Solaris.
>
> > Did you see the Solaris results I posted?
>
> Are you speaking of
> http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php
> ?
>
> That doesn't seem directly relevant to the point, because it's for a
> 2-CPU machine; so there's no way to run a test case that uses more than
> one but less than all the processors. In either the "one" or "all"
> cases, performance ought to be pretty stable regardless of whether the
> kernel understands about any processor asymmetries that may exist in
> the hardware. Not to mention that I don't know of any asymmetries in
> a dual SPARC anyway. We really need to test this on comparable
> hardware, which I guess means we need Solaris/x86 on something with
> hyperthreading or known NUMA asymmetry.

I have access to a 4-way Opteron 852 running Solaris 10. What patches
would you like me to test?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-18 21:39:05
Message-ID: Pine.LNX.4.58.0509190737580.2692@linuxworld.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 18 Sep 2005, Jim C. Nasby wrote:

> On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote:
> > Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> > > On Sat, 17 Sep 2005, Tom Lane wrote:
> > >> It'd be real interesting to see comparable numbers from some non-Linux
> > >> kernels, particularly commercial systems like Solaris.
> >
> > > Did you see the Solaris results I posted?
> >
> > Are you speaking of
> > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php
> > ?
> >
> > That doesn't seem directly relevant to the point, because it's for a
> > 2-CPU machine; so there's no way to run a test case that uses more than
> > one but less than all the processors. In either the "one" or "all"
> > cases, performance ought to be pretty stable regardless of whether the
> > kernel understands about any processor asymmetries that may exist in
> > the hardware. Not to mention that I don't know of any asymmetries in
> > a dual SPARC anyway. We really need to test this on comparable
> > hardware, which I guess means we need Solaris/x86 on something with
> > hyperthreading or known NUMA asymmetry.
>
> I have access to a 4-way Opteron 852 running Solaris 10. What patches
> would you like me to test?

These ones here:

http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php

Gavin


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-19 22:35:42
Message-ID: 200509192235.j8JMZgs22363@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I have removed this TODO item:

* Research use of sched_yield() for spinlock acquisition failure

---------------------------------------------------------------------------

Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Marko Kreen <marko(at)l-t(dot)ee> writes:
> > (I speculate that it's set up to only yield the processor to other
> > processes already affiliated to that processor. In any case, it
> > is definitely capable of getting through 10000 yields without
> > running the guy who's holding the spinlock.)
>
> > Maybe it should try sched_yield once and then use select after that?
>
> I tried that, actually, but it didn't seem to offer any particular
> benefit. (Maybe it would have helped more on older Linux kernels before
> they changed sched_yield?)
>
> I'm feeling even more disenchanted with sched_yield now that Marko
> pointed out that the behavior was changed recently. Here we have a
> construct that is not portable cross-platform, does not act as
> documented in its man page, and the kernel guys feel free to whack its
> behavior around in major ways without documenting that either. It seems
> to be a crap-shoot whether it will be useful on any particular machine
> or not. At least with the select() code we can be reasonably confident
> we know what will happen.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-23 02:26:34
Message-ID: 200509230226.j8N2QYH21927@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Is there a TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:
> The test case I just posted shows that our spinlock code, which we had
> thought largely done, is once again becoming a performance bottleneck.
> It's time to resurrect some of the ideas we kicked around in early
> 2002, and didn't pursue because we decided spinlocks weren't our major
> performance problem.
>
> I started investigating this by trying to understand why the "futex
> spinlock" patch had helped the Red Hat customer I mentioned, when it
> hadn't done anything much in last year's experiments. It turns out
> that this customer was using an 8-way Opteron, and the futex patch
> helps a lot more on that architecture than others. Some random
> experimentation eventually turned up part of the reason: the x86 TAS
> assembly code we are using is pessimal for x86_64. Specifically,
> s_lock.h uses this for both architectures:
>
> /* Use a non-locking test before asserting the bus lock */
> __asm__ __volatile__(
> " cmpb $0,%1 \n"
> " jne 1f \n"
> " lock \n"
> " xchgb %0,%1 \n"
> "1: \n"
>
> and it turns out that deleting the cmpb and jne makes things go
> significantly faster on Opterons. So the "non-locking test" is
> not a win at all on x86_64. As best I can tell from testing,
> removing it wins because it causes the rate of spinlock delays
> to drop significantly. Why this should be is not completely
> obvious. I speculate that the Opterons are capable of pulling
> in a copy of the cache line that contains the spinlock and then
> running 100 iterations of the cmpb test without ever noticing
> that the processor holding the lock has now released it. Without
> the cmpb, they are forced by the locking xchgb test to actually
> look at the real state of the spinlock each time.
>
> I kinda suspect that the cmpb test is a no-op or loss on all
> Intelish processors: it can only be a win if you expect a lot
> of contention for the spin lock, but in percentage terms we still
> have a very low conflict rate, so in most executions of the TAS
> macro, the cmpb is just wasted cycles. Bottom line: we definitely
> don't want it for x86_64, and maybe not at all, but we need more
> research to decide the latter.
>
> The second reason that the futex patch is helping is that when
> a spinlock delay does occur, it allows the delaying process to be
> awoken almost immediately, rather than delaying 10 msec or more
> as the existing code does. However, given that we are only expecting
> the spinlock to be held for a couple dozen instructions, using the
> kernel futex mechanism is huge overkill --- the in-kernel overhead
> to manage the futex state is almost certainly several orders of
> magnitude more than the delay we actually want.
>
> I looked into several other methods of doing the spinlock delay
> instead. I think all of these were suggested at one point or
> another in our earlier discussions of spinlocks:
>
> 1. Use sched_yield() if available: it does just what we want,
> ie, yield the processor without forcing any useless time delay
> before we can be rescheduled. This doesn't exist everywhere
> but it exists in recent Linuxen, so I tried it. It made for a
> beautiful improvement in my test case numbers: CPU utilization
> went to 100% and the context swap rate to almost nil. Unfortunately,
> I also saw fairly frequent "stuck spinlock" panics when running
> more queries than there were processors --- this despite increasing
> NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
> anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
> (I speculate that it's set up to only yield the processor to other
> processes already affiliated to that processor. In any case, it
> is definitely capable of getting through 10000 yields without
> running the guy who's holding the spinlock.)
>
> 2. Reduce the length of the select() delays in s_lock. The current code
> delays in quanta of 10msec, but on recent Linuxen (and I think other
> platforms too) the scheduling quantum is 1msec, so we can get the
> processor back faster if we ask for a 1msec delay. I tried this and it
> is definitely a win on Linux; it doesn't seem to hurt anything on older
> systems either, they just round the delay up to 10msec like before.
> So I think we should do this, even though it's only a partial answer.
>
> 3. Modify the spin loop to do a little more work between TAS() tests.
> In the existing code there are only about half a dozen instructions
> total in the normal spin loop, most of them very fast, with the result
> that the spinning processor does its best to monopolize the system bus
> with locked probe instructions. This is obviously not good, as it'll
> interfere with the ability of the spinlock holder to complete its work
> and release the lock. (The bulk of the spinlock uses are for LWLocks,
> and with the current data structures the LWLock's own state is usually
> going to be in the same cache line as the spinlock proper, so that
> cache grabs for the spinlock hurt the LWLock state updating as well
> as the eventual spinlock release.) I modified the code to use an
> integer modulo operation as part of the test for SPINS_PER_DELAY;
> on most processors this should keep the processor's attention and
> keep it away from the system bus for several times longer than an
> average instruction. (It's relevant here that the "rep;nop" used
> to slow the spin loop on recent Intel processors appears to be
> completely ignored by Opterons. Basically we need a non-hardware-
> specific trick to slow the loop.)
>
> 4. Modify SPINS_PER_DELAY. Looping more times instead of entering the
> kernel is a win for SMP boxes: whenever you give up and call the kernel
> to delay, it's likely that the spinlock holder will have released the
> lock before you even get all the way into the kernel, let alone get
> rescheduled again. I found that pushing SPINS_PER_DELAY above 200
> caused the select delay rate to fall to almost nil in this test case.
> However, that would be counterproductive for uniprocessors where the
> optimal strategy is normally to give up the processor immediately.
> We've talked about making the spin count user configurable, but I never
> cared for that answer. Instead, I've worked up some code to
> automatically adjust the spin count depending on whether we seem to be
> in a uniprocessor or multiprocessor. The idea is to decrease the spin
> count (down to a lower limit) whenever s_lock has to delay in order to
> get a lock, and to increase the count (up to a limit) whenever we see
> the lock come free before we've had to delay. The former condition
> suggests that we are in a uniprocessor and the latter suggests we are in
> a multiprocessor. I recall having complained that this idea would
> probably be unstable, but in testing it seems to do pretty nicely at
> converging to the minimum spin count on a 1-CPU machine and the maximum
> limit on an SMP box. It could do with more testing though to see if it
> works the same for other people.
>
> My proposal therefore is to do #2, #3, and #4, and to modify the TAS
> assembly code at least on x86_64. Together, these changes bring
> my test case on a 4-way Opteron from
>
> N, runtime: 1 36s 2 61s 4 105s 8 198s
>
> to
>
> N, runtime: 1 35s 2 48s 4 58s 8 105s
>
> At 4 queries, CPU utilization is 100% and the context swap rate nearly
> nil, which seems to mean we've gained another (temporary no doubt)
> victory over the spinlock problem.
>
> I attach two proposed patches: the first removes the cmpb/jne from
> the x86 TAS assembly code, and the second one does the s_lock changes
> enumerated as points #2, #3, #4. The first one in particular needs
> more testing to see if it hurts performance on any non-Opteron x86
> chips. (If so, we'd just make it conditional to x86_64.)
>
> Comments and testing invited.
>
> regards, tom lane
>

Content-Description: slock-no-cmpb.patch

> *** REL8_0/src/include/storage/s_lock.h Sun Aug 28 20:41:44 2005
> --- new/s_lock.h Fri Sep 9 14:58:44 2005
> ***************
> *** 120,132 ****
> {
> register slock_t _res = 1;
>
> ! /* Use a non-locking test before asserting the bus lock */
> __asm__ __volatile__(
> - " cmpb $0,%1 \n"
> - " jne 1f \n"
> - " lock \n"
> " xchgb %0,%1 \n"
> - "1: \n"
> : "+q"(_res), "+m"(*lock)
> :
> : "memory", "cc");
> --- 120,128 ----
> {
> register slock_t _res = 1;
>
> ! /* xchg implies a LOCK prefix, so no need to say LOCK explicitly */
> __asm__ __volatile__(
> " xchgb %0,%1 \n"
> : "+q"(_res), "+m"(*lock)
> :
> : "memory", "cc");

Content-Description: spin-delay.patch

> *** /home/postgres/pgsql/src/backend/storage/lmgr/proc.c.orig Sat Aug 20 19:26:24 2005
> --- /home/postgres/pgsql/src/backend/storage/lmgr/proc.c Sun Sep 11 15:07:20 2005
> ***************
> *** 170,175 ****
> --- 170,177 ----
>
> ProcGlobal->freeProcs = INVALID_OFFSET;
>
> + ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY;
> +
> /*
> * Pre-create the PGPROC structures and create a semaphore for
> * each.
> ***************
> *** 224,232 ****
> --- 226,239 ----
> /*
> * Try to get a proc struct from the free list. If this fails, we
> * must be out of PGPROC structures (not to mention semaphores).
> + *
> + * While we are holding the ProcStructLock, also copy the current
> + * shared estimate of spins_per_delay to local storage.
> */
> SpinLockAcquire(ProcStructLock);
>
> + set_spins_per_delay(procglobal->spins_per_delay);
> +
> myOffset = procglobal->freeProcs;
>
> if (myOffset != INVALID_OFFSET)
> ***************
> *** 318,338 ****
>
> Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);
>
> dummyproc = &DummyProcs[proctype];
>
> /*
> * dummyproc should not presently be in use by anyone else
> */
> if (dummyproc->pid != 0)
> elog(FATAL, "DummyProc[%d] is in use by PID %d",
> proctype, dummyproc->pid);
> MyProc = dummyproc;
>
> /*
> * Initialize all fields of MyProc, except MyProc->sem which was set
> * up by InitProcGlobal.
> */
> - MyProc->pid = MyProcPid; /* marks dummy proc as in use by me */
> SHMQueueElemInit(&(MyProc->links));
> MyProc->waitStatus = STATUS_OK;
> MyProc->xid = InvalidTransactionId;
> --- 325,362 ----
>
> Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);
>
> + /*
> + * Just for paranoia's sake, we use the ProcStructLock to protect
> + * assignment and releasing of DummyProcs entries.
> + *
> + * While we are holding the ProcStructLock, also copy the current
> + * shared estimate of spins_per_delay to local storage.
> + */
> + SpinLockAcquire(ProcStructLock);
> +
> + set_spins_per_delay(ProcGlobal->spins_per_delay);
> +
> dummyproc = &DummyProcs[proctype];
>
> /*
> * dummyproc should not presently be in use by anyone else
> */
> if (dummyproc->pid != 0)
> + {
> + SpinLockRelease(ProcStructLock);
> elog(FATAL, "DummyProc[%d] is in use by PID %d",
> proctype, dummyproc->pid);
> + }
> MyProc = dummyproc;
>
> + MyProc->pid = MyProcPid; /* marks dummy proc as in use by me */
> +
> + SpinLockRelease(ProcStructLock);
> +
> /*
> * Initialize all fields of MyProc, except MyProc->sem which was set
> * up by InitProcGlobal.
> */
> SHMQueueElemInit(&(MyProc->links));
> MyProc->waitStatus = STATUS_OK;
> MyProc->xid = InvalidTransactionId;
> ***************
> *** 509,514 ****
> --- 533,541 ----
> /* PGPROC struct isn't mine anymore */
> MyProc = NULL;
>
> + /* Update shared estimate of spins_per_delay */
> + procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
> +
> SpinLockRelease(ProcStructLock);
> }
>
> ***************
> *** 532,542 ****
> --- 559,576 ----
> /* Release any LW locks I am holding (see notes above) */
> LWLockReleaseAll();
>
> + SpinLockAcquire(ProcStructLock);
> +
> /* Mark dummy proc no longer in use */
> MyProc->pid = 0;
>
> /* PGPROC struct isn't mine anymore */
> MyProc = NULL;
> +
> + /* Update shared estimate of spins_per_delay */
> + ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
> +
> + SpinLockRelease(ProcStructLock);
> }
>
>
> *** /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c.orig Fri Aug 26 10:47:35 2005
> --- /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c Sun Sep 11 17:43:20 2005
> ***************
> *** 21,26 ****
> --- 21,30 ----
> #include "storage/s_lock.h"
> #include "miscadmin.h"
>
> +
> + static int spins_per_delay = DEFAULT_SPINS_PER_DELAY;
> +
> +
> /*
> * s_lock_stuck() - complain about a stuck spinlock
> */
> ***************
> *** 49,102 ****
> * We loop tightly for awhile, then delay using pg_usleep() and try
> * again. Preferably, "awhile" should be a small multiple of the
> * maximum time we expect a spinlock to be held. 100 iterations seems
> ! * about right. In most multi-CPU scenarios, the spinlock is probably
> ! * held by a process on another CPU and will be released before we
> ! * finish 100 iterations. However, on a uniprocessor, the tight loop
> ! * is just a waste of cycles, so don't iterate thousands of times.
> *
> * Once we do decide to block, we use randomly increasing pg_usleep()
> ! * delays. The first delay is 10 msec, then the delay randomly
> ! * increases to about one second, after which we reset to 10 msec and
> * start again. The idea here is that in the presence of heavy
> * contention we need to increase the delay, else the spinlock holder
> * may never get to run and release the lock. (Consider situation
> * where spinlock holder has been nice'd down in priority by the
> * scheduler --- it will not get scheduled until all would-be
> ! * acquirers are sleeping, so if we always use a 10-msec sleep, there
> * is a real possibility of starvation.) But we can't just clamp the
> * delay to an upper bound, else it would take a long time to make a
> * reasonable number of tries.
> *
> * We time out and declare error after NUM_DELAYS delays (thus, exactly
> * that many tries). With the given settings, this will usually take
> ! * 3 or so minutes. It seems better to fix the total number of tries
> * (and thus the probability of unintended failure) than to fix the
> * total time spent.
> *
> ! * The pg_usleep() delays are measured in centiseconds (0.01 sec) because
> ! * 10 msec is a common resolution limit at the OS level.
> */
> ! #define SPINS_PER_DELAY 100
> #define NUM_DELAYS 1000
> ! #define MIN_DELAY_CSEC 1
> ! #define MAX_DELAY_CSEC 100
>
> ! int spins = 0;
> int delays = 0;
> ! int cur_delay = MIN_DELAY_CSEC;
>
> while (TAS(lock))
> {
> /* CPU-specific delay each time through the loop */
> SPIN_DELAY();
>
> ! /* Block the process every SPINS_PER_DELAY tries */
> ! if (++spins > SPINS_PER_DELAY)
> {
> if (++delays > NUM_DELAYS)
> s_lock_stuck(lock, file, line);
>
> ! pg_usleep(cur_delay * 10000L);
>
> #if defined(S_LOCK_TEST)
> fprintf(stdout, "*");
> --- 53,121 ----
> * We loop tightly for awhile, then delay using pg_usleep() and try
> * again. Preferably, "awhile" should be a small multiple of the
> * maximum time we expect a spinlock to be held. 100 iterations seems
> ! * about right as an initial guess. However, on a uniprocessor the
> ! * loop is a waste of cycles, while in a multi-CPU scenario it's usually
> ! * better to spin a bit longer than to call the kernel, so we try to
> ! * adapt the spin loop count depending on whether we seem to be in
> ! * a uniprocessor or multiprocessor.
> *
> * Once we do decide to block, we use randomly increasing pg_usleep()
> ! * delays. The first delay is 1 msec, then the delay randomly
> ! * increases to about one second, after which we reset to 1 msec and
> * start again. The idea here is that in the presence of heavy
> * contention we need to increase the delay, else the spinlock holder
> * may never get to run and release the lock. (Consider situation
> * where spinlock holder has been nice'd down in priority by the
> * scheduler --- it will not get scheduled until all would-be
> ! * acquirers are sleeping, so if we always use a 1-msec sleep, there
> * is a real possibility of starvation.) But we can't just clamp the
> * delay to an upper bound, else it would take a long time to make a
> * reasonable number of tries.
> *
> * We time out and declare error after NUM_DELAYS delays (thus, exactly
> * that many tries). With the given settings, this will usually take
> ! * 2 or so minutes. It seems better to fix the total number of tries
> * (and thus the probability of unintended failure) than to fix the
> * total time spent.
> *
> ! * The pg_usleep() delays are measured in milliseconds because 1 msec
> ! * is a common resolution limit at the OS level for newer platforms.
> ! * On older platforms the resolution limit is usually 10 msec, in
> ! * which case the total delay before timeout will be a bit more.
> */
> ! #define MIN_SPINS_PER_DELAY 10
> ! #define MAX_SPINS_PER_DELAY 1000
> #define NUM_DELAYS 1000
> ! #define MIN_DELAY_MSEC 1
> ! #define MAX_DELAY_MSEC 1000
>
> ! int spins = spins_per_delay;
> int delays = 0;
> ! int cur_delay = 0;
>
> while (TAS(lock))
> {
> /* CPU-specific delay each time through the loop */
> SPIN_DELAY();
>
> ! /*
> ! * Block the process every spins_per_delay tries.
> ! *
> ! * What we are really testing for here is spins being decremented
> ! * to zero. We insert an unnecessary integer modulo operation
> ! * into the test because we'd like this loop to run longer than
> ! * just two or three instructions: ideally, the processor should
> ! * not be contending for the system bus for a little while here.
> ! */
> ! if ((--spins % MAX_SPINS_PER_DELAY) == 0)
> {
> if (++delays > NUM_DELAYS)
> s_lock_stuck(lock, file, line);
>
> ! if (cur_delay == 0) /* first time to delay? */
> ! cur_delay = MIN_DELAY_MSEC;
> !
> ! pg_usleep(cur_delay * 1000L);
>
> #if defined(S_LOCK_TEST)
> fprintf(stdout, "*");
> ***************
> *** 107,119 ****
> cur_delay += (int) (cur_delay *
> (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
> /* wrap back to minimum delay when max is exceeded */
> ! if (cur_delay > MAX_DELAY_CSEC)
> ! cur_delay = MIN_DELAY_CSEC;
>
> ! spins = 0;
> }
> }
> }
>
> /*
> * Various TAS implementations that cannot live in s_lock.h as no inline
> --- 126,200 ----
> cur_delay += (int) (cur_delay *
> (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
> /* wrap back to minimum delay when max is exceeded */
> ! if (cur_delay > MAX_DELAY_MSEC)
> ! cur_delay = MIN_DELAY_MSEC;
>
> ! spins = spins_per_delay;
> }
> }
> +
> + /*
> + * If we were able to acquire the lock without delaying, it's a good
> + * indication we are in a multiprocessor. If we had to delay, it's
> + * a sign (but not a sure thing) that we are in a uniprocessor.
> + * Hence, we decrement spins_per_delay slowly when we had to delay,
> + * and increase it rapidly when we didn't. It's expected that
> + * spins_per_delay will converge to the minimum value on a uniprocessor
> + * and to the maximum value on a multiprocessor.
> + *
> + * Note: spins_per_delay is local within our current process.
> + * We want to average these observations across multiple backends,
> + * since it's relatively rare for this function to even get entered,
> + * and so a single backend might not live long enough to converge on
> + * a good value. That is handled by the two routines below.
> + */
> + if (cur_delay == 0)
> + {
> + /* we never had to delay */
> + spins_per_delay += 100;
> + spins_per_delay = Min(spins_per_delay, MAX_SPINS_PER_DELAY);
> + }
> + else
> + {
> + spins_per_delay -= 1;
> + spins_per_delay = Max(spins_per_delay, MIN_SPINS_PER_DELAY);
> + }
> + }
> +
> +
> + /*
> + * Set local copy of spins_per_delay during backend startup.
> + *
> + * NB: this has to be pretty fast as it is called while holding a spinlock
> + */
> + void
> + set_spins_per_delay(int shared_spins_per_delay)
> + {
> + spins_per_delay = shared_spins_per_delay;
> + }
> +
> + /*
> + * Update shared estimate of spins_per_delay during backend exit.
> + *
> + * NB: this has to be pretty fast as it is called while holding a spinlock
> + */
> + int
> + update_spins_per_delay(int shared_spins_per_delay)
> + {
> + /*
> + * We use an exponential moving average with a relatively slow
> + * adaption rate, so that noise in any one backend's result won't
> + * affect the shared value too much. As long as both inputs are
> + * within the allowed range, the result must be too, so we need not
> + * worry about clamping the result.
> + *
> + * We deliberately truncate rather than rounding; this is so that
> + * single adjustments inside a backend can affect the shared estimate
> + * (see the asymmetric adjustment rules above).
> + */
> + return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
> }
> +
>
> /*
> * Various TAS implementations that cannot live in s_lock.h as no inline
> *** /home/postgres/pgsql/src/include/storage/proc.h.orig Sat Aug 20 19:26:34 2005
> --- /home/postgres/pgsql/src/include/storage/proc.h Sun Sep 11 15:07:09 2005
> ***************
> *** 105,110 ****
> --- 105,112 ----
> {
> /* Head of list of free PGPROC structures */
> SHMEM_OFFSET freeProcs;
> + /* Current shared estimate of appropriate spins_per_delay value */
> + int spins_per_delay;
> } PROC_HDR;
>
>
> *** /home/postgres/pgsql/src/include/storage/s_lock.h.orig Sun Aug 28 20:41:34 2005
> --- /home/postgres/pgsql/src/include/storage/s_lock.h Sun Sep 11 15:07:09 2005
> ***************
> *** 826,829 ****
> --- 826,835 ----
> */
> extern void s_lock(volatile slock_t *lock, const char *file, int line);
>
> + /* Support for dynamic adjustment of spins_per_delay */
> + #define DEFAULT_SPINS_PER_DELAY 100
> +
> + extern void set_spins_per_delay(int shared_spins_per_delay);
> + extern int update_spins_per_delay(int shared_spins_per_delay);
> +
> #endif /* S_LOCK_H */

>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-10-12 17:44:50
Message-ID: 1129139090.8300.568.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> I wrote:
> > Another thought came to mind: maybe the current data layout for LWLocks
> > is bad. Right now, the spinlock that protects each LWLock data struct
> > is itself part of the struct, and since the structs aren't large (circa
> > 20 bytes), the whole thing is usually all in the same cache line. ...
> > Maybe it'd be better to allocate the spinlocks off by themselves.
>
> Well, this is odd. I made up a patch to do this (attached) and found
> that it pretty much sucks. Still the 4-way Opteron, previous best
> (slock-no-cmpb and spin-delay-2):
> 1 31s 2 42s 4 51s 8 100s
> with lwlock-separate added:
> 1 31s 2 52s 4 106s 8 213s
>
> What I had expected to see was a penalty in the single-thread case (due
> to instructions added to the LWLock functions), but there isn't any.
> I did not expect to see a factor-of-2 penalty for four threads.
>
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent. Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-). Why the heck is this so bad?

(Just going back through the whole thread for completeness.)

> Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there
is also potential for improving performance.

The possible problem we were trying to avoid here was false sharing of
the cache line by lock requestors of locks whose spin locks were
adjacent in memory.

Splitting the data structure was just one of the possible ways of
avoiding that. The results above show that this particular solution
didn't work, but there could be others.
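
For illustration only, here is a minimal sketch of one other possibility:
pad each spinlock out to a full cache line, so that two hot locks can
never land on the same line. The 128-byte line size, the PaddedSpinLock
type and the array name are assumptions made up for the sketch, not
anything that exists in the tree:

  #define CACHE_LINE_SIZE 128

  typedef struct
  {
      volatile unsigned char lock;    /* the spinlock byte itself */
      /* pad out the rest of the cache line so neighbours stay off it */
      char pad[CACHE_LINE_SIZE - sizeof(unsigned char)];
  } PaddedSpinLock;

  static PaddedSpinLock fixedSpinLocks[16];  /* one cache line per lock */

In practice the array itself would also have to start on a cache-line
boundary for this to work as intended, which the sketch glosses over.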

One thing we tried in February was padding out the statically defined
locks with dummy lock definitions in the enum. This has the effect of
ensuring that the most contested locks are very definitely in their own
cache line and not shared with others.
That showed a noticeable improvement in performance, and it costs very
little to implement, even if the code would require some explanatory
documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
BufMappingLock,
BufFreelistLock,
LockMgrLock,
OidGenLock,
XidGenLock,
ProcArrayLock,
SInvalLock,
FreeSpaceLock,
WALInsertLock,
WALWriteLock,
...

Notice that the heavily contested locks (i.e. the first three and the WAL
locks) are adjacent to at least one other heavily contested lock. So they
are certain to be in the same cache line and therefore to cause false
sharing on all CPU types, not just Intel and AMD (ref: the manufacturers'
optimization handbooks).

This could be replaced by...

typedef enum LWLockId
{
BufMappingLock,
BufMappingLock_Padding1,
BufMappingLock_Padding2,
BufMappingLock_Padding3,
BufMappingLock_Padding4,
BufMappingLock_Padding5,
BufMappingLock_Padding6,
BufMappingLock_Padding7,
BufFreelistLock,
BufFreelistLock_Padding1,
BufFreelistLock_Padding2,
BufFreelistLock_Padding3,
BufFreelistLock_Padding4,
BufFreelistLock_Padding5,
BufFreelistLock_Padding6,
BufFreelistLock_Padding7,
LockMgrLock,
LockMgrLock_Padding1,
LockMgrLock_Padding2,
LockMgrLock_Padding3,
LockMgrLock_Padding4,
LockMgrLock_Padding5,
LockMgrLock_Padding6,
LockMgrLock_Padding7,
OidGenLock,
XidGenLock,
ProcArrayLock,
SInvalLock,
FreeSpaceLock,
WALInsertLock,
WALInsertLock_Padding1,
WALInsertLock_Padding2,
WALInsertLock_Padding3,
WALInsertLock_Padding4,
WALInsertLock_Padding5,
WALInsertLock_Padding6,
WALInsertLock_Padding7,
WALWriteLock,
WALWriteLock_Padding1,
WALWriteLock_Padding2,
WALWriteLock_Padding3,
WALWriteLock_Padding4,
WALWriteLock_Padding5,
WALWriteLock_Padding6,
WALWriteLock_Padding7,
...

where the number of padding locks is determined by how many lock
structures fit within a 128 byte cache line.

This isn't exactly elegant coding, but it provides a useful improvement
on an 8-way SMP box when run on the 8.0 base. OK, let's be brutal: this
looks pretty darn stupid. But it does follow the CPU optimization
handbook advice, and I did see a noticeable improvement in performance
and a reduction in context switching.
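
If the verbosity is the main objection, the dummies could be generated
with a macro. The sketch below is illustrative only: it assumes eight
LWLock structs fit in a 128-byte line, and the LWLOCK_PADDED macro is an
invention for this example, not something in lwlock.h:

  #define LWLOCK_PADDED(lock) \
      lock, \
      lock##_Padding1, lock##_Padding2, lock##_Padding3, \
      lock##_Padding4, lock##_Padding5, lock##_Padding6, lock##_Padding7

  typedef enum LWLockId
  {
      LWLOCK_PADDED(BufMappingLock),
      LWLOCK_PADDED(BufFreelistLock),
      LWLOCK_PADDED(LockMgrLock),
      OidGenLock,
      XidGenLock,
      ProcArrayLock,
      SInvalLock,
      FreeSpaceLock,
      LWLOCK_PADDED(WALInsertLock),
      LWLOCK_PADDED(WALWriteLock)
      /* ... remaining lock ids as before ... */
  } LWLockId;

The expansion is the same as the hand-written list above, just easier to
keep consistent if the per-line count ever has to change.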

I'm not in a position to try this again now on 8.1beta, but I'd welcome
a performance test result from anybody that is. I'll supply a patch
against 8.1beta for anyone wanting to test this.

Best Regards, Simon Riggs


From: Emil Briggs <emil(at)baymountain(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-10-12 19:07:23
Message-ID: 200510121507.23413.emil@baymountain.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> where the number of padding locks is determined by how many lock
> structures fit within a 128 byte cache line.
>
> This isn't exactly elegant coding, but it provides a useful improvement
> on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> pretty darn stupid. But it does follow the CPU optimization handbook
> advice and I did see a noticeable improvement in performance and a
> reduction in context switching.
>
> I'm not in a position to try this again now on 8.1beta, but I'd welcome
> a performance test result from anybody that is. I'll supply a patch
> against 8.1beta for anyone wanting to test this.
>

I don't have an 8 way available right now but I can run tests on a 4 way
Opteron if that would be helpful.

Emil


From: Mark Wong <markw(at)osdl(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-10-13 17:57:39
Message-ID: 200510131758.j9DHwi4s012517@smtp.osdl.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 12 Oct 2005 18:44:50 +0100
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> > I wrote:
> > > Another thought came to mind: maybe the current data layout for LWLocks
> > > is bad. Right now, the spinlock that protects each LWLock data struct
> > > is itself part of the struct, and since the structs aren't large (circa
> > > 20 bytes), the whole thing is usually all in the same cache line. ...
> > > Maybe it'd be better to allocate the spinlocks off by themselves.
> >
> > Well, this is odd. I made up a patch to do this (attached) and found
> > that it pretty much sucks. Still the 4-way Opteron, previous best
> > (slock-no-cmpb and spin-delay-2):
> > 1 31s 2 42s 4 51s 8 100s
> > with lwlock-separate added:
> > 1 31s 2 52s 4 106s 8 213s
> >
> > What I had expected to see was a penalty in the single-thread case (due
> > to instructions added to the LWLock functions), but there isn't any.
> > I did not expect to see a factor-of-2 penalty for four threads.
> >
> > I guess what this means is that there's no real problem with losing the
> > cache line while manipulating the LWLock, which is what the patch was
> > intended to prevent. Instead, we're paying for swapping two cache lines
> > (the spinlock and the LWLock) across processors instead of just one line.
> > But that should at worst be a 2x inflation of the time previously spent
> > in LWLockAcquire/Release, which is surely not yet all of the application
> > ;-). Why the heck is this so bad?
>
> (Just going back through the whole thread for completeness.)
>
> > Should we expect that apparently
> > minor changes in shared data structures might be costing equivalently
> > huge penalties in SMP performance elsewhere?
>
> Yes. That's the advice from Intel and AMD; but we should add that there
> is potential for improving performance also.
>
> The possible problem we were trying to avoid here was false sharing of
> the cache line by lock requestors of locks whose spin locks were
> adjacent in memory.
>
> Splitting the data structure was just one of the possible ways of
> avoiding that. The above shows that the possible solution described
> above didn't work, but there could be others.
>
> One thing we tried in February was padding out the statically defined
> locks with dummy lock definitions in the enum. This has the effect of
> ensuring that the most contested locks are very definitely in their own
> cache line and not shared with others.
> That showed a noticeable improvement in performance, probably because it
> costs very little to implement, even if the code would require some
> explanatory documentation.
>
> The lwlock structure in include/storage/lwlock.h looks like
>
> typedef enum LWLockId
> {
> BufMappingLock,
> BufFreelistLock,
> LockMgrLock,
> OidGenLock,
> XidGenLock,
> ProcArrayLock,
> SInvalLock,
> FreeSpaceLock,
> WALInsertLock,
> WALWriteLock,
> ...
>
> Notice that the heavily contested locks (i.e. first 3 and the WAL locks)
> are adjacent to at least one other heavily contested lock. So they are
> certain to be in the same cache line and therefore to cause false
> sharing (on all CPU types, not just Intel and AMD (ref: Manufacturer's
> optimization handbooks).
>
> This could be replaced by...
>
> typedef enum LWLockId
> {
> BufMappingLock,
> BufMappingLock_Padding1,
> BufMappingLock_Padding2,
> BufMappingLock_Padding3,
> BufMappingLock_Padding4,
> BufMappingLock_Padding5,
> BufMappingLock_Padding6,
> BufMappingLock_Padding7,
> BufFreelistLock,
> BufFreelistLock_Padding1,
> BufFreelistLock_Padding2,
> BufFreelistLock_Padding3,
> BufFreelistLock_Padding4,
> BufFreelistLock_Padding5,
> BufFreelistLock_Padding6,
> BufFreelistLock_Padding7,
> LockMgrLock,
> LockMgrLock_Padding1,
> LockMgrLock_Padding2,
> LockMgrLock_Padding3,
> LockMgrLock_Padding4,
> LockMgrLock_Padding5,
> LockMgrLock_Padding6,
> LockMgrLock_Padding7,
> OidGenLock,
> XidGenLock,
> ProcArrayLock,
> SInvalLock,
> FreeSpaceLock,
> WALInsertLock,
> WALInsertLock_Padding1,
> WALInsertLock_Padding2,
> WALInsertLock_Padding3,
> WALInsertLock_Padding4,
> WALInsertLock_Padding5,
> WALInsertLock_Padding6,
> WALInsertLock_Padding7,
> WALWriteLock,
> WALWriteLock_Padding1,
> WALWriteLock_Padding2,
> WALWriteLock_Padding3,
> WALWriteLock_Padding4,
> WALWriteLock_Padding5,
> WALWriteLock_Padding6,
> WALWriteLock_Padding7,
> ...
>
> where the number of padding locks is determined by how many lock
> structures fit within a 128 byte cache line.
>
> This isn't exactly elegant coding, but it provides a useful improvement
> on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> pretty darn stupid. But it does follow the CPU optimization handbook
> advice and I did see a noticeable improvement in performance and a
> reduction in context switching.
>
> I'm not in a position to try this again now on 8.1beta, but I'd welcome
> a performance test result from anybody that is. I'll supply a patch
> against 8.1beta for anyone wanting to test this.

I'll take a patch. I have a 4-processor (8-core) POWER5 system just
about ready.

Mark


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Emil Briggs <emil(at)baymountain(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-10-17 18:58:38
Message-ID: 20051017185838.GG86144@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 12, 2005 at 03:07:23PM -0400, Emil Briggs wrote:
> > where the number of padding locks is determined by how many lock
> > structures fit within a 128 byte cache line.
> >
> > This isn't exactly elegant coding, but it provides a useful improvement
> > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > pretty darn stupid. But it does follow the CPU optimization handbook
> > advice and I did see a noticeable improvement in performance and a
> > reduction in context switching.
> >
> > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > a performance test result from anybody that is. I'll supply a patch
> > against 8.1beta for anyone wanting to test this.
> >
>
> I don't have an 8 way available right now but I can run tests on a 4 way
> Opteron if that would be helpful.

Likewise, I can test on a 4-way Opteron running OpenSolaris.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Mark Wong <markw(at)osdl(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-10-20 22:03:47
Message-ID: 1129845827.8300.822.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> >
> > This isn't exactly elegant coding, but it provides a useful improvement
> > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > pretty darn stupid. But it does follow the CPU optimization handbook
> > advice and I did see a noticeable improvement in performance and a
> > reduction in context switching.

> > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > a performance test result from anybody that is. I'll supply a patch
> > against 8.1beta for anyone wanting to test this.
>
> Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> I've just set up and probably needs a bit of tuning. I don't see much
> difference but I'm wondering if the cacheline sizes are dramatically
> different from Intel/AMD processors. I still need to take a closer look
> to make sure I haven't grossly mistuned anything, but I'll let everyone
> take a look:

Well, the POWER5 architecture probably has the lowest overall memory
delay you can get currently, so in some ways that would negate the
effects of the patch. (The cache line is still 128 bytes, AFAICS.) But
it's clear the patch isn't significantly better here, unlike with 8.0
when we tried this on the 8-way Itanium in February.

> cvs 20051013
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> 2501 notpm
>
> cvs 20051013 w/ lw.patch
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> 2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Best Regards, Simon Riggs


From: Mark Wong <markw(at)osdl(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-11-01 00:10:44
Message-ID: 200511010012.jA10C3W6026010@smtp.osdl.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 20 Oct 2005 23:03:47 +0100
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> > >
> > > This isn't exactly elegant coding, but it provides a useful improvement
> > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > > pretty darn stupid. But it does follow the CPU optimization handbook
> > > advice and I did see a noticeable improvement in performance and a
> > > reduction in context switching.
>
> > > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > > a performance test result from anybody that is. I'll supply a patch
> > > against 8.1beta for anyone wanting to test this.
> >
> > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> > I've just set up and probably needs a bit of tuning. I don't see much
> > difference but I'm wondering if the cacheline sizes are dramatically
> > different from Intel/AMD processors. I still need to take a closer look
> > to make sure I haven't grossly mistuned anything, but I'll let everyone
> > take a look:
>
> Well, the Power 5 architecture probably has the lowest overall memory
> delay you can get currently so in some ways that would negate the
> effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's
> clear the patch isn't significantly better (like it was with 8.0 when we
> tried this on the 8-way Itanium in Feb).
>
> > cvs 20051013
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> > 2501 notpm
> >
> > cvs 20051013 w/ lw.patch
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> > 2519 notpm
>
> Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and
redid the disk layout. Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch. A quick glance over the
iostat data suggests I'm still not i/o bound, but the i/o wait is rather
high according to vmstat. Will try to see if there's anything else
obvious to get the load up higher.

Mark


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Mark Wong <markw(at)osdl(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-11-01 07:32:32
Message-ID: 1130830352.8300.1563.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:
> On Thu, 20 Oct 2005 23:03:47 +0100
> Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> > On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> > > >
> > > > This isn't exactly elegant coding, but it provides a useful improvement
> > > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > > > pretty darn stupid. But it does follow the CPU optimization handbook
> > > > advice and I did see a noticeable improvement in performance and a
> > > > reduction in context switching.
> >
> > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > > > a performance test result from anybody that is. I'll supply a patch
> > > > against 8.1beta for anyone wanting to test this.
> > >
> > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> > > I've just set up and probably needs a bit of tuning. I don't see much
> > > difference but I'm wondering if the cacheline sizes are dramatically
> > > different from Intel/AMD processors. I still need to take a closer look
> > > to make sure I haven't grossly mistuned anything, but I'll let everyone
> > > take a look:
> >
> > Well, the Power 5 architecture probably has the lowest overall memory
> > delay you can get currently so in some ways that would negate the
> > effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's
> > clear the patch isn't significantly better (like it was with 8.0 when we
> > tried this on the 8-way Itanium in Feb).
> >
> > > cvs 20051013
> > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> > > 2501 notpm
> > >
> > > cvs 20051013 w/ lw.patch
> > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> > > 2519 notpm
> >
> > Could you re-run with wal_buffers = 32 ? (Without patch) Thanks
>
> Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and
> redid the disk layout. Here's where I'm at now:
>
> cvs 20051013
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
> 3257 notpm
>
> cvs 20051013 w/ lw.patch
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
> 3285 notpm
>
> Still not much of a difference with the patch. A quick glance over the
> iostat data suggests I'm still not i/o bound, but the i/o wait is rather
> high according to vmstat. Will try to see if there's anything else
> obvious to get the load up higher.

OK, that's fine. I'm glad there's some gain, but not much yet. I think we
should leave off doing any more tests on lw.patch for now.

I'm concerned about the awful checkpointing. Can you bump wal_buffers to
8192 just to make sure? That's way too high, but just to prove it.

We need to reduce the number of blocks to be written at checkpoint.

bgwriter_all_maxpages 5 -> 15
bgwriter_all_percent 0.333
bgwriter_delay 200
bgwriter_lru_maxpages 5 -> 7
bgwriter_lru_percent 1

shared_buffers set lower to 100000
(which should cause some amusement on-list)

Best Regards, Simon Riggs


From: Mark Wong <markw(at)osdl(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-11-03 16:03:48
Message-ID: 200511031605.jA3G57W6031430@smtp.osdl.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 01 Nov 2005 07:32:32 +0000
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:
> > On Thu, 20 Oct 2005 23:03:47 +0100
> > Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >
> > > On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> > > > >
> > > > > This isn't exactly elegant coding, but it provides a useful improvement
> > > > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > > > > pretty darn stupid. But it does follow the CPU optimization handbook
> > > > > advice and I did see a noticeable improvement in performance and a
> > > > > reduction in context switching.
> > >
> > > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > > > > a performance test result from anybody that is. I'll supply a patch
> > > > > against 8.1beta for anyone wanting to test this.
> > > >
> > > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> > > > I've just set up and probably needs a bit of tuning. I don't see much
> > > > difference but I'm wondering if the cacheline sizes are dramatically
> > > > different from Intel/AMD processors. I still need to take a closer look
> > > > to make sure I haven't grossly mistuned anything, but I'll let everyone
> > > > take a look:
> > >
> > > Well, the Power 5 architecture probably has the lowest overall memory
> > > delay you can get currently so in some ways that would negate the
> > > effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's
> > > clear the patch isn't significantly better (like it was with 8.0 when we
> > > tried this on the 8-way Itanium in Feb).
> > >
> > > > cvs 20051013
> > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> > > > 2501 notpm
> > > >
> > > > cvs 20051013 w/ lw.patch
> > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> > > > 2519 notpm
> > >
> > > Could you re-run with wal_buffers = 32 ? (Without patch) Thanks
> >
> > Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and
> > redid the disk layout. Here's where I'm at now:
> >
> > cvs 20051013
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
> > 3257 notpm
> >
> > cvs 20051013 w/ lw.patch
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
> > 3285 notpm
> >
> > Still not much of a difference with the patch. A quick glance over the
> > iostat data suggests I'm still not i/o bound, but the i/o wait is rather
> > high according to vmstat. Will try to see if there's anything else
> > obvious to get the load up higher.
>
> OK, thats fine. I'm glad there's some gain, but not much yet. I think we
> should leave out doing any more tests on lw.patch for now.
>
> Concerned about the awful checkpointing. Can you bump wal_buffers to
> 8192 just to make sure? Thats way too high, but just to prove it.
>
> We need to rdeuce the number of blocks to be written at checkpoint.
>
> bgwriter_all_maxpages 5 -> 15
> bgwriter_all_percent 0.333
> bgwriter_delay 200
> bgwriter_lru_maxpages 5 -> 7
> bgwriter_lru_percent 1
>
> shared_buffers set lower to 100000
> (which should cause some amusement on-list)

Okay, here goes, all with the same source base w/ the lw.patch:

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
only increased wal_buffers to 8192 from 2048
3242 notpm

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
3019 notpm (but more interesting graph)

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
Same as the previously listed run with shared_buffers lowered to 10000
2503 notpm

Mark


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Mark Wong <markw(at)osdl(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-11-03 18:29:09
Message-ID: 1131042549.8300.2002.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote:
> On Tue, 01 Nov 2005 07:32:32 +0000
> Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Concerned about the awful checkpointing. Can you bump wal_buffers to
> > 8192 just to make sure? Thats way too high, but just to prove it.
> >
> > We need to rdeuce the number of blocks to be written at checkpoint.
> >
> > bgwriter_all_maxpages 5 -> 15
> > bgwriter_all_percent 0.333
> > bgwriter_delay 200
> > bgwriter_lru_maxpages 5 -> 7
> > bgwriter_lru_percent 1
> >
> > shared_buffers set lower to 100000
> > (which should cause some amusement on-list)
>
>
> Okay, here goes, all with the same source base w/ the lw.patch:
>
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
> only increased wal_buffers to 8192 from 2048
> 3242 notpm

That looks to me like a clear negative effect from increasing
wal_buffers. Try putting it back down to 1024.
Looks like we need to plug that gap.

> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
> only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
> 3019 notpm (but more interesting graph)

Man, that sucks. What the heck is happening there? Hackers - if you're
watching, you should see this graph - it shows some very poor behaviour.

I'm not happy with that performance at all... any chance you could re-run
that exact same test to see if the result is repeatable?

I see you have
vm.dirty_writeback_centisecs = 0

which pretty much means the pdflush daemons aren't ever writing to disk,
even when the bgwriter is active.

Could we set the bgwriter stuff back to default and try
vm.dirty_writeback_centisecs = 500
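
(For reference, that can be applied on the fly with "sysctl -w
vm.dirty_writeback_centisecs=500", or made persistent by putting the line
vm.dirty_writeback_centisecs = 500 in /etc/sysctl.conf.)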

> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
> Same as the previously listed run with shared_buffers lowered to 10000
> 2503 notpm

Sorry, that was 100,000 not 10,000.

Looks like we need dates on the log_line_prefix so we can check the
logs.
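
For example, something like

log_line_prefix = '%t '

in postgresql.conf should do it; the %t escape stamps each log line with
the time.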

...not sure about the oprofile results. Seems to show CreateLWLocks
being as high as xlog_insert, which is mad. Either that shows startup
time is excessive, or it means the oprofile timing range is too short.
Not sure which.

Best Regards, Simon Riggs


From: Mark Wong <markw(at)osdl(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-11-04 21:09:14
Message-ID: 200511042110.jA4LAWnO009682@smtp.osdl.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 03 Nov 2005 18:29:09 +0000
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote:
> > On Tue, 01 Nov 2005 07:32:32 +0000
> > Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > > Concerned about the awful checkpointing. Can you bump wal_buffers to
> > > 8192 just to make sure? That's way too high, but just to prove it.
> > >
> > > We need to reduce the number of blocks to be written at checkpoint.
> > >
> > > bgwriter_all_maxpages 5 -> 15
> > > bgwriter_all_percent 0.333
> > > bgwriter_delay 200
> > > bgwriter_lru_maxpages 5 -> 7
> > > bgwriter_lru_percent 1
> > >
> > > shared_buffers set lower to 100000
> > > (which should cause some amusement on-list)
> >
> >
> > Okay, here goes, all with the same source base w/ the lw.patch:
> >
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
> > only increased wal_buffers to 8192 from 2048
> > 3242 notpm
>
> That looks to me like a clear negative effect from increasing
> wal_buffers. Try putting it back down to 1024.
> Looks like we need to plug that gap.
>
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
> > only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
> > 3019 notpm (but more interesting graph)
>
> Man, that sucks. What the heck is happening there? Hackers - if you're
> watching, you should see this graph - it shows some very poor behaviour.
>
> I'm not happy with that performance at all... any chance you could re-run
> that exact same test to see if the result is repeatable?
>
> I see you have
> vm.dirty_writeback_centisecs = 0
>
> which pretty much means the pdflush daemons aren't ever writing to disk,
> even when the bgwriter is active.
>
> Could we set the bgwriter stuff back to default and try
> vm.dirty_writeback_centisecs = 500

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/47/
3309 notpm

> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
> > Same as the previously listed run with shared_buffers lowered to 10000
> > 2503 notpm
>
> Sorry, that was 100,000 not 10,000.

Oops!
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/46/
2794 notpm

> Looks like we need dates on the log_line_prefix so we can check the
> logs.

Oops again! I didn't check to make sure I had set this correctly before
I ran the last two tests; I'll get on it.

> ...not sure about the oprofile results. Seems to show CreateLWLocks
> being as high as xlog_insert, which is mad. Either that shows startup
> time is excessive, or it means the oprofile timing range is too short.
> Not sure which.

Yeah, we've seen this before. I think I'll have to try pulling the
oprofile cvs code to see if there's any improvement. I've been working
with oprofile-0.9.1.

Mark


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2006-06-14 21:47:13
Message-ID: 200606142147.k5ELlDl28970@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Added to TODO list.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> > I wrote:
> > > Another thought came to mind: maybe the current data layout for LWLocks
> > > is bad. Right now, the spinlock that protects each LWLock data struct
> > > is itself part of the struct, and since the structs aren't large (circa
> > > 20 bytes), the whole thing is usually all in the same cache line. ...
> > > Maybe it'd be better to allocate the spinlocks off by themselves.
> >
> > Well, this is odd. I made up a patch to do this (attached) and found
> > that it pretty much sucks. Still the 4-way Opteron, previous best
> > (slock-no-cmpb and spin-delay-2):
> > 1 thread: 31s   2: 42s   4: 51s   8: 100s
> > with lwlock-separate added:
> > 1 thread: 31s   2: 52s   4: 106s   8: 213s
> >
> > What I had expected to see was a penalty in the single-thread case (due
> > to instructions added to the LWLock functions), but there isn't any.
> > I did not expect to see a factor-of-2 penalty for four threads.
> >
> > I guess what this means is that there's no real problem with losing the
> > cache line while manipulating the LWLock, which is what the patch was
> > intended to prevent. Instead, we're paying for swapping two cache lines
> > (the spinlock and the LWLock) across processors instead of just one line.
> > But that should at worst be a 2x inflation of the time previously spent
> > in LWLockAcquire/Release, which is surely not yet all of the application
> > ;-). Why the heck is this so bad?
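
To make the comparison concrete, here is a minimal sketch of the two layouts
being weighed above; the type and field names are illustrative stand-ins, not
the actual structs from lwlock.c or the posted patch.

/* Illustrative stand-ins; the real definitions live in s_lock.h and proc.h */
typedef unsigned char slock_t;
struct PGPROC;

/*
 * Current layout: the spinlock is embedded in the LWLock, so the mutex
 * and the state it protects normally share one cache line.
 */
typedef struct LWLockEmbedded
{
    slock_t         mutex;      /* spinlock protecting the fields below */
    char            exclusive;  /* # of exclusive holders (0 or 1) */
    int             shared;     /* # of shared holders */
    struct PGPROC  *head;       /* head of wait queue */
    struct PGPROC  *tail;       /* tail of wait queue */
} LWLockEmbedded;

/*
 * lwlock-separate layout: the spinlocks live in their own array, so
 * LWLockAcquire/Release now pull in two cache lines (the spinlock's
 * and the LWLock's) instead of one.
 */
typedef struct LWLockSeparate
{
    char            exclusive;
    int             shared;
    struct PGPROC  *head;
    struct PGPROC  *tail;
} LWLockSeparate;

extern slock_t *LWLockMutexes;  /* one spinlock per LWLock, allocated elsewhere */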
>
> (Just going back through the whole thread for completeness.)
>
> > Should we expect that apparently
> > minor changes in shared data structures might be costing equivalently
> > huge penalties in SMP performance elsewhere?
>
> Yes. That's the advice from Intel and AMD; but we should add that there
> is potential for improving performance also.
>
> The possible problem we were trying to avoid here was false sharing of
> the cache line by lock requestors of locks whose spin locks were
> adjacent in memory.
>
> Splitting the data structure was just one of the possible ways of
> avoiding that. The above shows that the possible solution described
> above didn't work, but there could be others.
>
> One thing we tried in February was padding out the statically defined
> locks with dummy lock definitions in the enum. This has the effect of
> ensuring that the most contested locks are very definitely in their own
> cache line and not shared with others.
> That showed a noticeable improvement in performance, and it costs very
> little to implement, even if the code would require some explanatory
> documentation.
>
> The lwlock structure in include/storage/lwlock.h looks like
>
> typedef enum LWLockId
> {
> BufMappingLock,
> BufFreelistLock,
> LockMgrLock,
> OidGenLock,
> XidGenLock,
> ProcArrayLock,
> SInvalLock,
> FreeSpaceLock,
> WALInsertLock,
> WALWriteLock,
> ...
>
> Notice that the heavily contested locks (i.e. the first 3 and the WAL locks)
> are adjacent to at least one other heavily contested lock. So they are
> certain to be in the same cache line and therefore to cause false
> sharing, on all CPU types, not just Intel and AMD (ref: the manufacturers'
> optimization handbooks).
>
> This could be replaced by...
>
> typedef enum LWLockId
> {
> BufMappingLock,
> BufMappingLock_Padding1,
> BufMappingLock_Padding2,
> BufMappingLock_Padding3,
> BufMappingLock_Padding4,
> BufMappingLock_Padding5,
> BufMappingLock_Padding6,
> BufMappingLock_Padding7,
> BufFreelistLock,
> BufFreelistLock_Padding1,
> BufFreelistLock_Padding2,
> BufFreelistLock_Padding3,
> BufFreelistLock_Padding4,
> BufFreelistLock_Padding5,
> BufFreelistLock_Padding6,
> BufFreelistLock_Padding7,
> LockMgrLock,
> LockMgrLock_Padding1,
> LockMgrLock_Padding2,
> LockMgrLock_Padding3,
> LockMgrLock_Padding4,
> LockMgrLock_Padding5,
> LockMgrLock_Padding6,
> LockMgrLock_Padding7,
> OidGenLock,
> XidGenLock,
> ProcArrayLock,
> SInvalLock,
> FreeSpaceLock,
> WALInsertLock,
> WALInsertLock_Padding1,
> WALInsertLock_Padding2,
> WALInsertLock_Padding3,
> WALInsertLock_Padding4,
> WALInsertLock_Padding5,
> WALInsertLock_Padding6,
> WALInsertLock_Padding7,
> WALWriteLock,
> WALWriteLock_Padding1,
> WALWriteLock_Padding2,
> WALWriteLock_Padding3,
> WALWriteLock_Padding4,
> WALWriteLock_Padding5,
> WALWriteLock_Padding6,
> WALWriteLock_Padding7,
> ...
>
> where the number of padding locks is determined by how many lock
> structures fit within a 128-byte cache line.
>
> This isn't exactly elegant coding, but it provides a useful improvement
> on an 8-way SMP box when run on the 8.0 base. OK, let's be brutal: this looks
> pretty darn stupid. But it does follow the CPU optimization handbook
> advice and I did see a noticeable improvement in performance and a
> reduction in context switching.
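
To make the sizing above concrete, a minimal sketch of how the number of
padding entries per hot lock could be derived (the line size and slot size
here are assumptions, not measured values):

#include <stdio.h>

#define CACHE_LINE_SIZE         128     /* Power5 line size assumed in this thread */
#define LWLOCK_SLOT_SIZE         16     /* assumed size of one LWLock in the array */
#define PADDING_LOCKS_PER_HOT_LOCK (CACHE_LINE_SIZE / LWLOCK_SLOT_SIZE - 1)

int
main(void)
{
    /* With the numbers assumed above this prints 7, matching the seven
     * _PaddingN dummy entries per heavily contested lock shown above. */
    printf("%d\n", PADDING_LOCKS_PER_HOT_LOCK);
    return 0;
}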
>
> I'm not in a position to try this again now on 8.1beta, but I'd welcome
> a performance test result from anybody that is. I'll supply a patch
> against 8.1beta for anyone wanting to test this.
>
> Best Regards, Simon Riggs
>
>

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2006-06-15 02:14:26
Message-ID: e6qfqa$jne$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


"Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote
>
> Added to TODO list.
>
> > One thing we tried in February was padding out the statically defined
> > locks with dummy lock definitions in the enum. This has the effect of
> > ensuring that the most contested locks are very definitely in their own
> > cache line and not shared with others.
> > That showed a noticeable improvement in performance, and it costs very
> > little to implement, even if the code would require some explanatory
> > documentation.
> >

Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Regards,
Qingqing


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2006-06-15 02:23:01
Message-ID: 200606150223.k5F2N1v26905@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Qingqing Zhou wrote:
>
> "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote
> >
> > Added to TODO list.
> >
> > > One thing we tried in February was padding out the statically defined
> > > locks with dummy lock definitions in the enum. This has the effect of
> > > ensuring that the most contested locks are very definitely in their own
> > > cache line and not shared with others.
> > > That showed a noticeable improvement in performance, and it costs very
> > > little to implement, even if the code would require some explanatory
> > > documentation.
> > >
>
> Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Oh, yes, thanks. I thought it had been, but I couldn't find anything in the
area of the code where he proposed the patch.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2006-06-15 02:28:49
Message-ID: 16503.1150338529@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
>>> One thing we tried in February was padding out the statically defined
>>> locks with dummy lock definitions in the enum.

> Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Not really --- that patch was intended to ensure that LWLocks don't
unnecessarily cross two cache lines. It doesn't ensure that two
different LWLocks aren't sharing a cache line. You could do that
by increasing LWLOCK_PADDED_SIZE to the cache line size for your
hardware, if you know what that is.
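
A minimal sketch of what that might look like, assuming a 128-byte line and
an illustrative LWLock layout (not the actual lwlock.c definitions):

/* Illustrative stand-ins for the real types */
typedef unsigned char slock_t;
struct PGPROC;

typedef struct LWLockSketch
{
    slock_t         mutex;
    char            exclusive;
    int             shared;
    struct PGPROC  *head;
    struct PGPROC  *tail;
} LWLockSketch;

/* Pad each array slot out to the (assumed) cache line size, so no two
 * LWLocks can ever share a line. */
#define LWLOCK_PADDED_SIZE_SKETCH 128

typedef union LWLockPaddedSketch
{
    LWLockSketch    lock;
    char            pad[LWLOCK_PADDED_SIZE_SKETCH];
} LWLockPaddedSketch;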

I think a more effective answer might be to twiddle the order of
"enum LWLockId" items so that the most heavily used LWLocks aren't
close together. Haven't looked into it though.
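
A rough sketch of that reordering idea; the hot/cold labels follow the
"first 3 and the WAL locks" remark earlier in this thread, the names are
suffixed to mark them as illustrative, and the ordering is only schematic:

typedef enum LWLockIdSketch
{
    BufMappingLock_S,       /* hot  */
    OidGenLock_S,           /* cold */
    XidGenLock_S,           /* cold */
    BufFreelistLock_S,      /* hot  */
    SInvalLock_S,           /* cold */
    FreeSpaceLock_S,        /* cold */
    LockMgrLock_S,          /* hot  */
    ProcArrayLock_S,
    WALInsertLock_S,        /* hot  */
    /*
     * In a real reordering WALInsertLock and WALWriteLock would also be
     * separated; with circa-20-byte LWLocks and a 128-byte cache line,
     * each pair of hot locks needs several cooler entries between them
     * before they land in different lines.
     */
    WALWriteLock_S          /* hot  */
} LWLockIdSketch;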

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2006-06-15 02:58:53
Message-ID: e6qidu$r69$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> Not really --- that patch was intended to ensure that LWLocks don't
> unnecessarily cross two cache lines. It doesn't ensure that two
> different LWLocks aren't sharing a cache line. You could do that
> by increasing LWLOCK_PADDED_SIZE to the cache line size for your
> hardware, if you know what that is.
>
Exactly, this is one way -- if we make LWLOCK_PADDED_SIZE big enough, we can
ensure one LWLock per cache line. If so, maybe we should plug an LMBench-like
check into ./configure to guess the current cache line size. But that looks
like overkill -- a compromise is to pad only some of the LWLocks that far,
not all of them (for example, the buffer content lock array).
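
For what it's worth, a minimal sketch of one way such a probe could report
the line size without an LMBench-style timing loop; _SC_LEVEL1_DCACHE_LINESIZE
is glibc/Linux-specific and the fallback value is just a guess:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    long        linesize = -1;

#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc exposes the L1 data cache line size via sysconf() */
    linesize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif

    if (linesize <= 0)
        linesize = 128;         /* conservative fallback guess */

    printf("%ld\n", linesize);
    return 0;
}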

Regards,
Qingqing