Re: Spinlocks, yet again: analysis and proposed patches

From: Marko Kreen <marko(at)l-t(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-13 20:46:11
Message-ID: 20050913204611.GA24295@l-t.ee
Lists: pgsql-hackers

On Tue, Sep 13, 2005 at 10:10:13AM -0400, Tom Lane wrote:
> Marko Kreen <marko(at)l-t(dot)ee> writes:
> > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> >> However, given that we are only expecting
> >> the spinlock to be held for a couple dozen instructions, using the
> >> kernel futex mechanism is huge overkill --- the in-kernel overhead
> >> to manage the futex state is almost certainly several orders of
> >> magnitude more than the delay we actually want.
>
> > Why do you think so? AFAIK on uncontented case there will be no
> > kernel access, only atomic inc/dec.
>
> In the uncontended case, we never even enter s_lock() and so the entire
> mechanism of yielding is irrelevant. The problem that's being exposed
> by these test cases is that on multiprocessors, you can see a
> significant rate of spinlock contention (order of 100 events/second,
> which is still a tiny fraction of the number of TAS calls) and our
> existing mechanism for dealing with contention is just not efficient
> enough.
>
> > On contented case you'll want task switch anyway, so the futex
> > managing should not matter.
>
> No, we DON'T want a task switch. That's the entire point: in a
> multiprocessor, it's a good bet that the spinlock is held by a task
> running on another processor, and doing a task switch will take orders
> of magnitude longer than just spinning until the lock is released.
> You should yield only after spinning long enough to make it a strong
> probability that the spinlock is held by a process that's lost the
> CPU and needs to be rescheduled.

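To make the policy quoted above concrete, here is a minimal sketch of "spin first, yield only after a bounded number of failed attempts". It is not PostgreSQL's s_lock() code: demo_spinlock, demo_spin_lock() and SPINS_BEFORE_YIELD are hypothetical names, the threshold of 100 is an arbitrary placeholder, and C11 atomic_flag stands in for the platform-specific TAS.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical tuning constant, not a value taken from PostgreSQL. */
#define SPINS_BEFORE_YIELD 100

/* Hypothetical lock type for illustration only. */
typedef struct {
    atomic_flag locked;
} demo_spinlock;

static void demo_spin_init(demo_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/*
 * Acquire the lock. Returns how many times the caller gave up the CPU,
 * so 0 means the lock was obtained by spinning alone.
 */
static int demo_spin_lock(demo_spinlock *l)
{
    int spins = 0;
    int yields = 0;

    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
        /*
         * Keep spinning while it is still likely that the holder is
         * running on another CPU; only after many failures assume the
         * holder lost its CPU and let the scheduler run something else.
         */
        if (++spins >= SPINS_BEFORE_YIELD) {
            sched_yield();
            yields++;
            spins = 0;
        }
    }
    return yields;
}

static void demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

In the uncontended case demo_spin_lock() returns 0 after a single test-and-set, which matches the point that the yielding logic only matters once contention is actually observed.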

Hmm. I guess this could be separated into 2 cases:

1. Light load - neither the lock owner nor the lock requester gets
descheduled while busy (owner in critical section, waiter
spinning).
2. Heavy load - either or both of them get descheduled while busy
(the waiter is descheduled by the OS, or voluntarily, e.g. by calling select()).

So my impression is that currently you optimize for #1 at the
expense of #2, while with futexes you'd optimize for #2 at
the expense of #1. Additionally I'm pretty convinced that
futexes give you the most efficient implementation for #2, as the
kernel knows which processes are waiting on a particular lock, so it
can make the best scheduling decisions.
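
For illustration, a futex-based lock along these lines could look roughly like the sketch below. It is Linux-only and follows the well-known three-state design (0 = free, 1 = locked, 2 = locked with possible waiters). It is not a proposed patch. All demo_* names are hypothetical, and the syscall counter exists only to show that the uncontended path never enters the kernel.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical lock type: 0 = free, 1 = locked, 2 = locked, maybe waiters. */
typedef struct {
    _Atomic uint32_t state;
} demo_futex_lock;

/* Demo-only counter of futex system calls actually issued. */
static atomic_long demo_futex_syscalls;

static long demo_futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
    atomic_fetch_add(&demo_futex_syscalls, 1);
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void demo_futex_acquire(demo_futex_lock *l)
{
    uint32_t c = 0;

    /* Uncontended case: one atomic compare-and-swap, no kernel entry. */
    if (atomic_compare_exchange_strong(&l->state, &c, 1))
        return;

    /* Contended case: mark the lock as having waiters, then sleep. */
    if (c != 2)
        c = atomic_exchange(&l->state, 2);
    while (c != 0) {
        /* Sleeps only if state is still 2; otherwise returns at once. */
        demo_futex(&l->state, FUTEX_WAIT, 2);
        c = atomic_exchange(&l->state, 2);
    }
}

static void demo_futex_release(demo_futex_lock *l)
{
    /* If the old state was 1 nobody is waiting, so no syscall is needed. */
    if (atomic_fetch_sub(&l->state, 1) != 1) {
        atomic_store(&l->state, 0);
        demo_futex(&l->state, FUTEX_WAKE, 1); /* wake one waiter */
    }
}
```

Note that this sketch never spins: a waiter that finds the lock taken goes to sleep in the kernel right away, which is exactly the trade-off for case #1 described above.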

> > If you don't want Linux-specific locking in core code, then
> > it's another matter...
>
> Well, it's true, we don't particularly want a one-platform solution,
> but if it did what we wanted we might hold our noses and use it anyway.
>
> (I think, BTW, that using futexes at the spinlock level is misguided;
> what would be interesting would be to see if we could substitute for
> both LWLock and spinlock logic with one futex-based module.)

Use pthreads ;)

> >> I also saw fairly frequent "stuck spinlock" panics when running
> >> more queries than there were processors --- this despite increasing
> >> NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
> >> anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
> >> (I speculate that it's set up to only yield the processor to other
> >> processes already affiliated to that processor. In any case, it
> >> is definitely capable of getting through 10000 yields without
> >> running the guy who's holding the spinlock.)
>
> > This is intended behaviour of sched_yield.
>
> > http://lwn.net/Articles/31462/
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2
>
> No; that page still says specifically "So a process calling
> sched_yield() now must wait until all other runnable processes in the
> system have used up their time slices before it will get the processor
> again." I can prove that that is NOT what happens, at least not on
> a multi-CPU Opteron with current FC4 kernel. However, if the newer
> kernels penalize a process calling sched_yield as heavily as this page
> claims, then it's not what we want anyway ...

My fault. When I saw that there was a problem with sched_yield, I
said "I bet this is because of the behaviour change" and only
skimmed the exact details. But the point that sched_yield is
not meant for such usage still stands.

About fast yielding, the comment on sys_sched_yield() says:

* sys_sched_yield - yield the current processor to other threads.
*
* this function yields the current CPU by moving the calling thread
* to the expired array. If there are no other threads running on this
* CPU then this function will return.

So there just is nothing else to schedule on that CPU.

--
marko
