BufFreelistLock

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: BufFreelistLock
Date: 2010-12-09 04:28:00
Message-ID: AANLkTin+jt+fV9SKqFdkxy0RbbPg_LPeTpvCQVBwYQiJ@mail.gmail.com
Lists: pgsql-hackers

I think that the BufFreelistLock can be a contention bottleneck on a
system with a lot of CPUs that do a lot of shared-buffer allocations
which can be fulfilled by the OS buffer cache. That is, read-mostly
queries where the working data set fits in RAM, but not in
shared_buffers. (You can always increase shared_buffers, but that
leads to other problems, and who wants to spend their time
micromanaging the size of shared_buffers as workloads slowly change?)

I can't prove it is a contention bottleneck without first solving the
putative problem and timing the difference, but it is the dominant
blocking lock showing up under LWLOCK_STATS for one benchmark I've
done using 8 CPUs.

So I had two questions.

1) Would it be useful for BufFreelistLock to be partitioned, like
BufMappingLock, or via some kind of clever "virtual partitioning" that
could get the same benefit via another means? I don't know if both
the linked list and the clock sweep would have to be partitioned, or
if some other arrangement could be made.

2) Could BufFreelistLock simply go away, by reducing it from an lwlock
to a spinlock, at least in the most common paths?

For doing away with it, I think that any manipulation of the freelist
is short enough (just a few instructions) that it could be done under
a spinlock. If you somehow obtained a buffer that is pinned or has a
nonzero usage_count, you would have to retake the spinlock to look at
the new head of the chain, but the comments in StrategyGetBuffer
suggest that that should be rare or impossible.
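
To make the freelist idea concrete, here is a minimal sketch, not
PostgreSQL's actual code. It assumes a C11 atomic_flag as a stand-in
for a real spinlock, and every fl_-prefixed name, the cut-down buffer
header, and the four-buffer pool are hypothetical. It pops the head of
the list under the spinlock and retakes the spinlock if the popped
buffer turns out to be unusable. The real code would also have to look
at the buffer header under its own header spinlock, which is omitted
here. The allocation counter is bumped under the same spinlock.

```c
#include <assert.h>
#include <stdatomic.h>

#define FL_NBUFFERS    4
#define FL_END_OF_LIST (-1)   /* head value when the freelist is empty */
#define FL_NOT_IN_LIST (-2)   /* freeNext value for a buffer off the list */

/* Hypothetical, cut-down stand-in for a buffer header. */
typedef struct
{
    int refcount;     /* number of pins */
    int usage_count;  /* clock-sweep popularity counter */
    int freeNext;     /* next buffer on the freelist */
} FlBufferDesc;

static FlBufferDesc fl_buffers[FL_NBUFFERS];
static atomic_flag  fl_lock = ATOMIC_FLAG_INIT;  /* stand-in spinlock */
static int          fl_firstFreeBuffer = FL_END_OF_LIST;
static unsigned     fl_numBufferAllocs = 0;      /* protected by fl_lock */

static void fl_spin_acquire(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* busy-wait; a real spinlock would add back-off */
}

static void fl_spin_release(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Put every buffer on the freelist in index order. Not thread-safe. */
static void fl_init(void)
{
    for (int i = 0; i < FL_NBUFFERS; i++)
    {
        fl_buffers[i].refcount = 0;
        fl_buffers[i].usage_count = 0;
        fl_buffers[i].freeNext =
            (i + 1 < FL_NBUFFERS) ? i + 1 : FL_END_OF_LIST;
    }
    fl_firstFreeBuffer = 0;
    fl_numBufferAllocs = 0;
}

/*
 * Return the index of a usable buffer taken from the freelist, or -1 if
 * the list is empty (the caller would then fall back to the clock sweep).
 */
static int fl_get_free_buffer(void)
{
    int first_attempt = 1;

    for (;;)
    {
        int buf;

        fl_spin_acquire(&fl_lock);
        if (first_attempt)
        {
            fl_numBufferAllocs++;   /* count each request exactly once */
            first_attempt = 0;
        }
        buf = fl_firstFreeBuffer;
        if (buf < 0)
        {
            fl_spin_release(&fl_lock);
            return -1;
        }
        assert(fl_buffers[buf].freeNext != FL_NOT_IN_LIST);
        /* Unlink the head: only a few instructions under the lock. */
        fl_firstFreeBuffer = fl_buffers[buf].freeNext;
        fl_buffers[buf].freeNext = FL_NOT_IN_LIST;
        fl_spin_release(&fl_lock);

        /* Usable only if nobody has pinned or touched it meanwhile. */
        if (fl_buffers[buf].refcount == 0 && fl_buffers[buf].usage_count == 0)
            return buf;
        /* Otherwise loop: retake the spinlock and try the new head. */
    }
}
```

In this sketch a pinned buffer at the head of the list is simply
dropped from the list and the next one is tried, so the retry costs
one extra acquire and release of the spinlock.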

For the clock sweep algorithm, I think you could access
nextVictimBuffer without any type of locking. If a non-atomic
increment causes an occasional buffer to be skipped or examined twice,
that doesn't seem like a correctness problem. When nextVictimBuffer
gets reset to zero and completePasses gets incremented, that would
probably need to be protected to prevent a double-increment of
completePasses from throwing off the background writer's usage
estimations. But again, a spinlock should be enough for that. And it
shouldn't occur all that often.
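
A rough sketch of that clock-hand idea follows, again with assumptions
stated up front: the cs_-prefixed names and the four-buffer pool are
hypothetical, a C11 atomic_flag stands in for a spinlock, and the
"non-atomic increment" is modelled as a separate relaxed load and
store so that the C code itself stays well defined. Two processes can
still read the same value and lose an update, which is the tolerated
case. Only the wraparound takes the spinlock, and it rechecks the hand
so that two processes that both saw the last slot do not both count
the same pass. A stale unlocked store from a third process could still
disturb the count; the sketch does not try to rule that out.

```c
#include <assert.h>
#include <stdatomic.h>

#define CS_NBUFFERS 4

static atomic_uint cs_nextVictimBuffer = 0;   /* advanced without a lock */
static unsigned    cs_completePasses = 0;     /* protected by cs_lock */
static atomic_flag cs_lock = ATOMIC_FLAG_INIT; /* stand-in spinlock */

static void cs_spin_acquire(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* busy-wait; a real spinlock would add back-off */
}

static void cs_spin_release(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Return the buffer index the hand pointed at, then advance the hand. */
static int cs_clock_tick(void)
{
    unsigned victim =
        atomic_load_explicit(&cs_nextVictimBuffer, memory_order_relaxed);

    assert(victim < CS_NBUFFERS);
    if (victim + 1 < CS_NBUFFERS)
    {
        /*
         * Common path: load and store are separate steps, so a concurrent
         * caller may examine the same buffer or skip one. That is accepted.
         */
        atomic_store_explicit(&cs_nextVictimBuffer, victim + 1,
                              memory_order_relaxed);
    }
    else
    {
        /* Rare path: wrap to zero and count the pass under the spinlock. */
        cs_spin_acquire(&cs_lock);
        if (atomic_load_explicit(&cs_nextVictimBuffer,
                                 memory_order_relaxed) == victim)
        {
            atomic_store_explicit(&cs_nextVictimBuffer, 0,
                                  memory_order_relaxed);
            cs_completePasses++;
        }
        /* else: another process already wrapped; do not count it again */
        cs_spin_release(&cs_lock);
    }
    return (int) victim;
}

/* What a background-writer-like reader would call to get the pass count. */
static unsigned cs_read_complete_passes(void)
{
    unsigned passes;

    cs_spin_acquire(&cs_lock);
    passes = cs_completePasses;
    cs_spin_release(&cs_lock);
    return passes;
}
```

With CS_NBUFFERS set to 4, only every fourth call takes the spinlock;
with a realistic buffer count the locked path would be reached once
per full sweep.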

If potentially inaccurate non-atomic increments of numBufferAllocs are
a problem, it could be incremented under the same spinlock used to
protect the test firstFreeBuffer >= 0 to determine if the freelist is
empty.

Doing away with the lock without some form of partitioning might just
move the contention to the BufHdr spinlocks. But if most of the
processes entering the code at about the same time perceive each
other's increments to nextVictimBuffer, they would all start out offset
from each other and shouldn't collide too badly.

Does any of this sound like it might be fruitful to look into?

Cheers,

Jeff
