Design notes for BufMgrLock rewrite

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Design notes for BufMgrLock rewrite
Date: 2005-02-13 22:07:52
Message-ID: 15341.1108332472@sss.pgh.pa.us
Lists: pgsql-hackers

I'm working on an experimental patch to break up the BufMgrLock along
the lines we discussed a few days ago --- in particular, using a clock
sweep algorithm instead of LRU lists for the buffer replacement strategy.
I started by writing up some design notes, which are attached for
review in case anyone has better ideas.

One thing I realized quickly is that there is no natural way in a clock
algorithm to discourage VACUUM from blowing out the cache. I came up
with a slightly ugly idea that's described below. Can anyone do better?

regards, tom lane

Buffer manager's internal locking
---------------------------------

Before PostgreSQL 8.1, all operations of the shared buffer manager itself
were protected by a single system-wide lock, the BufMgrLock, which
unsurprisingly proved to be a source of contention. The new locking scheme
avoids grabbing system-wide exclusive locks in common code paths. It works
like this:

* There is a system-wide LWLock, the BufMappingLock, that notionally
protects the mapping from buffer tags (page identifiers) to buffers.
(Physically, it can be thought of as protecting the hash table maintained
by buf_table.c.) To look up whether a buffer exists for a tag, it is
sufficient to obtain share lock on the BufMappingLock. Note that one
must pin the found buffer, if any, before releasing the BufMappingLock.
To alter the page assignment of any buffer, one must hold exclusive lock
on the BufMappingLock. This lock must be held across adjusting the buffer's
header fields and changing the buf_table hash table. The only common
operation that needs exclusive lock is reading in a page that was not
in shared buffers already, which will require at least a kernel call
and usually a wait for I/O, so it will be slow anyway.

* A separate system-wide LWLock, the BufFreelistLock, provides mutual
exclusion for operations that access the buffer free list or select
buffers for replacement. This is always taken in exclusive mode since
there are no read-only operations on those data structures. The buffer
management policy is designed so that BufFreelistLock need not be taken
except in paths that will require I/O, and thus will be slow anyway.
(Details appear below.) It is never necessary to hold the BufMappingLock
and the BufFreelistLock at the same time.

* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
ReleaseBuffer to make local state changes without taking any system-wide
lock. We use a spinlock, not an LWLock, since there are no cases where
the lock needs to be held for more than a few instructions.

Note that a buffer header's spinlock does not control access to the data
held within the buffer. Each buffer header also contains an LWLock, the
"buffer context lock", that *does* represent the right to access the data
in the buffer. It is used per the rules above.

There is yet another set of per-buffer LWLocks, the io_in_progress locks,
that are used to wait for I/O on a buffer to complete. The process doing
a read or write takes exclusive lock for the duration, and processes that
need to wait for completion try to take shared locks (which they release
immediately upon obtaining). XXX on systems where an LWLock represents
nontrivial resources, it's fairly annoying to need so many locks. Possibly
we could use per-backend LWLocks instead (a buffer header would then contain
a field to show which backend is doing its I/O).
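
[To make the locking rules above concrete, here is a minimal standalone C
sketch. It is illustrative only, not actual PostgreSQL code: pthread rwlocks
and mutexes stand in for LWLocks and the buffer-header spinlocks, a linear
search stands in for the buf_table hash lookup, and all names are invented.]

    #include <pthread.h>
    #include <stddef.h>

    typedef struct
    {
        unsigned    rel;            /* which relation */
        unsigned    block;          /* which block within it */
    } BufferTag;

    typedef struct
    {
        BufferTag   tag;            /* page held in this buffer */
        int         refcount;       /* pin count */
        int         recently_used;  /* flag consulted by the clock sweep */
        pthread_mutex_t  hdr_lock;        /* the buffer-header "spinlock" */
        pthread_rwlock_t content_lock;    /* the "buffer context lock" */
        pthread_rwlock_t io_in_progress;  /* held exclusively during I/O */
    } BufferDesc;

    #define NBUFFERS 128

    static BufferDesc buffers[NBUFFERS];    /* lock fields assumed initialized at startup */
    static pthread_rwlock_t BufMappingLock = PTHREAD_RWLOCK_INITIALIZER;

    /*
     * Look up a tag and return the buffer pinned, or NULL if not cached.
     * Share lock on BufMappingLock suffices for the lookup, but the pin
     * must be taken before that lock is released, or the buffer could be
     * recycled for another page underneath us.
     */
    static BufferDesc *
    lookup_and_pin(BufferTag tag)
    {
        BufferDesc *result = NULL;

        pthread_rwlock_rdlock(&BufMappingLock);
        for (int i = 0; i < NBUFFERS; i++)
        {
            if (buffers[i].tag.rel == tag.rel &&
                buffers[i].tag.block == tag.block)
            {
                pthread_mutex_lock(&buffers[i].hdr_lock);
                buffers[i].refcount++;      /* pin */
                pthread_mutex_unlock(&buffers[i].hdr_lock);
                result = &buffers[i];
                break;
            }
        }
        pthread_rwlock_unlock(&BufMappingLock);
        return result;
    }

Changing a buffer's tag, by contrast, would require taking BufMappingLock
exclusively, per the rule above.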

Buffer replacement strategy
---------------------------

There is a "free list" of buffers that are prime candidates for replacement.
In particular, buffers that are completely free (contain no valid page) are
always in this list. We may also throw buffers into this list if we
consider their pages unlikely to be needed soon. The list is singly-linked
using fields in the buffer headers; we maintain head and tail pointers in
global variables. (Note: although the list links are in the buffer headers,
they are considered to be protected by the BufFreelistLock, not the
buffer-header spinlocks.) To choose a victim buffer to recycle when there
are no free buffers available, we use a simple clock-sweep algorithm, which
avoids the need to take system-wide locks during common operations. It
works like this:

Each buffer header contains a "recently used" flag bit, which is set true
whenever the buffer is unpinned. (Setting this bit requires only the
buffer header spinlock, which would have to be taken anyway to decrement
the buffer reference count, so it's nearly free.)

The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
through all the available buffers. NextVictimBuffer is protected by the
BufFreelistLock.

The algorithm for a process that needs to obtain a victim buffer is:

1. Obtain BufFreelistLock.

2. If buffer free list is nonempty, remove its head buffer. If the buffer
is pinned or has its "recently used" bit set, it cannot be used; ignore
it and return to the start of step 2. Otherwise, pin the buffer,
release BufFreelistLock, and return the buffer.

3. Otherwise, select the buffer pointed to by NextVictimBuffer, and
circularly advance NextVictimBuffer for next time.

4. If the selected buffer is pinned or has its "recently used" bit set,
it cannot be used. Clear its "recently used" bit and return to step 3
to examine the next buffer.

5. Pin the selected buffer, release BufFreelistLock, and return the buffer.

(Note that if the selected buffer is dirty, we will have to write it out
before we recycle it; if someone else pins the buffer meanwhile we will
have to give up and try another buffer. This however is not a concern
of the basic select-a-victim-buffer algorithm.)

This scheme selects only victim buffers that have gone unused since they
were last passed over by the "clock hand".
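
[Here is a hedged C sketch of steps 1-5 above, again using pthread mutexes in
place of the BufFreelistLock and the buffer-header spinlocks. All declarations
are hypothetical; the point is only the order of the checks and lock releases.]

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct BufferDesc
    {
        int         refcount;       /* pin count */
        bool        recently_used;  /* set whenever the buffer is unpinned */
        struct BufferDesc *freeNext;/* free-list link, or NULL */
        pthread_mutex_t hdr_lock;   /* assumed initialized at startup */
    } BufferDesc;

    #define NBUFFERS 128

    static BufferDesc buffers[NBUFFERS];
    static BufferDesc *freelist_head;
    static BufferDesc *freelist_tail;
    static int NextVictimBuffer;
    static pthread_mutex_t BufFreelistLock = PTHREAD_MUTEX_INITIALIZER;

    static BufferDesc *
    get_victim_buffer(void)
    {
        BufferDesc *buf;

        pthread_mutex_lock(&BufFreelistLock);       /* step 1 */

        /* Step 2: try the free list first. */
        while ((buf = freelist_head) != NULL)
        {
            freelist_head = buf->freeNext;
            if (freelist_head == NULL)
                freelist_tail = NULL;
            pthread_mutex_lock(&buf->hdr_lock);
            if (buf->refcount == 0 && !buf->recently_used)
            {
                buf->refcount++;                    /* pin */
                pthread_mutex_unlock(&buf->hdr_lock);
                pthread_mutex_unlock(&BufFreelistLock);
                return buf;
            }
            pthread_mutex_unlock(&buf->hdr_lock);   /* unusable; try the next one */
        }

        /* Steps 3-5: advance the clock hand until a victim turns up. */
        for (;;)
        {
            buf = &buffers[NextVictimBuffer];
            NextVictimBuffer = (NextVictimBuffer + 1) % NBUFFERS;

            pthread_mutex_lock(&buf->hdr_lock);
            if (buf->refcount == 0 && !buf->recently_used)
            {
                buf->refcount++;                    /* pin */
                pthread_mutex_unlock(&buf->hdr_lock);
                pthread_mutex_unlock(&BufFreelistLock);
                return buf;
            }
            buf->recently_used = false;             /* it survives one more sweep */
            pthread_mutex_unlock(&buf->hdr_lock);
        }
    }

(If every buffer were pinned this loop would spin forever; the real code would
have to give up at some point, but that case is ignored in the sketch.)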

A special provision is that while running VACUUM, a backend does not set the
"recently used" bit on buffers it accesses. In fact, if ReleaseBuffer sees
that it is dropping the pin count to zero and the "recently used" bit is not
set, then it appends the buffer to the tail of the free list. (This implies
that VACUUM, but only VACUUM, must take the BufFreelistLock during
ReleaseBuffer; this shouldn't create much of a contention problem.) This
provision encourages VACUUM to work in a relatively small number of buffers
rather than blowing out the entire buffer cache. It is reasonable since a
page that has been touched only by VACUUM is unlikely to be needed again
soon.

Since VACUUM usually requests many pages very fast, the effect of this is that
it will get back the very buffers it filled and possibly modified on the next
call and will therefore do its work in a few shared memory buffers, while
being able to use whatever it finds in the cache already. This also implies
that most of the write traffic caused by a VACUUM will be done by the VACUUM
itself and not pushed off onto other processes.
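
[And a sketch of the ReleaseBuffer side of the VACUUM provision, continuing
the hypothetical declarations from the previous sketch (BufferDesc,
BufFreelistLock, and the free-list head/tail pointers); again this is only an
illustration of the rule, not the patch itself.]

    static void
    release_buffer(BufferDesc *buf, bool called_by_vacuum)
    {
        bool push_to_freelist = false;

        pthread_mutex_lock(&buf->hdr_lock);
        buf->refcount--;                    /* drop our pin */
        if (!called_by_vacuum)
            buf->recently_used = true;      /* normal backends mark the page as used */
        else if (buf->refcount == 0 && !buf->recently_used)
            push_to_freelist = true;        /* touched only by VACUUM: recycle it soon */
        pthread_mutex_unlock(&buf->hdr_lock);

        if (push_to_freelist)
        {
            /* Only VACUUM ever takes BufFreelistLock here, so contention is limited. */
            pthread_mutex_lock(&BufFreelistLock);
            buf->freeNext = NULL;
            if (freelist_tail != NULL)
                freelist_tail->freeNext = buf;
            else
                freelist_head = buf;
            freelist_tail = buf;
            pthread_mutex_unlock(&BufFreelistLock);
        }
    }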


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-13 23:40:16
Message-ID: 200502132340.j1DNeGR19532@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> I'm working on an experimental patch to break up the BufMgrLock along
> the lines we discussed a few days ago --- in particular, using a clock
> sweep algorithm instead of LRU lists for the buffer replacement strategy.
> I started by writing up some design notes, which are attached for
> review in case anyone has better ideas.
>
> One thing I realized quickly is that there is no natural way in a clock
> algorithm to discourage VACUUM from blowing out the cache. I came up
> with a slightly ugly idea that's described below. Can anyone do better?

Uh, is the clock algorithm also sequential-scan proof? Is that
something that needs to be done too?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-13 23:56:47
Message-ID: 18377.1108339007@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> One thing I realized quickly is that there is no natural way in a clock
>> algorithm to discourage VACUUM from blowing out the cache. I came up
>> with a slightly ugly idea that's described below. Can anyone do better?

> Uh, is the clock algorithm also sequential-scan proof? Is that
> something that needs to be done too?

If you can think of a way. I don't see any way to make the algorithm
itself scan-proof, but if we modified the bufmgr API to tell ReadBuffer
(or better ReleaseBuffer) that a request came from a seqscan, we could
do the same thing as for VACUUM. Whether that's good enough isn't
clear --- for one thing it would kick up the contention for the
BufFreelistLock, and for another it might mean *too* short a lifetime
for blocks fetched by seqscan.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-14 01:07:58
Message-ID: 200502140107.j1E17wv02048@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> One thing I realized quickly is that there is no natural way in a clock
> >> algorithm to discourage VACUUM from blowing out the cache. I came up
> >> with a slightly ugly idea that's described below. Can anyone do better?
>
> > Uh, is the clock algorithm also sequential-scan proof? Is that
> > something that needs to be done too?
>
> If you can think of a way. I don't see any way to make the algorithm
> itself scan-proof, but if we modified the bufmgr API to tell ReadBuffer
> (or better ReleaseBuffer) that a request came from a seqscan, we could
> do the same thing as for VACUUM. Whether that's good enough isn't
> clear --- for one thing it would kick up the contention for the
> BufFreelistLock, and for another it might mean *too* short a lifetime
> for blocks fetched by seqscan.

If I remember correctly, the ARC system was sequential-scan resistant
_only_ because it split the cache into two parts: the sequential scan
could wipe one part, while the other held frequently accessed
information and so was more immune. But the overhead of maintaining
that frequency-tracking part requires too much locking.

One interesting aspect is that we now have only three buffer access
patterns: index pages, random heap lookups via index or ctid, and heap
scans. It would be interesting to control the behavior of each one. For
example, if the sequential scan is larger than the buffer cache, is
there any reason to cache any of it? Should non-sequential scan pages
be kept longer? Can we record on the buffer page how the page was
initially or most recently accessed? The problem with the kernel cache
is that it doesn't know the application's access pattern, but we do, so it
seems we can use that information to improve the algorithm.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-14 05:11:03
Message-ID: 87bran680o.fsf@stark.xeocode.com
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

> Tom Lane wrote:
> >
> > One thing I realized quickly is that there is no natural way in a clock
> > algorithm to discourage VACUUM from blowing out the cache. I came up
> > with a slightly ugly idea that's described below. Can anyone do better?
>
> Uh, is the clock algorithm also sequential-scan proof? Is that
> something that needs to be done too?

I think the normal strategy is to make it *always* work the way you made
VACUUM work. That is, it should always enter newly loaded pages with the
"recently used" flag false.

It doesn't necessarily mean they get purged immediately on the next
flush; any other buffer that hasn't been accessed since it was loaded is
also a candidate. But if nothing else accesses the page before the clock
hand gets to it, then it is a candidate.

The only thing that scares me about this is that running a vacuum or
sequential scan could have too much of an effect on non-sequential accesses
like index scans if it forced the hand around so fast that the index scan
didn't have a chance to reuse pages.

--
greg


From: Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-14 08:45:22
Message-ID: mjqhdkfil7h.fsf@drones.CS.Berkeley.EDU
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> and changing the buf_table hash table. The only common
Tom> operation that needs exclusive lock is reading in a page that
Tom> was not in shared buffers already, which will require at
Tom> least a kernel call and usually a wait for I/O, so it will be
Tom> slow anyway.

Why not a separate lock per bucket chain in the hash table in addition
to the system-wide LWLock? It's not so much that such an operation will be
slow anyway but that such a slow operation will unnecessarily block
other operations.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-16 17:20:09
Message-ID: 20050216172009.GR52357@decibel.org
Lists: pgsql-hackers

On Sun, Feb 13, 2005 at 06:56:47PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> One thing I realized quickly is that there is no natural way in a clock
> >> algorithm to discourage VACUUM from blowing out the cache. I came up
> >> with a slightly ugly idea that's described below. Can anyone do better?
>
> > Uh, is the clock algorithm also sequential-scan proof? Is that
> > something that needs to be done too?
>
> If you can think of a way. I don't see any way to make the algorithm
> itself scan-proof, but if we modified the bufmgr API to tell ReadBuffer
> (or better ReleaseBuffer) that a request came from a seqscan, we could
> do the same thing as for VACUUM. Whether that's good enough isn't
> clear --- for one thing it would kick up the contention for the
> BufFreelistLock, and for another it might mean *too* short a lifetime
> for blocks fetched by seqscan.

Is there anything (in the buffer headers?) that keeps track of buffer
access frequency? *BSD uses a mechanism to track roughly how often a page
in memory has been accessed, and uses that to determine what pages to
free. In 4.3BSD, a simple two-hand clock sweep is used: the first hand
sets a not-used bit in each page; the second hand (which sweeps a fixed
distance behind the first) checks this bit and, if it is still set,
moves the page either to the inactive list if it's dirty, or to the
cache list if it's clean. There is also a free list, which is generally
fed by the cache and inactive lists.

PostgreSQL has a big advantage over an OS though, in that it can
tolerate much more overhead in buffer access code than an OS can in its
VM system. If I understand correctly, any use of a buffer at all means a
lock needs to be acquired on its buffer header. As part of this access,
a counter could be incremented with very little additional cost. A
background process would then sweep through 'active' buffers,
decrementing this counter by some amount. Any buffer that was
decremented below 0 would be considered inactive, and a candidate for
being freed. The advantage of using a counter instead of a simple active
bit is that buffers that are (or have been) used heavily will be able to
go through several sweeps of the clock before being freed. Infrequently
used buffers (such as those from a vacuum or seq. scan), would get
marked as inactive the first time they were hit by the clock hand.
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-16 17:33:38
Message-ID: 25460.1108575218@sss.pgh.pa.us
Lists: pgsql-hackers

"Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
> The advantage of using a counter instead of a simple active
> bit is that buffers that are (or have been) used heavily will be able to
> go through several sweeps of the clock before being freed. Infrequently
> used buffers (such as those from a vacuum or seq. scan), would get
> marked as inactive the first time they were hit by the clock hand.

Hmm. It would certainly be nearly as easy to adjust a counter as to
manipulate the RECENTLY_USED flag bit that's in the patch now. (You
could imagine the RECENTLY_USED flag bit as a counter with max value 1.)

What I'm envisioning is that pinning (actually unpinning) a buffer
increments the counter (up to some limit), and the clock sweep
decrements it (down to zero), and only buffers with count zero are taken
by the sweep for recycling. That could work well, but I think the limit
needs to be relatively small, else we could have the clock sweep having
to go around many times before it finally frees a buffer. Any thoughts
about that? Anyone seen any papers about this sort of algorithm?
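
[A tiny sketch of what that could look like, with a deliberately small limit;
locking is left out so only the counter arithmetic shows. The names and the
limit of 5 are made up for illustration, not taken from the patch.]

    #include <stdbool.h>

    #define USAGE_COUNT_LIMIT 5     /* arbitrary small cap, per the concern above */

    typedef struct
    {
        int refcount;               /* pin count */
        int usage_count;            /* 0 .. USAGE_COUNT_LIMIT */
    } BufferDesc;

    /* Unpinning bumps the counter, saturating at the limit. */
    static void
    unpin(BufferDesc *buf)
    {
        buf->refcount--;
        if (buf->usage_count < USAGE_COUNT_LIMIT)
            buf->usage_count++;
    }

    /* One clock-sweep visit: true means the buffer may be recycled now. */
    static bool
    sweep_visit(BufferDesc *buf)
    {
        if (buf->refcount == 0 && buf->usage_count == 0)
            return true;
        if (buf->usage_count > 0)
            buf->usage_count--;     /* each pass costs the buffer one "life" */
        return false;
    }

With a limit of K, a heavily used buffer can survive at most K consecutive
sweeps without being touched, which bounds how far the hand has to travel in
the worst case.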

regards, tom lane


From: Kenneth Marshall <ktm(at)it(dot)is(dot)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-16 17:42:11
Message-ID: 20050216174211.GE7721@it.is.rice.edu
Lists: pgsql-hackers

On Wed, Feb 16, 2005 at 12:33:38PM -0500, Tom Lane wrote:
> "Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
> > The advantage of using a counter instead of a simple active
> > bit is that buffers that are (or have been) used heavily will be able to
> > go through several sweeps of the clock before being freed. Infrequently
> > used buffers (such as those from a vacuum or seq. scan), would get
> > marked as inactive the first time they were hit by the clock hand.
>
> Hmm. It would certainly be nearly as easy to adjust a counter as to
> manipulate the RECENTLY_USED flag bit that's in the patch now. (You
> could imagine the RECENTLY_USED flag bit as a counter with max value 1.)
>
> What I'm envisioning is that pinning (actually unpinning) a buffer
> increments the counter (up to some limit), and the clock sweep
> decrements it (down to zero), and only buffers with count zero are taken
> by the sweep for recycling. That could work well, but I think the limit
> needs to be relatively small, else we could have the clock sweep having
> to go around many times before it finally frees a buffer. Any thoughts
> about that? Anyone seen any papers about this sort of algorithm?
>
I have seen this algorithm described as a more generalized clock type
algorithm. As the size of the counter increases, up to the number of
buffers, the clock algorithm becomes LRU. One bit is the lightest
weight approximation. Increasing the number of bits or a count makes
the clock algorithm more closely approximate LRU. You need to balance
how long it takes to find a free buffer. That time increases as the
count size increases.

Ken


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Kenneth Marshall <ktm(at)is(dot)rice(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-17 05:39:25
Message-ID: 20050217053925.GV52357@decibel.org
Lists: pgsql-hackers

On Wed, Feb 16, 2005 at 11:42:11AM -0600, Kenneth Marshall wrote:
> I have seen this algorithm described as a more generalized clock type
> algorithm. As the size of the counter increases, up to the number of
> buffers, the clock algorithm becomes LRU. One bit is the lightest
> weight approximation. Increasing the number of bits or a count makes
> the clock algorithm more closely approximate LRU. You need to balance
> how long it takes to find a free buffer. That time increases as the
> count size increases.

Yeah, the trick to this seems to be how you tweak the rate at which
stuff 'falls out' of the active list. I can think of 3 ways to
accomplish this:

1) Change the maximum value to count to (or the value at which a buffer
is considered no longer in use).

This has the disadvantage of changing how effective the count is. In an
extreme situation, it would reduce back to a single bit. It also won't
affect buffers that already have higher counts, meaning older data is
more likely to stay in the buffer cache than newer data.

2) Change the amount the counter is incremented by on each use (and/or
the amount it's decremented by).

An example of this might be having the clock decrement by 10. Under a
light to medium load, the system might increment by 10 on each use, but
if the system starts getting pressed for free buffers, that could be
reduced.

A downside of this would be that it potentially requires more space in
each header to store a larger value. An advantage is that it allows more
flexibility than #1. For example, if the decrement value is increased in
order to speed up reclaiming of buffers, it won't create a difference in
how buffers are weighted based on when they were accessed like #1 will.

3) Change the speed of the clock.

This is what BSD effectively does. The OS periodically checks to see how
many pages are available on the free list, as well as how many were
removed since the last check. This information is used to decide how
many pages the clock algorithm should attempt to free in the next time
period (which can be 0).

If a two-hand algorithm is used, the distance between the two hands can
also be varied.

I think #3 probably means you'd need a separate process to handle the
clock and move buffers to a free list. Or perhaps this gets tied in
with the background writer. This might mean more overhead, but it could
reduce contention if it means only one process needs to acquire some of
these locks.

So much for a simple design discussion. :) Fortunately, #1 and #2 should
be easy to test. #3 will certainly require more code, but it would
probably be simpler to implement than having multiple backends running
the clock algorithm (which I think is the current plan).

Something else I thought of: by using a counter instead of a bit, you
can also 'pre-seed' buffers based on why they were populated. For
example, pages brought in from an index might start with a value of 4;
heap pages, 3; heap pages from a seqscan, 2; and pages from vacuum, 1,
or maybe even 0.
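
[Sketching that idea: a hypothetical table of starting usage counts keyed by
how the page was brought in. The enum and the numbers are just the ones from
the example above, not anything that exists in the code.]

    typedef enum
    {
        ACCESS_INDEX,           /* index page */
        ACCESS_HEAP_RANDOM,     /* heap page fetched via index or ctid */
        ACCESS_HEAP_SEQSCAN,    /* heap page read by a sequential scan */
        ACCESS_VACUUM           /* page touched only by VACUUM */
    } BufferAccessKind;

    static int
    initial_usage_count(BufferAccessKind kind)
    {
        switch (kind)
        {
            case ACCESS_INDEX:          return 4;
            case ACCESS_HEAP_RANDOM:    return 3;
            case ACCESS_HEAP_SEQSCAN:   return 2;
            case ACCESS_VACUUM:         return 1;   /* or maybe even 0 */
        }
        return 1;                                   /* unreachable */
    }
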
--
Jim C. Nasby, Database Consultant decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


From: Richard Huxton <dev(at)archonet(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-17 08:14:08
Message-ID: 42145250.1050202@archonet.com
Lists: pgsql-hackers

Tom Lane wrote:
> "Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
>
>>The advantage of using a counter instead of a simple active
>>bit is that buffers that are (or have been) used heavily will be able to
>>go through several sweeps of the clock before being freed. Infrequently
>>used buffers (such as those from a vacuum or seq. scan), would get
>>marked as inactive the first time they were hit by the clock hand.

> What I'm envisioning is that pinning (actually unpinning) a buffer
> increments the counter (up to some limit), and the clock sweep
> decrements it (down to zero), and only buffers with count zero are taken
> by the sweep for recycling.

Would there be any value in incrementing by 2 for index accesses and 1
for seq-scans/vacuums? Actually, it should probably be a ratio based on
random_page_cost, shouldn't it?

--
Richard Huxton
Archonet Ltd


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-21 04:19:00
Message-ID: 200502210419.j1L4J0613918@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> "Jim C. Nasby" <decibel(at)decibel(dot)org> writes:
> > The advantage of using a counter instead of a simple active
> > bit is that buffers that are (or have been) used heavily will be able to
> > go through several sweeps of the clock before being freed. Infrequently
> > used buffers (such as those from a vacuum or seq. scan), would get
> > marked as inactive the first time they were hit by the clock hand.
>
> Hmm. It would certainly be nearly as easy to adjust a counter as to
> manipulate the RECENTLY_USED flag bit that's in the patch now. (You
> could imagine the RECENTLY_USED flag bit as a counter with max value 1.)
>
> What I'm envisioning is that pinning (actually unpinning) a buffer
> increments the counter (up to some limit), and the clock sweep
> decrements it (down to zero), and only buffers with count zero are taken
> by the sweep for recycling. That could work well, but I think the limit
> needs to be relatively small, else we could have the clock sweep having
> to go around many times before it finally frees a buffer. Any thoughts
> about that? Anyone seen any papers about this sort of algorithm?

One idea would be for the clock to check X% of the buffer cache and just
recycle the page it saw with the lowest usage count.
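
[As a sketch, that might look something like the following; all names are
hypothetical, and pinning/locking checks are left out.]

    #define NBUFFERS  1024
    #define SCAN_PCT  10            /* the "X%" of the cache examined per attempt */

    typedef struct
    {
        int refcount;
        int usage_count;
    } BufferDesc;

    static BufferDesc buffers[NBUFFERS];
    static int NextVictimBuffer;

    static BufferDesc *
    victim_from_window(void)
    {
        int         window = NBUFFERS * SCAN_PCT / 100;
        BufferDesc *best = NULL;

        for (int i = 0; i < window; i++)
        {
            BufferDesc *buf = &buffers[NextVictimBuffer];

            NextVictimBuffer = (NextVictimBuffer + 1) % NBUFFERS;
            if (buf->refcount == 0 &&
                (best == NULL || buf->usage_count < best->usage_count))
                best = buf;
        }
        return best;                /* NULL if every buffer in the window was pinned */
    }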

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-21 22:35:09
Message-ID: 1109025309.3801.133.camel@localhost.localdomain
Lists: pgsql-hackers

My understanding from this is:
If we have a buffer cache hit ratio of 93%, then we should expect:
- 93% of buffer requests to require only shared BufMappingLocks
- 7% of buffer requests to require an exclusive BufFreelistLock and then
an exclusive BufMappingLock.

It seems the improvement comes from allowing multiple simultaneous
successful cache hits, which is very welcome.

I also like the simplicity with which the bgwriter will be able to
easily stay ahead of the clock sweep, writing out dirty buffers, without
taking exclusive system-wide locks. This would produce a
StrategyDirtyBufferList design that is not dependent upon the size of
shared_buffers, further improving the efficacy of this new design since
it will allow us to further increase shared_buffers.

ISTM this new design will increase scalability directly in line with the
cache hit ratio, but would still suffer from poor scalability for cache
misses. That concerns me, since a large table scan would require
all-exclusive locks to complete its scan, since it will typically be 95%+
cache misses. That would mean that well-tuned OLTP applications could be
more scalable, but DW or mixed applications would not be. Servers with
few CPUs would not see this difference in as marked a way as higher-end
servers.

The cache would also be spoiled by scans, though I think we can handle
those the same way as Vacuum.

This design seems to be a clear improvement on the current design. I
would still encourage subdividing the freelist structures into many
smaller pieces, thereby producing finer-grained locks (the earlier
bufferpools proposal). This could be implemented as an additional
feature on top of this patch, or as an alternate design on cvstip.

[It might be worth having separate bufferpools for indexes and heap
blocks, so that seq scans never affect the index cache.]

Whatever is done from here, I think it is certain that we can improve
things by providing hints from the higher code layers down to the buffer
management layer, as everybody keeps suggesting for Vacuum.

[I'm assuming that there are no system-wide locks held across I/Os, that
bit seems a bit unclear from the description]

Best Regards, Simon Riggs



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-21 23:01:18
Message-ID: 3388.1109026878@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> [I'm assuming that there are no system-wide locks held across I/Os, that
> bit seems a bit unclear from the description]

That's always been true and still is, so I didn't dwell on it. Only a
per-buffer lock is held while doing either input or output.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-21 23:07:29
Message-ID: 1109027249.3801.149.camel@localhost.localdomain
Lists: pgsql-hackers

On Mon, 2005-02-21 at 18:01 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > [I'm assuming that there are no system-wide locks held across I/Os, that
> > bit seems a bit unclear from the description]
>
> That's always been true and still is, so I didn't dwell on it. Only a
> per-buffer lock is held while doing either input or output.

[Me too, that's why it's in brackets at the bottom.]

...but do you agree with my comments on the lack of scalability in cache
miss situations?

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-21 23:45:37
Message-ID: 3635.1109029537@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> ...but do you agree with my comments on the lack of scalability in cache
> miss situations?

No. Grabbing a lock during a cache miss is the least of your worries;
you're going to do I/O, or at least a kernel call, so it hardly matters
as long as you're not holding the lock for a time that's long in
comparison to that overhead.

The only test case I've seen that exposes a significant amount of bufmgr
contention is one that involves zero I/O (100% cache hit rate), so that
the fraction of time spent holding the BufMgrLock is a significant part
of the total time. As soon as you move off 100%, the bufmgr isn't the
critical path anymore. So I think the fact that this redesign is able
to reduce the contention at all in that case is just gravy. (It does
reduce contention because ReleaseBuffer doesn't take a global lock
anymore, and because BufMappingLock and BufFreelistLock are separate
locks.)

If testing shows that we still have contention issues with this design
then we can try subdividing the BufFreelistLock --- but right now my
guess is that we'd just be giving up more cache management efficiency
in return for not much.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-22 01:12:51
Message-ID: 4201.1109034771@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> This design seems to be a clear improvement on the current design. I am
> still encouraged that the freelist structures should be subdivided into
> many smaller pieces, thereby producing finer grained locks (the earlier
> bufferpools proposal).

As I already said, I'm dubious about that idea because of the consequent
reduction of cache management efficiency (since any particular page has
to fight for survival in a smaller pool). It occurs to me however that
we could split up the BufMappingLock in a similar fashion at minimal
efficiency penalty.

The idea is to replace the global tag->buffer hash table by 2^N separate
tables; you determine which one to use based on the low-order N bits of
the hash code for the buffer tag, which you always know when accessing
these tables. Then give each of these tables its own lock. Essentially
this subdivides the buffer tag space into 2^N independent slices.

This does not quite work perfectly; the tricky part comes when
reclaiming a buffer for use as another page. In the patch as it stands,
once we've written out the prior buffer contents we can atomically
check for a conflict and reassign the buffer because we need only the
one BufMapping lock to do it. But with this idea the old and new
associations might belong to different tables. I think the logic would
have to be:

    lock old mapping table for buffer;
    check buffer's not dirty (if so, unlock and start over);
    remove mapping from old table;
    unlock old table;
    // at this point we have a pin on a completely unassigned buffer
    lock new mapping table for buffer;
    check for conflict against someone having already made the same entry;
    if found, unlock, put buffer in freelist, use other buffer;
    insert mapping into new table;
    unlock new table;

This costs us an extra lock/unlock cycle, plus in case of a conflict
we end up having unnecessarily evicted a page from cache. But conflicts
should be pretty rare, so I think the penalty isn't that great.
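
[A hedged sketch of the partitioning itself: the hash function, the partition
count, and all the names are invented for illustration, the per-partition hash
tables are elided, and pthread rwlocks again stand in for LWLocks, assumed
initialized at startup.]

    #include <pthread.h>

    #define MAPPING_PARTITION_BITS 4
    #define NUM_MAPPING_PARTITIONS (1 << MAPPING_PARTITION_BITS)   /* the 2^N tables */

    typedef struct
    {
        unsigned    rel;
        unsigned    block;
    } BufferTag;

    typedef struct
    {
        pthread_rwlock_t lock;      /* per-partition "BufMappingLock" */
        /* ... a per-partition tag -> buffer hash table would live here ... */
    } MappingPartition;

    static MappingPartition partitions[NUM_MAPPING_PARTITIONS];

    /* Any hash of the tag works, as long as everyone computes the same one. */
    static unsigned
    buftag_hash(BufferTag tag)
    {
        return (tag.rel * 2654435761u) ^ (tag.block * 2246822519u);
    }

    /*
     * The low-order N bits of the hash pick the partition, and therefore the
     * lock, so a lookup never needs more than one partition lock.  Reassigning
     * a buffer from one tag to another may touch two different partitions,
     * which is what forces the two-step dance shown above.
     */
    static MappingPartition *
    partition_for(BufferTag tag)
    {
        return &partitions[buftag_hash(tag) & (NUM_MAPPING_PARTITIONS - 1)];
    }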

I don't currently believe that we need this extra complexity, but I
thought I'd get the idea into the archives in case it does turn out
to be useful later.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Design notes for BufMgrLock rewrite
Date: 2005-02-22 21:20:24
Message-ID: 1109107224.3801.180.camel@localhost.localdomain
Lists: pgsql-hackers

On Mon, 2005-02-21 at 18:45 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > ...but do you agree with my comments on the lack of scalability in cache
> > miss situations?
>
> No. Grabbing a lock during a cache miss is the least of your worries;
> you're going to do I/O, or at least a kernel call, so it hardly matters
> as long as you're not holding the lock for a time that's long in
> comparison to that overhead.

The I/O does alleviate contention to a certain extent, but if you have a
well-laid-out system that can soak up the I/O you're throwing at it AND you
have multiple CPUs trying to get at blocks, then you have contention.

The other problem is the OS cache. A PostgreSQL cache miss isn't
necessarily an I/O. If PostgreSQL more easily supported very large
shared_buffers then I would be more in agreement.

> The only test case I've seen that exposes a significant amount of bufmgr
> contention is one that involves zero I/O (100% cache hit rate), so that
> the fraction of time spent holding the BufMgrLock is a significant part
> of the total time. As soon as you move off 100%, the bufmgr isn't the
> critical path anymore. So I think the fact that this redesign is able
> to reduce the contention at all in that case is just gravy. (It does
> reduce contention because ReleaseBuffer doesn't take a global lock
> anymore, and because BufMappingLock and BufFreelistLock are separate
> locks.)

Let's talk about Mark's TPC-C-like tests. As soon as the cache is full,
the response times go to hell. (see
http://www.osdl.org/projects/dbt2dev/results/dev4-010/264/)
Once the cache is full, each dirty cache miss costs two BufMgrLock
calls. On larger caches, very roughly 80% of the cache is dirty, so the
overall rise in contention is around 1.6 times what it was before. I see
that as a possible indicator of the effects of BufMgrLock contention.

> (It does
> reduce contention because ReleaseBuffer doesn't take a global lock
> anymore, and because BufMappingLock and BufFreelistLock are separate
> locks.)

Yes, understood.

> If testing shows that we still have contention issues with this design
> then we can try subdividing the BufFreelistLock --- but right now my
> guess is that we'd just be giving up more cache management efficiency
> in return for not much.

OK to that.

[and please remember, all, that I'm discussing the very highest end of
performance architecture...]

Best Regards, Simon Riggs