RFC: bufmgr locking changes

From: Neil Conway <neilc(at)samurai(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RFC: bufmgr locking changes
Date: 2004-01-07 22:37:09
Message-ID: 87k7431j3e.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've attached a (gzip'ed) patch that makes the following changes to
the buffer manager:

(1) Overhaul locking; whenever the code needs to modify the state
of an individual buffer, do synchronization via a per-buffer
"meta data lock" rather than the global BufMgrLock. For more
info on the motivation for this, see the several recent
-hackers threads on the subject.

(2) Reduce superfluous dirtying of pages: previously,
LockBuffer(buf, LW_EXCLUSIVE) would automatically dirty the
buffer. This behavior has been removed, to reduce the number
of pages that are dirtied. For more info on the motivation for
this, see the previous -hackers thread on the topic.

(3) Simplify the code in a bunch of places

This basically involved rewriting bufmgr.c (which makes the patch a
little illegible; it might be easier to just apply it and read through
the new code). That also means there are still a fair number of bugs
and unresolved issues.

The new locking works as follows:

- modifying the shared buffer table (buf_table.c) or making calls
into the freelist code (freelist.c) still requires holding the
BufMgrLock. There was some talk of trying to add more granular
locking to freelist.c itself: if there is still significant
contention for the BufMgrLock after these changes have been
applied, I'll take a look at doing that

- to modify a BufferDesc's meta data (shared refcount, flags, tag,
etc.), one must hold the buffer's "meta data lock". Also, I
removed the existing "io_in_progress" lock; instead, we now hold
the buffer's meta data lock while doing buffer I/O.

- if a backend holds the BufMgrLock, it will never try to
LWLockAcquire() a per-buffer meta data lock, due to the risk of
deadlock (and the loss in concurrency if we got blocked waiting
while still holding the BufMgrLock). Instead it does a
LWLockConditionalAcquire() and handles an acquisition failure
sanely

- if a backend holds the meta data lock for a buffer, it CAN
attempt to LWLockAcquire() the BufMgrLock. This is safe from
deadlocks, due to the previous paragraph.

The code is still a work-in-progress (running `pgbench -s 30 -c 20 -t
1000` bails out pretty quickly, for example), but any comments on
whether this scheme is in-theory correct would be very welcome.

For #2, my understanding of the existing XLOG code is incomplete, so
perhaps someone can tell me if I'm on the right track. I've modified a
few of the XLogInsert() call sites so that they now:

- acquire an exclusive buffer cntx_lock

- modify the content of the page/buffer

- WriteNoReleaseBuffer() to mark the buffer as dirty; since we
still hold the buffer's cntx_lock, a checkpoint process can't
write this page to disk yet. This replaces the implicit "mark
this page as dirty" behavior of LockBuffer()

- do the XLogInsert()

- release the cntx_lock

- ReleaseBuffer() to decrement the buffer's pincount

For example, sequence.c now works like this (I've probably missed a
few places) -- is this correct, and/or is there a better way to do it?

Notes, and caveats:

- Remove SetBufferCommitInfoNeedsSave(). AFAICS, this is now
completely equivalent to WriteNoReleaseBuffer(), so I just removed
the former and replaced all the calls to it with calls to the later.

- Moved the code for flushing a dirty buffer to disk to buf_io.c

- Moved UnpinBuffer() and PinBuffer() to bufmgr.c, from freelist.c

- Removed the following now-unused buffer flag bits: BM_VALID,
BM_DELETED, BM_JUST_DIRTIED, BM_IO_IN_PROGRESS, and shrunk the
'flags' field of the BufferDesc down to 8 bits (from 16)

- Removed the cntxDirty field from the BufferDesc: now that we don't
need to acquire the BufMgrLock to modify the buffer's flags, there's
no reason to keep this around

- Make 'PrivateRefCount' an array of uint16s, rather than longs. This
saves 2 bits * shared_buffers per backend on 32-bit machines and 6
bits * shared_buffers per backend on some 64-bit machines. It means
a given backend can only pin a single buffer 65,636 times, but that
should be more than enough. Similarly, made LocalRefCount an array
of uint16s.

I was thinking of adding overflow checking to the refcount increment
code to make sure we fail safely if a backend *does* try to exceed
this number of pins, but I can't imagine a scenario when it would
actually occur, so I haven't bothered.

- Remove the BufferLocks array. AFAICS this wasn't actually necessary.

- A few loose ends in the code still need to be wrapped up (for
example, I need to take another glance at the pin-waiter stuff, and
the error handling still needs some more work), but I think most of
the functionality is there. Areas of concern are denoted by 'XXX'
comments.

Any comments?

-Neil

Attachment Content-Type Size
buffer-locking-45.patch.gz application/octet-stream 31.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kurt Roeckx 2004-01-08 01:05:18 Re: RFC: bufmgr locking changes
Previous Message Alex Satrapa 2004-01-07 22:22:06 [OT]"Copyright infringement" vs "piracy" (was Re: Paypal)