Re: Limit of bgwriter_lru_maxpages of max. 1000?

From: Gerhard Wiesinger <lists(at)wiesinger(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Limit of bgwriter_lru_maxpages of max. 1000?
Date: 2009-10-04 20:19:21
Message-ID: alpine.LFD.2.00.0910042153050.6653@bbs.intern
Lists: pgsql-general

On Fri, 2 Oct 2009, Greg Smith wrote:

> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>
>> I found that lowering checkpoint completion target was what helped.
>> Does that seem counter-intuitive to you?
>

I set it to 0.0 now.

> Generally, but there are plenty of ways you can get into a state where a
> short but not immediate checkpoint is better. For example, consider a case
> where your buffer cache is filled with really random stuff. There's a
> sorting horizon in effect, where your OS and/or controller makes decisions
> about what order to write things based on the data it already has around, not
> really knowing what's coming in the near future.
>

OK, if the checkpoint doesn't block anything during normal operation, then the
time it takes doesn't really matter.

> Let's say you've got 256MB of cache in the disk controller, you have 1GB of
> buffer cache to write out, and there's 8GB of RAM in the server so it can
> cache the whole write. If you wrote it out in a big burst, the OS would
> elevator sort things and feed them to the controller in disk order. Very
> efficient, one pass over the disk to write everything out.
>
> But if you broke that up into 256MB write pieces instead on the database
> side, pausing after each chunk was written, the OS would only be sorting
> across 256MB at a time, and would basically fill the controller cache up with
> that before it saw the larger picture. The disk controller can end up making
> seek decisions with that small of a planning window now that are not really
> optimal, making more passes over the disk to write the same data out. If the
> timing between the DB write cache and the OS is pathologically out of sync
> here, the result can end up being slower than had you just written out in
> bigger chunks instead. This is one reason I'd like to see fsync calls happen
> earlier and more evenly than they do now, to reduce these edge cases.
>
> The usual approach I take in this situation is to reduce the amount of write
> caching the OS does, so at least things get more predictable. A giant write
> cache always gives the best average performance, but the worst-case behavior
> increases at the same time.
>
> There was a patch floating around at one point that sorted all the checkpoint
> writes by block order, which would reduce how likely it is you'll end up in
> one of these odd cases. That turned out to be hard to nail down the benefit
> of though, because in a typical case the OS caching here trumps any I/O
> scheduling you try to do in user land, and it's hard to repeatably generate
> scattered data in a benchmark situation.
>
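
For illustration, a minimal sketch of that "earlier and more even fsync" idea
(my own Linux-specific toy code, not from PostgreSQL; the file name and sizes
are invented): it writes a large buffer in 256MB chunks, matching the example
above, and pushes each chunk toward the disk early with sync_file_range():

#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (256 * 1024 * 1024L)   /* the 256MB controller cache above */
#define TOTAL (1024 * 1024 * 1024L)  /* the 1GB of dirty buffers above   */

int main(void)
{
    int   fd  = open("/tmp/spread-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char *buf = malloc(CHUNK);

    if (fd < 0 || buf == NULL)
        return 1;
    memset(buf, 'x', CHUNK);

    for (off_t off = 0; off < TOTAL; off += CHUNK)
    {
        write(fd, buf, CHUNK);
        /* Start writeback of this chunk now instead of letting 1GB of
         * dirty pages pile up for one giant burst at fsync time. */
        sync_file_range(fd, off, CHUNK, SYNC_FILE_RANGE_WRITE);
    }
    fsync(fd);   /* final durability point, like the checkpoint fsync */
    close(fd);
    free(buf);
    return 0;
}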

OK, with a basic insert test and a SystemTap script
(http://www.wiesinger.com/opensource/systemtap/postgresql-checkpoint.stp),
the checkpoint still shows up as a major I/O spike:

################################################################################
Buffers between : Sun Oct 4 18:29:50 2009, synced 55855 buffer(s), flushed 744 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:29:50 2009
Checkpoint end : Sun Oct 4 18:29:56 2009, synced 12031 buffer(s), flushed 12031 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:20 2009, synced 79000 buffer(s), flushed 0 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:30:20 2009
Checkpoint end : Sun Oct 4 18:30:26 2009, synced 10753 buffer(s), flushed 10753 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:50 2009, synced 51120 buffer(s), flushed 1007 buffer(s) between checkpoint
Checkpoint start: Sun Oct 4 18:30:50 2009
Checkpoint end : Sun Oct 4 18:30:56 2009, synced 11899 buffer(s), flushed 11912 buffer(s)
################################################################################

OK, I then had a further look at the code to understand the behavior of the
buffer cache and the background writer, since that behavior didn't seem
logical to me.

As far as I can see, the basic algorithm is (a toy model follows below):
1.) Normally (outside checkpoints) only dirty and not-recently-used pages
(usage_count == 0) are written to disk. I think that's basically fine as a
strategy, since indexes may update blocks more than once. It's also OK that
blocks are only written and not fsynced (that will be done at checkpoint
time).
2.) At checkpoint time, write out all dirty buffers, and fsync everything
previously written as well as the newly written blocks. Spreading the I/O
also seems OK to me now.
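
To check my understanding, here is a toy C model of these two paths (my own
simplification, not the actual 8.3 code):

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
    bool dirty;
    int  usage_count;   /* bumped on each pin, decays via the clock sweep */
} Buffer;

static Buffer buf[NBUFFERS];

/* Background-writer path: only dirty, not-recently-used buffers. */
static void bgwriter_pass(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buf[i].dirty && buf[i].usage_count == 0)
        {
            printf("bgwriter: write buffer %d (no fsync yet)\n", i);
            buf[i].dirty = false;
        }
}

/* Checkpoint path: write every dirty buffer, then fsync once. */
static void checkpoint(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        if (buf[i].dirty)
        {
            printf("checkpoint: write buffer %d\n", i);
            buf[i].dirty = false;
        }
    printf("checkpoint: fsync all files written since the last checkpoint\n");
}

int main(void)
{
    buf[0] = (Buffer){ true, 0 };   /* cold dirty buffer: bgwriter takes it  */
    buf[1] = (Buffer){ true, 2 };   /* hot dirty buffer: left for checkpoint */
    bgwriter_pass();
    checkpoint();
    return 0;
}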

BUT: I think I've found two major bugs in the implementation (or I have
misunderstood something). The codebase analyzed is 8.3.8, since that's the
version I currently use.

##############################################
Bug 1: usage_count is IMHO not handled consistently
##############################################
I think this has been introduced with:
http://git.postgresql.org/gitweb?p=postgresql.git;a=blobdiff;f=src/backend/storage/buffer/bufmgr.c;h=6e6b862273afea40241e410e18fd5d740c2b1643;hp=97f7822077de683989a064cdc624a025f85e54ab;hb=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0;hpb=98ffa4e9bd75c8124378c712933bb13d2697b694

So either the usage_count = 1 initialization in BufferAlloc() is not correct,
or SyncOneBuffer() with skip_recently_used and usage_count = 1 is not correct:

    if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
        result |= BUF_REUSABLE;
    else if (skip_recently_used)
    {
        /* Caller told us not to write recently-used buffers */
        UnlockBufHdr(bufHdr);
        return result;
    }
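
To spell out why this looks inconsistent to me: a freshly allocated buffer
starts with usage_count = 1, so even if it is dirtied once and never touched
again, the skip_recently_used path above skips it until something else
decrements the count. A toy trace (again my own model, not the real code):

#include <stdbool.h>
#include <stdio.h>

typedef struct { bool dirty; int usage_count; } Buffer;

/* Mirrors the SyncOneBuffer() logic quoted above. */
static void sync_one_buffer(Buffer *b, bool skip_recently_used)
{
    if (b->usage_count > 0 && skip_recently_used)
    {
        printf("skipped (usage_count=%d)\n", b->usage_count);
        return;
    }
    if (b->dirty)
    {
        printf("written\n");
        b->dirty = false;
    }
}

int main(void)
{
    Buffer b = { .dirty = true, .usage_count = 1 };  /* BufferAlloc() init */

    sync_one_buffer(&b, true);   /* bgwriter pass 1: skipped          */
    b.usage_count--;             /* clock sweep eventually decrements */
    sync_one_buffer(&b, true);   /* bgwriter pass 2: finally written  */
    return 0;
}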

##############################################
Bug 2: Double iteration over the buffers
##############################################
As you can see in the call tree below, the buffers are iterated over twice.
This might be a major performance bottleneck.

// Checkpoint buffer sync
BufferSync()
    loop buffers:
        SyncOneBuffer()         // skip_recently_used=false
        CheckpointWriteDelay()  // Bug here? BgBufferSync() is called, which
                                // iterates over the buffers again!!

CheckpointWriteDelay()
    if (IsCheckpointOnSchedule())
    {
        BgBufferSync()
        CheckArchiveTimeout()
        BgWriterNap()
    }

BgBufferSync()
    loop buffers:
        SyncOneBuffer()         // skip_recently_used=true; OK here, since we
                                // don't want to flush recently used blocks
                                // (e.g. indexes). But an improvement (e.g.
                                // aging) is IMHO necessary.
##############################################
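
A back-of-the-envelope model of the extra work (all numbers invented: assume
the checkpoint loop naps every delay_every buffers and each nap triggers one
BgBufferSync() scan of up to bgwriter_lru_maxpages buffers):

#include <stdio.h>

int main(void)
{
    long nbuffers     = 131072;  /* shared_buffers = 1GB at 8kB pages */
    long delay_every  = 1000;    /* assumed checkpoint nap interval   */
    long lru_maxpages = 1000;    /* bgwriter_lru_maxpages             */

    long checkpoint_scans = nbuffers;
    long bgwriter_scans   = (nbuffers / delay_every) * lru_maxpages;

    printf("buffers scanned by the checkpoint loop: %ld\n", checkpoint_scans);
    printf("extra buffers scanned via BgBufferSync: %ld\n", bgwriter_scans);
    printf("total: %ld (%.1fx the buffer count)\n",
           checkpoint_scans + bgwriter_scans,
           (double) (checkpoint_scans + bgwriter_scans) / nbuffers);
    return 0;
}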

BTW: Are there any measurements available of how fast a buffer cache hit is
versus a disk cache hit (i.e. a block not in the buffer cache but in the OS
cache)? I ask because a lot of locking is involved in the code.
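
For the OS-cache side at least, a crude timer like this (my own sketch; the
file path is invented, and the file is assumed to already be in the OS page
cache) should give a ballpark per-hit cost:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 100000

int main(void)
{
    char            page[8192];
    struct timespec t0, t1;
    int             fd = open("/tmp/cached-file.dat", O_RDONLY);

    if (fd < 0)
        return 1;
    pread(fd, page, sizeof(page), 0);        /* warm the OS page cache */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        pread(fd, page, sizeof(page), 0);    /* repeated OS-cache hits */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg OS-cache hit: %.0f ns per 8kB pread\n", ns / ITERATIONS);
    close(fd);
    return 0;
}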

BTW2: Oracle's buffer cache and database writer strategy are also
interesting:
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/process.htm#i7259
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/memory.htm#i10221

Thanks for any feedback.

Ciao,
Gerhard

--
http://www.wiesinger.com/

-----------------------------------
src/backend/postmaster/bgwriter.c
-----------------------------------
BackgroundWriterMain()
    loop forever:
        timeout:
            CreateCheckPoint()  // NON_IMMEDIATE
            smgrcloseall()
        nontimeout:
            BgBufferSync()
        sleep
    // Rest is done in XLogWrite()

RequestCheckpoint()
    CreateCheckPoint() or signal through shared memory segment
    smgrcloseall()

CheckpointWriteDelay()
    if (IsCheckpointOnSchedule())
    {
        BgBufferSync()
        CheckArchiveTimeout()
        BgWriterNap()
    }

-----------------------------------
src/backend/commands/dbcommands.c
-----------------------------------
createdb()
    RequestCheckpoint()  // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT

dropdb()
    RequestCheckpoint()  // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
-----------------------------------
src/backend/commands/tablespace.c
-----------------------------------
DropTableSpace()
    RequestCheckpoint()

-----------------------------------
src/backend/tcop/utility.c
-----------------------------------
ProcessUtility()
    // Command: CHECKPOINT;
    RequestCheckpoint()  // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT

-----------------------------------
src/backend/access/transam/xlog.c
-----------------------------------
CreateCheckPoint()
    CheckPointGuts()
        CheckPointCLOG()
        CheckPointSUBTRANS()
        CheckPointMultiXact()
        CheckPointBuffers()  /* performs all required fsyncs */
        CheckPointTwoPhase()

XLogWrite()
    too_much_transaction_log_consumed:
        RequestCheckpoint()  // NON_IMMEDIATE

pg_start_backup()
    RequestCheckpoint()  // CHECKPOINT_FORCE | CHECKPOINT_WAIT

XLogFlush()
    // Flush transaction log

-----------------------------------
src/backend/storage/buffer/bufmgr.c
-----------------------------------
CheckPointBuffers()
    BufferSync()
    smgrsync()

// Checkpoint buffer sync
BufferSync()
    loop buffers:
        SyncOneBuffer()         // skip_recently_used=false
        CheckpointWriteDelay()  // Bug here? BgBufferSync() is called, which
                                // iterates over the buffers again!!

// Background writer buffer sync
BgBufferSync()
    loop buffers:
        SyncOneBuffer()         // skip_recently_used=true; OK here, since we
                                // don't want to flush recently used blocks
                                // (e.g. indexes). But an improvement (e.g.
                                // aging) is IMHO necessary.

SyncOneBuffer()  // Problem with skip_recently_used and usage_count=1 (not written!)
    FlushBuffer()

FlushBuffer()
    XLogFlush()
    smgrwrite()

BufferAlloc()  // Init with usage_count=1 is not logical => will never be
               // written by the bgwriter!
    PinBuffer()

PinBuffer()
    usage_count++;

-----------------------------------
src/backend/storage/buffer/localbuf.c
-----------------------------------
LocalBufferAlloc()
    usage_count++;

-----------------------------------
src/backend/storage/smgr/md.c
-----------------------------------
smgrwrite() = mdwrite()
    => writes the file (not flushed immediately), but registers it via
       register_dirty_segment() for later fsyncing at checkpoint time

smgrsync() = mdsync()
    => fsyncs the registered, not-yet-flushed files
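
In other words, the pattern is roughly this (my own minimal sketch of the
idea, not the md.c code; the pending list stands in for the pendingOpsTable):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int pending_fds[1024];   /* files written since the last checkpoint */
static int npending;

/* smgrwrite()/mdwrite(): hand the data to the OS, remember the file. */
static void md_write(int fd, const void *page, size_t len, off_t off)
{
    pwrite(fd, page, len, off);     /* goes to the OS cache, not yet durable */
    pending_fds[npending++] = fd;   /* like register_dirty_segment()         */
}

/* smgrsync()/mdsync(): at checkpoint time, make it all durable. */
static void md_sync(void)
{
    for (int i = 0; i < npending; i++)
        fsync(pending_fds[i]);
    npending = 0;
}

int main(void)
{
    char page[8192];
    int  fd = open("/tmp/segment-demo.dat", O_WRONLY | O_CREAT, 0644);

    if (fd < 0)
        return 1;
    memset(page, 0, sizeof(page));
    md_write(fd, page, sizeof(page), 0);   /* normal operation                */
    md_sync();                             /* checkpoint: fsync what we wrote */
    close(fd);
    return 0;
}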
