Load Distributed Checkpoints, revised patch

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Load Distributed Checkpoints, revised patch
Date: 2007-06-15 10:34:22
Message-ID: 46726B2E.7060606@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Here's an updated WIP version of the LDC patch. It just spreads the
writes, which achieves the goal of smoothing the checkpoint I/O spikes. I
think sorting the writes etc. is interesting but falls in the category
of further development and should be pushed to 8.4.

The documentation changes are not complete, GUC variables need
descriptions, and some of the DEBUG elogs will go away in favor of the
separate checkpoint logging patch that's in the queue. I'm fairly happy
with the code now, but there are a few minor open issues:

- What units should we use for the new GUC variables? From an
implementation point of view, it would be simplest if
checkpoint_write_rate is given as pages/bgwriter_delay, similarly to
bgwriter_*_maxpages. I never liked those *_maxpages settings, though; a
more natural unit from the user's perspective would be KB/s.

- The signaling between RequestCheckpoint and bgwriter is a bit tricky.
Bgwriter now needs to deal with immediate checkpoint requests, like those
coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
from those triggered by checkpoint_segments. I'm afraid there might be
race conditions when a CHECKPOINT is issued at the same instant as
checkpoint_segments triggers one. What might happen then is that the
checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
command has to wait for that to finish, which might take a long time. I
have not been able to convince myself either that the race condition
exists or that it doesn't.
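
To make the units question above concrete, here is a minimal sketch of how a KB/s setting could be translated into pages per bgwriter round. It is an illustration only, not code from the patch: the function name kbps_to_pages_per_round is hypothetical, and the 8 KB page size is an assumption (the default block size).

```c
#include <assert.h>

/* Assumption: default 8 KB block size. */
#define PAGE_SIZE_KB 8

/*
 * Hypothetical helper, not part of the patch: convert a write rate given
 * in KB/s into a number of pages to write per bgwriter round, where a
 * round lasts bgwriter_delay_ms milliseconds.
 */
int
kbps_to_pages_per_round(int rate_kbps, int bgwriter_delay_ms)
{
    long long pages;

    assert(rate_kbps > 0);
    assert(bgwriter_delay_ms > 0);

    /* KB per round = rate * delay / 1000; pages = KB / page size. */
    pages = ((long long) rate_kbps * bgwriter_delay_ms)
            / (1000LL * PAGE_SIZE_KB);

    /* Always write at least one page so the checkpoint makes progress. */
    return pages > 0 ? (int) pages : 1;
}
```

With a 200 ms delay, a 1024 KB/s setting works out to 25 pages per round under these assumptions. The rounding down also shows one drawback of the conversion: low rates combined with short delays collapse to the one-page minimum.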

A few notes about the implementation:

- In the bgwriter loop, CheckArchiveTimeout always calls time(NULL), while
previously it used the value returned by another call earlier in the
same codepath. That means we now call time(NULL) twice instead of once
per bgwriter iteration when archive_timeout is set. That doesn't seem
significant to me, so I didn't try to optimize it.

- Because of a small change in the meaning of the force_checkpoint flag in
the bgwriter loop, checkpoints triggered by reaching checkpoint_segments
call CreateCheckPoint(false, false) instead of CreateCheckPoint(false,
true). That second argument is the "force" flag. If it's false,
CreateCheckPoint skips the checkpoint if there's been no WAL activity
since the last checkpoint. It doesn't matter in this case: there surely
has been WAL activity if we reach checkpoint_segments, and doing the
check isn't that expensive.

- To coordinate the writes with checkpoint_segments, we need to
read the WAL insertion location. To do that, we need to acquire the
WALInsertLock. That means that in the worst case, WALInsertLock is
acquired every bgwriter_delay when a checkpoint is in progress. I don't
think that's a problem, since it's only held for a very short duration,
but I thought I'd mention it.

- How should we deal with changing GUC variables that affect LDC, on the
fly when a checkpoint is in progress? The attached patch finishes the
in-progress checkpoint ASAP, and reloads the config after that. We could
reload the config immediately, but making the new settings effective
immediately is not trivial.
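
The coordination between checkpoint writes and checkpoint_segments described above can be pictured as comparing two fractions: how much of the checkpoint has been written, and how much of the WAL allowance has been used since the checkpoint started. The sketch below is only an illustration of that idea under stated assumptions, not the logic in the attached patch; the struct, its fields, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for an in-progress checkpoint. */
typedef struct CheckpointProgress
{
    int      pages_written;  /* dirty pages written so far */
    int      pages_total;    /* dirty pages to write in this checkpoint */
    uint64_t wal_start;      /* WAL insert position at checkpoint start */
    uint64_t wal_now;        /* current WAL insert position */
    uint64_t wal_budget;     /* WAL bytes allowed before the next
                              * checkpoint is due (assumed to be derived
                              * from checkpoint_segments) */
} CheckpointProgress;

/*
 * Return true if the writes are keeping up with WAL consumption, so the
 * writer may sleep until the next round; false if it should write more
 * pages now to catch up.
 */
bool
checkpoint_is_on_schedule(const CheckpointProgress *p)
{
    double write_frac;
    double wal_frac;

    assert(p->pages_total > 0);
    assert(p->wal_budget > 0);
    assert(p->wal_now >= p->wal_start);

    write_frac = (double) p->pages_written / (double) p->pages_total;
    wal_frac = (double) (p->wal_now - p->wal_start) / (double) p->wal_budget;

    return write_frac >= wal_frac;
}
```

In this sketch wal_now is the value that would have to be read under WALInsertLock once per round, which is where the locking cost mentioned above comes from.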

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
ldc-justwrites-2.patch text/x-diff 32.3 KB
