Re: Redesigning checkpoint_segments

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-06 08:11:55
Message-ID: 51B0444B.2020800@vmware.com
Lists: pgsql-hackers

On 06.06.2013 06:20, Joshua D. Drake wrote:
> 3. The spread checkpoints have always confused me. If anything we want a
> checkpoint to be fast and short because:

(I'm sure you know this, but:) If you perform a checkpoint as fast and
short as possible, the sudden burst of writes and fsyncs will overwhelm
the I/O subsystem and slow down queries. That's what we saw before
spread checkpoints: whenever a checkpoint happened, query response
times jumped up.

> 4. Bgwriter. We should be adjusting bgwriter so that it is writing
> everything in a manner that allows any checkpoint to be in the range of
> never noticed.

Oh, I see where you're going. Yeah, that would be one way to do it.
However, spread checkpoints have pretty much the same effect. Imagine
that you tune your system like this: disable bgwriter altogether, and
set checkpoint_completion_target=0.9. With that, there will be a
checkpoint in progress most of the time, because by the time one
checkpoint completes, it's almost time to begin the next one already. In
that case, the checkpointer will be slowly performing the writes, all
the time, in the background, without affecting queries. The effect is
the same as what you described above, except that it's the checkpointer
doing the writing, not bgwriter.

As it happens, that's pretty much what you get with the default settings.
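
Spelled out, the setup I described above would look something like this
in postgresql.conf (the values are just for illustration):

  bgwriter_lru_maxpages = 0           # disable bgwriter's LRU writes
  checkpoint_timeout = 5min           # checkpoints driven mainly by time
  checkpoint_completion_target = 0.9  # spread each checkpoint over ~90%
                                      # of the checkpoint interval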

> Now perhaps my customers workloads are different but for us:
>
> 1. Checkpoint timeout is set as high as reasonable, usually 30 minutes
> to an hour. I wish I could set them even further out.
>
> 2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
> based on an actual amount of IO bandwidth it may take per second based
> on their IO constraints. (Note I know that wal_writer comes into play
> here but I honestly don't remember where and am reading up on it to
> refresh my memory).

I've heard of people simply turning bgwriter off because it doesn't
have much effect anyway. You might want to try that, and if checkpoints
then cause I/O spikes, raise checkpoint_completion_target instead.

> 3. The biggest issue we see with checkpoint segments is not running out
> of space because really.... 10GB is how many checkpoint segments? It is
> with wal_keep_segments. If we don't want to fill up the pg_xlog
> directory, put the wal logs that are for keep_segments elsewhere.

Yeah, wal_keep_segments is a hack. We should replace it with something
else, like a registry in the master of all standbys and how far each of
them has streamed. That way the master could keep around exactly the
amount of WAL they actually need, no more, no less. But that's a
different story.
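
(For reference, with the standard 16 MB segment size the arithmetic is
simple; the numbers here are just an example:

  wal_keep_segments = 640   # 640 * 16 MB = 10 GB kept around in pg_xlog

and once the server has generated that much WAL, those segments stay in
pg_xlog whether or not any standby still needs them.)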

> Other oddities:
>
> Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
> It should be gone.

The point of having checkpoint_segments or max_wal_size is to put a
limit (albeit a soft one) on the amount of disk space used for WAL. If
you don't care about that, I guess we could allow max_wal_size=-1 to
mean infinite, and checkpoints would then be driven purely by time, not
by WAL consumption.
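
In other words something like this, where max_wal_size is the knob I'm
proposing, not an existing setting:

  max_wal_size = -1           # no limit on WAL disk usage
  checkpoint_timeout = 30min  # checkpoints triggered purely by time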

> Basically we start with X amount perhaps to be set at
> initdb time. That X amount changes dynamically based on the amount of
> data being written. In order to not suffer from recycling and creation
> penalties we always keep X+N where N is enough to keep up with new data.

To clarify, here you're referring to controlling the number of WAL
segments preallocated/recycled, rather than how often checkpoints are
triggered. Currently both are derived from checkpoint_segments, but I
proposed to separate them. The above is exactly what I proposed for the
preallocation/recycling: it would be tuned automatically. But you still
need something like max_wal_size for the other thing, to trigger a
checkpoint if too much WAL is being consumed.
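
One way the auto-tuning could work, just to sketch the shape of the
idea (illustrative pseudo-C; the names and numbers are made up, this is
not actual code):

  /* Smoothed estimate of WAL segments consumed per checkpoint cycle. */
  static double avg_segments_per_cycle = 0;

  /* Called at the end of each checkpoint cycle. */
  static void
  update_preallocation_estimate(int segments_used_this_cycle)
  {
      /* follow the recent trend, but don't overreact to one spike */
      avg_segments_per_cycle = 0.9 * avg_segments_per_cycle
                               + 0.1 * segments_used_this_cycle;
  }

  /* How many recycled/preallocated segments to keep around. */
  static int
  segments_to_keep(void)
  {
      /* enough for a typical cycle, plus some slop */
      return (int) (avg_segments_per_cycle * 1.1) + 1;
  }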

> Along with the above, I don't see any reason for checkpoint_timeout.
> Because of bgwriter we should be able to rather indefinitely not worry
> about checkpoints (with a few exceptions such as pg_start_backup()).
> Perhaps a setting that causes a checkpoint to happen based on some
> non-artificial threshold (timeout) such as amount of data currently in
> need of a checkpoint?

Either I'm not understanding what you said, or you're confused. The
point of checkpoint_timeout is to put a limit on the time it will take
to recover in case of a crash: recovery has to replay all WAL written
since the start of the last completed checkpoint, so the longer the
interval between checkpoints, the more WAL there is to replay. The
relation between checkpoint_timeout and the actual recovery time is not
straightforward, but it's the best handle we have.

Bgwriter does not worry about checkpoints. By "amount of data currently
in need of a checkpoint", do you mean the number of dirty buffers in
shared_buffers, or something else? I don't see how or why that should
affect when you perform a checkpoint.

> Heikki said, "I propose that we do something similar, but not exactly
> the same. Let's have a setting, max_wal_size, to control the max. disk
> space reserved for WAL. Once that's reached (or you get close enough, so
> that there are still some segments left to consume while the checkpoint
> runs), a checkpoint is triggered.
>
> In this proposal, the number of segments preallocated is controlled
> separately from max_wal_size, so that you can set max_wal_size high,
> without actually consuming that much space in normal operation. It's
> just a backstop, to avoid completely filling the disk, if there's a
> sudden burst of activity. The number of segments preallocated is
> auto-tuned, based on the number of segments used in previous checkpoint
> cycles. "
>
> This makes sense except I don't see a need for the parameter. Why not
> just specify how the algorithm works and adhere to that without the need
> for another GUC?

Because you want to limit the amount of disk space used for WAL. It's a
soft limit, but still.

> Perhaps at any given point we save 10% of available
> space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint
> and LOG EXACTLY WHY.

Ah, but we don't know how much disk space is available. Even if we did,
there might be quotas or other constraints on the amount that we can
actually use. Or the DBA might not want PostgreSQL to use up all the
space, because there are other processes on the same system that need it.

- Heikki
