Re: Redesigning checkpoint_segments

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redesigning checkpoint_segments
Date: 2013-08-23 21:08:30
Message-ID: 5217CF4E.5080709@vmware.com
Lists: pgsql-hackers

On 03.07.2013 21:28, Peter Eisentraut wrote:
> On 6/6/13 4:09 PM, Heikki Linnakangas wrote:
>> Here's a patch implementing that. Docs not updated yet. I did not change
> the way checkpoint_segments triggers checkpoints - that can be a
>> separate patch. This only decouples the segment preallocation behavior
>> from checkpoint_segments. With the patch, you can set
>> checkpoint_segments really high, without consuming that much disk space
>> all the time.
>
> I don't understand what this patch, by itself, will accomplish in terms
> of the originally stated goals of making checkpoint_segments easier to
> tune, and controlling disk space used. To some degree, it makes both of
> these things worse, because you can no longer use checkpoint_segments to
> control the disk space. Instead, it is replaced by magic.

The patch addressed the third point in my first post:

> A third point is that even if you have 10 GB of disk space reserved
> for WAL, you don't want to actually consume all that 10 GB, if it's
> not required to run the database smoothly. There are several reasons
> for that: backups based on a filesystem-level snapshot are larger
> than necessary if there are a lot of preallocated WAL segments; and
> in a virtualized or shared system, there might be other VMs or
> applications that could make use of the disk space. On the other
> hand, you don't want to run out of disk space while writing WAL -
> that can lead to a PANIC in the worst case.

> What sort of behavior are you expecting to come out of this? In testing,
> I didn't see much of a difference. Although I'd expect that this would
> actually preallocate fewer segments than the old formula.

For example, suppose you set checkpoint_segments to 200, temporarily
generate 100 segments of WAL during an initial data load, but the normal
workload generates only 20 segments between checkpoints. Without the
patch, you will permanently have about 120 segments in pg_xlog, created
by the spike. With the patch, the extra segments will be gradually
removed after the data load, down to the level needed by the constant
workload. That would be about 50 segments, assuming the default
checkpoint_completion_target=0.5.
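
To spell out the arithmetic, here's a rough sketch of the estimate (not
the patch's exact code; the numbers are just the ones from the example
above):

    #include <stdio.h>

    int
    main(void)
    {
        double completion_target = 0.5;  /* checkpoint_completion_target */
        int    segs_per_cycle = 20;      /* WAL segments per checkpoint, steady workload */

        /* Steady-state WAL kept is roughly (2 + completion_target) times
         * the WAL generated per checkpoint cycle. */
        int kept = (int) ((2.0 + completion_target) * segs_per_cycle);

        printf("about %d segments kept in pg_xlog\n", kept);   /* ~50 */
        return 0;
    }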

Here's a bigger patch, which does more. It is based on the ideas in the
post I started this thread with, with feedback incorporated from the
long discussion. With this patch, WAL disk space usage is controlled by
two GUCs:

min_recycle_wal_size
checkpoint_wal_size

These GUCs act as soft minimum and maximum on overall WAL size. At each
checkpoint, the checkpointer removes enough old WAL files to keep
pg_xlog usage below checkpoint_wal_size, and recycles enough new WAL
files to reach min_recycle_wal_size. Between those limits, there is a
self-tuning mechanism that recycles just enough WAL files to get to the
end of the next checkpoint without running out of preallocated WAL files.
To estimate how many files are needed for that, a moving average of how
much WAL is generated between checkpoints is maintained. The moving
average is updated with "fast-rise slow-decline" behavior, so that it
caters for peak use rather than true average use, to some extent.
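
To illustrate the "fast-rise slow-decline" idea, a minimal sketch (not
the patch's actual code; the 0.9/0.1 smoothing factors are made up for
the example):

    /* Moving average of WAL generated per checkpoint cycle, in bytes. */
    static double avg_wal_per_ckpt = 0.0;

    /* Rise immediately to a new peak, decay slowly afterwards. */
    static void
    update_wal_estimate(double wal_since_last_ckpt)
    {
        if (wal_since_last_ckpt > avg_wal_per_ckpt)
            avg_wal_per_ckpt = wal_since_last_ckpt;           /* fast rise */
        else
            avg_wal_per_ckpt = 0.9 * avg_wal_per_ckpt
                             + 0.1 * wal_since_last_ckpt;     /* slow decline */
    }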

As today, checkpoints are triggered based on time or WAL usage,
whichever comes first. WAL-based checkpoints are triggered based on the
good old formula: CheckPointSegments = (checkpoint_wal_size / (2.0 +
checkpoint_completion_target)) / 16MB. CheckPointSegments controls
WAL-based checkpoint triggering like before, but it is now an internal
variable derived from checkpoint_wal_size, not visible to users.
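
As a sketch, the derivation could look roughly like this (hypothetical
helper, not the patch's exact code; it assumes the default 16 MB segment
size):

    #define XLOG_SEG_SIZE  (16 * 1024 * 1024)    /* default WAL segment size */

    /* Derive the internal CheckPointSegments from checkpoint_wal_size
     * (in bytes) and checkpoint_completion_target. */
    static int
    calc_checkpoint_segments(double checkpoint_wal_size,
                             double checkpoint_completion_target)
    {
        double segs = checkpoint_wal_size /
                      (2.0 + checkpoint_completion_target) / XLOG_SEG_SIZE;

        return (segs < 1.0) ? 1 : (int) segs;
    }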

These settings are fairly intuitive for a DBA to tune. You begin by
figuring out how much disk space you can afford to spend on WAL, and set
checkpoint_wal_size to that (with some safety margin, of course). Then
you set checkpoint_timeout based on how long you're willing to wait for
recovery to finish. Finally, if you have infrequent batch jobs that need
a lot more WAL than the system otherwise needs, you can set
min_recycle_wal_size to keep enough WAL preallocated for the spikes.

You can also set min_recycle_wal_size = checkpoint_wal_size, which gets
you the same behavior as without the patch, except that it's more
intuitive to set it in terms of "MB of WAL space required", instead of
"# of segments between checkpoints".

Does that make sense? I'd love to hear feedback on how people setting up
production databases would like to tune these things. The reason for the
auto-tuning between the min and max is to be able to set reasonable
defaults, e.g. for embedded systems that don't have a DBA to do tuning.
Currently, it's very difficult to come up with a reasonable default
value for checkpoint_segments which would work well for a wide range of
systems. The PostgreSQL default of 3 is way too low for most
systems. On the other hand, if you set it to, say, 20, that's a lot of
wasted space for a small database that's not updated much. With this
patch, you can set "checkpoint_wal_size=1GB", and if the database ends
up actually needing only 100 MB of WAL, it will use only that much and
not waste 900 MB on useless preallocated WAL files.

These GUCs are still soft limits. If the system is busy enough that the
checkpointer can't reach its target, it can exceed checkpoint_wal_size.
Making it a hard limit is a much bigger task than I'm willing to tackle
right now.

- Heikki

Attachment Content-Type Size
redesign-checkpoint-segments-1.patch text/x-diff 45.2 KB
