Re: Redesigning checkpoint_segments

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Redesigning checkpoint_segments
Date: 2013-09-01 07:37:38
Message-ID: CAA4eK1+G93H8aYp3wCjcQODYZ9wRs5XDaecFeeM2Zhv4Ubd3_g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 24, 2013 at 2:38 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 03.07.2013 21:28, Peter Eisentraut wrote:
>>
>> On 6/6/13 4:09 PM, Heikki Linnakangas wrote:
>>>
>>> Here's a patch implementing that. Docs not updated yet. I did not change
>> the way checkpoint_segments triggers checkpoints - that can be a
>>> separate patch. This only decouples the segment preallocation behavior
>>> from checkpoint_segments. With the patch, you can set
>>> checkpoint_segments really high, without consuming that much disk space
>>> all the time.
>>
>>
>> I don't understand what this patch, by itself, will accomplish in terms
>> of the originally stated goals of making checkpoint_segments easier to
>> tune, and controlling disk space used. To some degree, it makes both of
>> these things worse, because you can no longer use checkpoint_segments to
>> control the disk space. Instead, it is replaced by magic.
>
>
> The patch addressed the third point in my first post:
>
>> A third point is that even if you have 10 GB of disk space reserved
>> for WAL, you don't want to actually consume all that 10 GB, if it's
>> not required to run the database smoothly. There are several reasons
>> for that: backups based on a filesystem-level snapshot are larger
>> than necessary, if there are a lot of preallocated WAL segments and
>> in a virtualized or shared system, there might be other VMs or
>> applications that could make use of the disk space. On the other
>> hand, you don't want to run out of disk space while writing WAL -
>> that can lead to a PANIC in the worst case.
>
>
>> What sort of behavior are you expecting to come out of this? In testing,
>> I didn't see much of a difference. Although I'd expect that this would
>> actually preallocate fewer segments than the old formula.
>
>
> For example, if you set checkpoint_segments to 200, and you temporarily
> generate 100 segments of WAL during an initial data load, but the normal
> workload generates only 20 segments between checkpoints. Without the patch,
> you will permanently have about 120 segments in pg_xlog, created by the
> spike. With the patch, the extra segments will be gradually removed after
> the data load, down to the level needed by the constant workload. That would
> be about 50 segments, assuming the default checkpoint_completion_target=0.5.
>
>
> Here's a bigger patch, which does more. It is based on the ideas in the post
> I started this thread with, with feedback incorporated from the long
> discussion. With this patch, WAL disk space usage is controlled by two GUCs:
>
> min_recycle_wal_size
> checkpoint_wal_size

I think it will be helpful for users to configure this in terms of WAL size
rather than a number of segments, and your idea of keeping the WAL size
under control should be useful to them.

> These GUCs act as soft minimum and maximum on overall WAL size. At each
> checkpoint, the checkpointer removes enough old WAL files to keep pg_xlog
> usage below checkpoint_wal_size, and recycles enough new WAL files to reach
> min_recycle_wal_size. Between those limits, there is a self-tuning mechanism
> to recycle just enough WAL files to get to end of the next checkpoint
> without running out of preallocated WAL files. To estimate how many files
> are needed for that, a moving average of how much WAL is generated between
> checkpoints is calculated. The moving average is updated with "fast-rise
> slow-decline" behavior, to cater for peak rather than true average use to
> some extent.
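
If I understand the "fast-rise slow-decline" behaviour correctly, it amounts
to something like the sketch below (not the patch's actual code; the 0.90
weight is only my assumption for illustration):

/* Moving average of WAL bytes generated between checkpoints. */
static double CheckPointDistanceEstimate = 0;

/*
 * Update the estimate with the WAL distance of the checkpoint that just
 * completed: rise to a new peak immediately, but decay towards lower values
 * only gradually, so that the estimate caters for peak usage.
 */
static void
UpdateCheckPointDistanceEstimate(double nbytes)
{
    if (nbytes > CheckPointDistanceEstimate)
        CheckPointDistanceEstimate = nbytes;    /* fast rise */
    else
        CheckPointDistanceEstimate =
            0.90 * CheckPointDistanceEstimate + 0.10 * nbytes;  /* slow decline */
}
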
>
> As today, checkpoints are triggered based on time or WAL usage, whichever
> comes first. WAL-based checkpoints are triggered based on the good old
> formula: CheckPointSegments = (checkpoint_max_wal_size / (2.0 +
> checkpoint_completion_target)) / 16MB. CheckPointSegments controls that like
> before, but it is now an internal variable derived from checkpoint_wal_size,
> not visible to users.
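
To check my understanding of the derivation, a rough sketch of that formula
(illustrative only, not the patch's code) would be:

static int CheckPointSegments;

/*
 * With checkpoint_wal_size = 1024 MB and checkpoint_completion_target = 0.5,
 * this gives (1024 / (2.0 + 0.5)) / 16 = ~25 segments.  Conversely, the
 * workload quoted earlier that writes 20 segments between checkpoints would
 * retain roughly (2.0 + 0.5) * 20 = 50 segments after each checkpoint.
 */
static void
CalculateCheckPointSegments(double checkpoint_wal_size_mb,
                            double checkpoint_completion_target)
{
    /* 16 is the default WAL segment size in MB */
    CheckPointSegments = (int) (checkpoint_wal_size_mb /
                                (2.0 + checkpoint_completion_target) / 16.0);
    if (CheckPointSegments < 1)
        CheckPointSegments = 1;
}

A few comments on the patch itself: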

a.
In XLogFileInit(),
/*
! * XXX: What should we use as max_segno? We used to use XLOGfileslop when
! * that was a constant, but that was always a bit dubious: normally, at a
! * checkpoint, XLOGfileslop was the offset from the checkpoint record,
! * but here, it was the offset from the insert location. We can't do the
! * normal XLOGfileslop calculation here because we don't have access to
! * the prior checkpoint's redo location. So somewhat arbitrarily, just
! * use CheckPointSegments.
! */
! max_segno = logsegno + CheckPointSegments;
if (!InstallXLogFileSegment(&installed_segno, tmppath,
! *use_existent, max_segno,
use_lock))

Earlier, max_advance was the same whether InstallXLogFileSegment() was
called from RemoveOldXlogFiles() or from XLogFileInit(), but now the two
values will differ (and there seems to be no direct relation between them).
Will that still be okay for the scenario where someone else has created the
file while this function was filling it? In that case the file needs to be
installed as a future segment, and which segment that is will be decided
based on max_segno.

b. Does CreateRestartPoint() need to update CheckPointDistanceEstimate?
When it tries to remove old xlog files, it needs recycleSegNo, which is
calculated using CheckPointDistanceEstimate.

c. The new variables are not present in postgresql.conf after initdb.

> These settings are fairly intuitive for a DBA to tune. You begin by figuring
> out how much disk space you can afford to spend on WAL, and set
> checkpoint_wal_size to that (with some safety margin, of course). Then you
> set checkpoint_timeout based on how long you're willing to wait for recovery
> to finish. Finally, if you have infrequent batch jobs that need a lot more
> WAL than the system otherwise needs, you can set min_recycle_wal_size to
> keep enough WAL preallocated for the spikes.
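
For example, following that recipe a postgresql.conf might end up with
something like the following (the values are purely illustrative):

checkpoint_wal_size = 4GB        # disk space you can afford for WAL, with margin
checkpoint_timeout = 15min       # bounded by acceptable recovery time
min_recycle_wal_size = 512MB     # headroom kept preallocated for batch-job spikes
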
>
> You can also set min_recycle_wal_size = checkpoint_wal_size, which gets you
> the same behavior as without the patch, except that it's more intuitive to
> set it in terms of "MB of WAL space required", instead of "# of segments
> between checkpoints".
>
> Does that make sense? I'd love to hear feedback on how people setting up
> production databases would like to tune these things. The reason for the
> auto-tuning between the min and max is to be able to set reasonable defaults
> e.g for embedded systems that don't have a DBA to do tuning. Currently, it's
> very difficult to come up with a reasonable default value for
> checkpoint_segments which would work well for a wide range of systems. The
> PostgreSQL default of 3 is way way too low for most systems. On the other
> hand, if you set it to, say, 20, that's a lot of wasted space for a small
> database that's not updated much. With this patch, you can set
> "max_wal_size=1GB" and if the database ends up actually only needing 100 MB
> of WAL, it will only use that much and not waste 900 MB for useless
> preallocated WAL files.

As a developer, I would love to have a configuration knob such as
min_recycle_wal_size, but I am not sure how many users will be comfortable
setting this value. The few users I have talked to about this earlier are
mainly interested in setting a maximum WAL size, which lets them put an
upper limit on the disk space required by WAL.
Can't we do the calculation of how many files to recycle based only on
CheckPointDistanceEstimate?
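
Something like the rough sketch below is what I have in mind (the name and
the 10% slack are just for illustration, not taken from the patch; it reuses
the existing xlog.c variables):

/*
 * Decide up to which segment number old WAL files may be recycled, based
 * only on the estimated WAL distance to the end of the next checkpoint.
 */
static XLogSegNo
GetRecycleSegNo(XLogRecPtr RedoRecPtr)
{
    double      distance;

    /* expected WAL until the end of the next checkpoint, plus 10% slack */
    distance = (1.0 + CheckPointCompletionTarget) * CheckPointDistanceEstimate;
    distance *= 1.10;

    return (XLogSegNo) ceil(((double) RedoRecPtr + distance) / XLOG_SEG_SIZE);
}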

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
