Re: Redesigning checkpoint_segments

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-06 18:41:49
Message-ID: CAMkU=1wT1NXLA=Bt9L1rnpA4cT3T_GNb1rRvKN-kFDw0QCNbcA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 5, 2013 at 8:20 PM, Joshua D. Drake <jd(at)commandprompt(dot)com> wrote:

>
> On 06/05/2013 05:37 PM, Robert Haas wrote:
>
>> - If it looks like we're going to exceed limit #3 before the
>> checkpoint completes, we start exerting back-pressure on writers by
>> making them wait every time they write WAL, probably in proportion to
>> the number of bytes written. We keep ratcheting up the wait until
>> we've slowed down writers enough that the checkpoint will finish within
>> limit #3. As we reach limit #3, the wait goes to infinity; only read-only
>> operations can proceed until the checkpoint finishes.
>>
>
> Alright, perhaps I am dense. I have read both this thread and the other
> one on better handling of archive command
> (http://www.postgresql.org/message-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com).
> I recognize there are brighter minds than mine on this thread but I just
> honestly don't get it.
>
> 1. WAL writes are already fast. They are the fastest write we have because
> it is sequential.
>
> 2. We don't want them to be slow. We want data written to disk as quickly
> as possible without adversely affecting production. That's the point.
>

If speed of archiving is the fundamental bottleneck on the system, how does
that bottleneck get communicated forward to the user? PANICs are a
horrible way of doing it; throttling the writing of WAL (and hence the
acceptance of COMMITs) seems like a reasonable alternative. Maybe speed
of archiving is not the fundamental bottleneck on your systems, but...
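
As an illustration only, the throttling idea quoted above could be sketched as a delay curve like the one below. Everything here is a hypothetical assumption made for the sketch: the function name, the soft and hard limit parameters, and the formula itself are not existing PostgreSQL code or settings. The wait is proportional to the bytes written and grows without bound as WAL usage approaches the hard limit.

```python
import math


def wal_write_delay(bytes_written, wal_used, soft_limit, hard_limit,
                    base_delay_per_byte=1e-9):
    """Seconds a writer waits after writing `bytes_written` bytes of WAL.

    Hypothetical illustration; none of these names exist in PostgreSQL.
    """
    if hard_limit <= soft_limit:
        raise ValueError("hard_limit must exceed soft_limit")
    if wal_used <= soft_limit:
        return 0.0        # plenty of headroom: no back-pressure
    if wal_used >= hard_limit:
        return math.inf   # at the limit: writers stop, readers continue
    # Ratio rises from 0 at the soft limit toward infinity at the hard limit.
    pressure = (wal_used - soft_limit) / (hard_limit - wal_used)
    return base_delay_per_byte * bytes_written * pressure
```

With a curve of this shape, COMMITs slow down gradually as the archive falls behind, and the read-only state at the limit is just the end point of the same curve rather than a separate mode.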

>
> 3. The spread checkpoints have always confused me. If anything we want a
> checkpoint to be fast and short because:
>
> 4. Bgwriter. We should be adjusting bgwriter so that it is writing
> everything in a manner that allows any checkpoint to be in the range of
> never noticed.
>

They do different things. The bgwriter writes buffers out to make room for
incoming ones. A checkpoint writes them out (and fsyncs the underlying
files) to allow the redo pointer to advance (limiting recovery time after a
soft crash) and xlogs to be recycled (limiting disk space).

>
> Now perhaps my customers' workloads are different but for us:
>
> 1. Checkpoint timeout is set as high as reasonable, usually 30 minutes to
> an hour. I wish I could set them even further out.
>

Yeah, I think the limit of 1 hr is rather nanny-ish. I know what I'm
doing, and I want the freedom to go longer if that is what I want to do.

> 2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
> based on an actual amount of IO bandwidth it may take per second based on
> their IO constraints. (Note I know that wal_writer comes into play here but
> I honestly don't remember where and am reading up on it to refresh my
> memory).
>

I find bgwriter to be almost worthless, at least since the fsync queue
compaction code went in. When IO is free-flowing the kernel accepts writes
almost instantaneously, and so the backends can write out dirty buffers
themselves very quickly and it is not worth off-loading to a background
process. When IO is constipated, it would be worth off-loading, except that
in those circumstances the bgwriter cannot possibly keep up.

>
> 3. The biggest issue we see with checkpoint segments is not running out of
> space because really.... 10GB is how many checkpoint segments? It is with
> wal_keep_segments. If we don't want to fill up the pg_xlog directory, put
> the wal logs that are for keep_segments elsewhere.
>

Which is what archiving does. But then you have to put a lot of thought
into how to clean up the archive, assuming your policy is not to keep it
forever. wal_keep_segments can be a nice compromise.
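
For a sense of scale on the "10GB" question, here is a back-of-the-envelope sketch. It assumes the default 16 MB WAL segment size and uses the steady-state file-count estimate from the PostgreSQL documentation of this era. The function names are made up for the example, and taking the larger of the two documented expressions is one reading of that estimate.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size (assumed)


def segments_in(disk_bytes):
    """How many whole WAL segments fit in `disk_bytes`."""
    return disk_bytes // SEGMENT_BYTES


def normal_max_wal_files(checkpoint_segments, checkpoint_completion_target,
                         wal_keep_segments=0):
    """Documented rough ceiling on files in pg_xlog under normal operation."""
    spread = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    kept = checkpoint_segments + wal_keep_segments + 1
    return max(spread, kept)


print(segments_in(10 * 1024**3))  # 640
```

With a large wal_keep_segments the second expression dominates, which matches the complaint that it is wal_keep_segments, not checkpoint_segments, that fills pg_xlog.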

>
> Other oddities:
>
> Yes checkpoint_segments is awkward. We shouldn't have to set it at all. It
> should be gone. Basically we start with X amount perhaps to be set at
> initdb time. That X amount changes dynamically based on the amount of data
> being written. In order to not suffer from recycling and creation penalties
> we always keep X+N where N is enough to keep up with new data.
>
> Along with the above, I don't see any reason for checkpoint_timeout.
> Because of bgwriter we should be able to rather indefinitely not worry
> about checkpoints (with a few exceptions such as pg_start_backup()).
> Perhaps a setting that causes a checkpoint to happen based on some
> non-artificial threshold (timeout) such as amount of data currently in need
> of a checkpoint?
>

Without checkpoints, how would the redo pointer ever advance?

If the system is io limited during recovery, then checkpoint_segments is a
fairly natural way to put a limit on how long recovery from a soft crash
will take. If the system is CPU limited during recovery, then
checkpoint_timeout is a fairly natural way to put a limit on how long
recovery will take. It is probably possible to come up with a single merged
setting that is better than both of those in almost all circumstances, but
how much work would that take to get right?
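
One way to read that argument is the rough sketch below. The assumptions are deliberately crude and are not measurements: the replay rate and the replay speedup factor are hypothetical placeholders, the 16 MB segment size is the default, and treating checkpoint_segments worth of WAL as the replay volume understates the worst case when a checkpoint was in progress at crash time.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size (assumed)


def io_limited_recovery_seconds(checkpoint_segments, replay_bytes_per_sec):
    """IO-limited replay: time scales with the volume of WAL to read back."""
    return checkpoint_segments * SEGMENT_BYTES / replay_bytes_per_sec


def cpu_limited_recovery_seconds(checkpoint_timeout_sec, replay_speedup):
    """CPU-limited replay: time scales with how long the WAL took to generate.

    `replay_speedup` is a hypothetical ratio of replay speed to the speed
    at which the original workload generated the WAL.
    """
    return checkpoint_timeout_sec / replay_speedup
```

Each setting caps a different quantity (bytes of WAL versus elapsed time), which is why a single merged setting would need to know which resource limits recovery on a given system.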

...

> Instead of "running out of disk space PANIC" we should just write to an
> emergency location within PGDATA and log very loudly that the SA isn't
> paying attention.

If the SA isn't paying attention, who is it that we are loudly saying these
things to?

If whatever caused archiving to break also caused the archiving failure
emails to not be delivered, about the only way you can get louder is by
refusing new requests from the end user.

> Perhaps if that area starts to get to an unhappy place we immediately
> bounce into read-only mode and log even more loudly that the SA should be
> fired. I would think read-only mode is safer and more polite than a PANIC
> crash.
>

Isn't that effectively what throttling WAL writing is?

Cheers,

Jeff
