Re: Redesigning checkpoint_segments

From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-06 03:20:22
Message-ID: 51AFFFF6.6090902@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 06/05/2013 05:37 PM, Robert Haas wrote:

> - If it looks like we're going to exceed limit #3 before the
> checkpoint completes, we start exerting back-pressure on writers by
> making them wait every time they write WAL, probably in proportion to
> the number of bytes written. We keep ratcheting up the wait until
> we've slowed down writers enough that will finish within limit #3. As
> we reach limit #3, the wait goes to infinity; only read-only
> operations can proceed until the checkpoint finishes.

Alright, perhaps I am dense. I have read both this thread and the other
one on better handling of archive command
(http://www.postgresql.org/message-id/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com).
I recognize there are brighter minds than mine on this thread but I just
honestly don't get it.

1. WAL writes are already fast. They are the fastest write we have
because it is sequential.

2. We don't want them to be slow. We want data written to disk as
quickly as possible without adversely affecting production. That's the
point.

3. The spread checkpoints have always confused me. If anything we want a
checkpoint to be fast and short because:

4. Bgwriter. We should be adjusting bgwriter so that it is writing
everything in a manner that allows any checkpoint to be in the range of
never noticed.

Now perhaps my customers workloads are different but for us:

1. Checkpoint timeout is set as high as reasonable, usually 30 minutes
to an hour. I wish I could set them even further out.

2. Bgwriter is set to be aggressive but not obtrusive. Usually adjusting
based on an actual amount of IO bandwidth it may take per second based
on their IO constraints. (Note I know that wal_writer comes into play
here but I honestly don't remember where and am reading up on it to
refresh my memory).

3. The biggest issue we see with checkpoint segments is not running out
of space because really.... 10GB is how many checkpoint segments? It is
with wal_keep_segments. If we don't want to fill up the pg_xlog
directory, put the wal logs that are for keep_segments elsewhere.

Other oddities:

Yes checkpoint_segments is awkward. We shouldn't have to set it at all.
It should be gone. Basically we start with X amount perhaps to be set at
initdb time. That X amount changes dynamically based on the amount of
data being written. In order to not suffer from recycling and creation
penalties we always keep X+N where N is enough to keep up with new data.

Along with the above, I don't see any reason for checkpoint_timeout.
Because of bgwriter we should be able to rather indefinitely not worry
about checkpoints (with a few exceptions such as pg_start_backup()).
Perhaps a setting that causes a checkpoint to happen based on some
non-artificial threshold (timeout) such as amount of data currently in
need of a checkpoint?

Heikki said, "I propose that we do something similar, but not exactly
the same. Let's have a setting, max_wal_size, to control the max. disk
space reserved for WAL. Once that's reached (or you get close enough, so
that there are still some segments left to consume while the checkpoint
runs), a checkpoint is triggered.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles. "

This makes sense except I don't see a need for the parameter. Why not
just specify how the algorithm works and adhere to that without the need
for another GUC? Perhaps at any given point we save 10% of available
space (within a 16MB calculation) for pg_xlog, you hit it, we checkpoint
and LOG EXACTLY WHY.

Instead of "running out of disk space PANIC" we should just write to an
emergency location within PGDATA and log very loudly that the SA isn't
paying attention. Perhaps if that area starts to get to an unhappy place
we immediately bounce into read-only mode and log even more loudly that
the SA should be fired. I would think read-only mode is safer and more
polite than an PANIC crash.

I do not think we should worry about filling up the hard disk except to
protect against data loss in the event. It is not user unfriendly to
assume that a user will pay attention to disk space. Really?

Open to people telling me I am off in left field. Sorry if it is noise.

Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua D. Drake 2013-06-06 03:23:43 Re: Redesigning checkpoint_segments
Previous Message Peter Eisentraut 2013-06-06 02:13:45 Re: Make targets of doc links used by phpPgAdmin static