Load Distributed Checkpoints, revised patch

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Load Distributed Checkpoints, revised patch
Date: 2007-06-15 10:34:22
Message-ID: 46726B2E.7060606@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Here's an updated WIP version of the LDC patch. It just spreads the
writes, which achieves the goal of smoothing the checkpoint I/O spikes.
I think sorting the writes etc. is interesting but falls into the
category of further development and should be pushed to 8.4.

The documentation changes are not complete, GUC variables need
descriptions, and some of the DEBUG elogs will go away in favor of the
separate checkpoint logging patch that's in the queue. I'm fairly happy
with the code now, but there's a few minor open issues:

- What units should we use for the new GUC variables? From an
implementation point of view, it would be simplest if
checkpoint_write_rate were given as pages/bgwriter_delay, similarly to
bgwriter_*_maxpages. I never liked those *_maxpages settings, though; a
more natural unit from the user's perspective would be KB/s.
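For illustration, the conversion a KB/s setting would imply might look
like the sketch below; the function name, the 8 kB BLCKSZ, and the
clamp to one page per cycle are my own assumptions, not code from the
patch:

```c
#include <assert.h>

#define BLCKSZ 8192     /* default PostgreSQL block size, in bytes */

/*
 * Hypothetical helper (not from the patch): translate a user-visible
 * rate in KB/s into the number of pages the bgwriter may write per
 * bgwriter_delay cycle.  bgwriter_delay is in milliseconds.
 */
static int
rate_kb_to_pages_per_delay(int rate_kb_per_s, int bgwriter_delay_ms)
{
    double  kb_per_delay = (double) rate_kb_per_s * bgwriter_delay_ms / 1000.0;
    int     pages = (int) (kb_per_delay * 1024.0 / BLCKSZ);

    return pages > 0 ? pages : 1;   /* always make some progress */
}
```

With the default bgwriter_delay of 200 ms, 4000 KB/s works out to 100
pages per cycle, the same ballpark as the existing bgwriter_*_maxpages
settings.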

- The signaling between RequestCheckpoint and bgwriter is a bit tricky.
Bgwriter now needs to deal with immediate checkpoint requests, like those
coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
from those triggered by checkpoint_segments. I'm afraid there might be
race conditions when a CHECKPOINT is issued at the same instant as
checkpoint_segments triggers one. What might happen then is that the
checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
command has to wait for that to finish, which might take a long time. I
have not been able to convince myself either that the race condition
exists or that it doesn't.

A few notes about the implementation:

- in the bgwriter loop, CheckArchiveTimeout always calls time(NULL),
while previously it used the value returned by another call earlier in
the same codepath. That means we now call time(NULL) twice instead of
once per bgwriter iteration when archive_timeout is set. That doesn't
seem significant to me, so I didn't try to optimize it.

- because of a small change in the meaning of the force_checkpoint flag
in the bgwriter loop, checkpoints triggered by reaching
checkpoint_segments call CreateCheckPoint(false, false) instead of
CreateCheckPoint(false, true). The second argument is the "force" flag:
if it's false, CreateCheckPoint skips the checkpoint if there's been no
WAL activity since the last checkpoint. That doesn't matter in this
case; there surely has been WAL activity if we reach
checkpoint_segments, and doing the check isn't that expensive.

- to coordinate the writes with checkpoint_segments, we need to
read the WAL insertion location. To do that, we need to acquire the
WALInsertLock. That means that in the worst case, WALInsertLock is
acquired every bgwriter_delay while a checkpoint is in progress. I don't
think that's a problem since it's only held for a very short duration,
but I thought I'd mention it.

- How should we deal with changing GUC variables that affect LDC, on the
fly when a checkpoint is in progress? The attached patch finishes the
in-progress checkpoint ASAP, and reloads the config after that. We could
reload the config immediately, but making the new settings effective
immediately is not trivial.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
ldc-justwrites-2.patch text/x-diff 32.3 KB

From: Michael Paesold <mpaesold(at)gmx(dot)at>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Load Distributed Checkpoints, revised patch
Date: 2007-06-15 11:11:57
Message-ID: 467273FD.1050102@gmx.at
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> Here's an updated WIP version of the LDC patch. It just spreads the
> writes, which achieves the goal of smoothing the checkpoint I/O spikes.
> I think sorting the writes etc. is interesting but falls into the
> category of further development and should be pushed to 8.4.

Why do you think so? Is it too much risk to adopt the sorted writes? The
numbers shown by ITAGAKI Takahiro looked quite impressive, at least for
large shared_buffers configurations. The reactions were rather
positive, too.

In general, I am hoping that this patch, together with "Automatic
adjustment of bgwriter_lru_maxpages", will finally make default
PostgreSQL configurations experience much less impact from checkpoints.
For my taste, PostgreSQL has recently got way too many knobs which one
must tweak by hand... I welcome any approach to auto-tuning (and
autovacuum!).

Patch status says "waiting on update from author":
http://archives.postgresql.org/pgsql-patches/2007-04/msg00331.php
Any updates on this?

Best Regards
Michael Paesold


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-15 18:47:02
Message-ID: 20070615184702.GR8313@alvh.no-ip.org
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:

> - The signaling between RequestCheckpoint and bgwriter is a bit tricky.
> Bgwriter now needs to deal with immediate checkpoint requests, like those
> coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
> from those triggered by checkpoint_segments. I'm afraid there might be
> race conditions when a CHECKPOINT is issued at the same instant as
> checkpoint_segments triggers one. What might happen then is that the
> checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
> command has to wait for that to finish, which might take a long time. I
> have not been able to convince myself either that the race condition
> exists or that it doesn't.

Isn't it just a matter of having a flag to tell whether the checkpoint
should be quick or spread out, and have a command set the flag if a
checkpoint is already running?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Michael Paesold <mpaesold(at)gmx(dot)at>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCHES] Load Distributed Checkpoints, revised patch
Date: 2007-06-15 19:31:41
Message-ID: 4672E91D.4060207@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Michael Paesold wrote:
> Heikki Linnakangas wrote:
>> Here's an updated WIP version of the LDC patch. It just spreads the
>> writes, which achieves the goal of smoothing the checkpoint I/O spikes.
>> I think sorting the writes etc. is interesting but falls into the
>> category of further development and should be pushed to 8.4.
>
> Why do you think so? Is it too much risk to adopt the sorted writes? The
> numbers shown by ITAGAKI Takahiro looked quite impressive, at least for
> large shared_buffers configurations. The reactions were rather
> positive, too.

Well, it is a very recent idea, and it's not clear that it's a win under
all circumstances. Adopting that would need more testing, and at this
late stage I'd like to just wrap up what we have and come back to this
idea for 8.4.

If someone performs the tests with different hardware and workloads, I
would be willing to consider it; the patch is still at an early stage,
but it's a very isolated change and should therefore be easy to review.
But if someone has the hardware and time to perform those tests, I'd
like them to do more testing with just LDC first. As Josh pointed out,
it would be good to test it with more oddball workloads; all the tests
I've done thus far have been with DBT-2.

> In general, I am hoping that this patch, together with "Automatic
> adjustment of bgwriter_lru_maxpages", will finally make default
> PostgreSQL configurations experience much less impact from checkpoints.
> For my taste, PostgreSQL has recently got way too many knobs which one
> must tweak by hand... I welcome any approach to auto-tuning (and
> autovacuum!).

Sure, but that's another topic.

> Patch status says "waiting on update from author":
> http://archives.postgresql.org/pgsql-patches/2007-04/msg00331.php
> Any updates on this?

No. I'm not actually clear what we're waiting for and from whom; I know
I haven't had the time to review that in detail yet. IIRC we've seen two
very similar patches in the discussions, one from Itagaki-san, and one
from Greg Smith. I'm not sure which one of them we should use, they both
implement roughly the same thing. But the biggest thing needed for that
patch is testing with different workloads.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-15 19:49:58
Message-ID: 4672ED66.8090606@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
>> - The signaling between RequestCheckpoint and bgwriter is a bit tricky.
>> Bgwriter now needs to deal with immediate checkpoint requests, like those
>> coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
>> from those triggered by checkpoint_segments. I'm afraid there might be
>> race conditions when a CHECKPOINT is issued at the same instant as
>> checkpoint_segments triggers one. What might happen then is that the
>> checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
>> command has to wait for that to finish, which might take a long time. I
>> have not been able to convince myself either that the race condition
>> exists or that it doesn't.
>
> Isn't it just a matter of having a flag to tell whether the checkpoint
> should be quick or spread out, and have a command set the flag if a
> checkpoint is already running?

Hmm. Thinking about this some more, the core problem is that when
starting the checkpoint, bgwriter needs to read and clear the flag.
Which is not atomic, as the patch stands.

I think we already have a race condition with ckpt_time_warn. The code
to test and clear the flag does this:

> if (BgWriterShmem->ckpt_time_warn && elapsed_secs < CheckPointWarning)
>     ereport(LOG,
>             (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
>                     elapsed_secs),
>              errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
> BgWriterShmem->ckpt_time_warn = false;

In the extremely unlikely event that RequestCheckpoint sets
ckpt_time_warn right before it's cleared, after the test in the
if-statement, the warning is missed. That's a very harmless and
theoretical event: you'd have to run CHECKPOINT (or another command that
triggers a checkpoint) at the same instant that an xlog switch triggers
one, and all that happens is that you don't get the message in the log
when you should. So this is not something to worry about in this case,
but it would be more severe if we had the same problem in deciding
whether a checkpoint should be spread out or not.

I think we just have to protect those signaling flags with a lock. It's
not like it's on a critical path, and though we don't know what locks
the callers to RequestCheckpoint hold, as long as we don't acquire any
other locks while holding the new proposed lock, there's no danger of
deadlocks.
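A minimal sketch of that locking discipline follows. Everything here is
illustrative: a pthread mutex stands in for a backend LWLock, and the
struct and function names are invented (the real flags live in
BgWriterShmem).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Invented stand-in for the checkpoint request flags in shared memory. */
typedef struct
{
    pthread_mutex_t lock;           /* stands in for an LWLock */
    bool            ckpt_requested;
    bool            ckpt_immediate; /* CHECKPOINT / CREATE DATABASE */
} CkptSignalShmem;

/* Backends call this; an immediate request upgrades a pending lazy one. */
static void
request_checkpoint(CkptSignalShmem *shmem, bool immediate)
{
    pthread_mutex_lock(&shmem->lock);
    shmem->ckpt_requested = true;
    if (immediate)
        shmem->ckpt_immediate = true;
    pthread_mutex_unlock(&shmem->lock);
}

/*
 * The bgwriter calls this when starting a checkpoint: reading and
 * clearing the flags under the lock makes the test-and-clear atomic,
 * so a concurrent request can't be lost between the test and the clear.
 */
static bool
consume_checkpoint_request(CkptSignalShmem *shmem, bool *immediate)
{
    bool    requested;

    pthread_mutex_lock(&shmem->lock);
    requested = shmem->ckpt_requested;
    *immediate = shmem->ckpt_immediate;
    shmem->ckpt_requested = false;
    shmem->ckpt_immediate = false;
    pthread_mutex_unlock(&shmem->lock);
    return requested;
}
```

Since the holder of this lock never acquires any other lock, the
no-deadlock argument above holds regardless of what the caller of
request_checkpoint already has locked.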

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-15 20:22:46
Message-ID: 20070615202246.GA21399@alvh.no-ip.org
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> Alvaro Herrera wrote:

> > if (BgWriterShmem->ckpt_time_warn && elapsed_secs < CheckPointWarning)
> >     ereport(LOG,
> >             (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
> >                     elapsed_secs),
> >              errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
> > BgWriterShmem->ckpt_time_warn = false;
>
> In the extremely unlikely event that RequestCheckpoint sets
> ckpt_time_warn right before it's cleared, after the test in the
> if-statement, the warning is missed.

I think this code should look like this:

if (BgWriterShmem->ckpt_time_warn)
{
    BgWriterShmem->ckpt_time_warn = false;
    if (elapsed_secs < CheckPointWarning)
        ereport(LOG,
                (errmsg("checkpoints are occurring too frequently (%d seconds apart)",
                        elapsed_secs),
                 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
}

That way seems safer. (I am assuming that a process other than the
bgwriter is able to set the ckpt_time_warn bit; otherwise there is no
point). This is also used in pmsignal.c. Of course, as you say, this
is probably very harmless, but in the other case it is important to get
it right.

--
Alvaro Herrera http://www.PlanetPostgreSQL.org/
"Hackers share the surgeon's secret pleasure in poking about in gross innards,
the teenager's secret pleasure in popping zits." (Paul Graham)


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-16 09:10:16
Message-ID: 1181985016.17734.67.camel@silverbirch.site
Lists: pgsql-hackers pgsql-patches

On Fri, 2007-06-15 at 11:34 +0100, Heikki Linnakangas wrote:

> - What units should we use for the new GUC variables? From an
> implementation point of view, it would be simplest if
> checkpoint_write_rate were given as pages/bgwriter_delay, similarly to
> bgwriter_*_maxpages. I never liked those *_maxpages settings, though; a
> more natural unit from the user's perspective would be KB/s.

checkpoint_maxpages would seem like a better name; we've already had
those _maxpages settings for 3 releases, so changing that is not really
an option (at so late a stage). We don't really care about units,
because the way you use it is to nudge it up a little and see if that
works, etc.

Can we avoid having another parameter? There must be some protection in
there to check that a checkpoint lasts for no longer than
checkpoint_timeout, so it makes most sense to vary the checkpoint in
relation to that parameter.

> - The signaling between RequestCheckpoint and bgwriter is a bit tricky.
> Bgwriter now needs to deal with immediate checkpoint requests, like those
> coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
> from those triggered by checkpoint_segments. I'm afraid there might be
> race conditions when a CHECKPOINT is issued at the same instant as
> checkpoint_segments triggers one. What might happen then is that the
> checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
> command has to wait for that to finish, which might take a long time. I
> have not been able to convince myself either that the race condition
> exists or that it doesn't.

Is there a mechanism for requesting immediate/non-immediate checkpoints?

pg_start_backup() should be a normal checkpoint I think. No need for
backup to be an intrusive process.

> - to coordinate the writes with checkpoint_segments, we need to
> read the WAL insertion location. To do that, we need to acquire the
> WALInsertLock. That means that in the worst case, WALInsertLock is
> acquired every bgwriter_delay while a checkpoint is in progress. I don't
> think that's a problem since it's only held for a very short duration,
> but I thought I'd mention it.

I think that is a problem. Do we need to know it so exactly that we look
at WALInsertLock? Maybe use info_lck to request the latest page, since
that is less heavily contended and we need never wait across I/O.

> - How should we deal with changing GUC variables that affect LDC, on the
> fly when a checkpoint is in progress? The attached patch finishes the
> in-progress checkpoint ASAP, and reloads the config after that. We could
> reload the config immediately, but making the new settings effective
> immediately is not trivial.

No need to do this during a checkpoint, there'll be another along
shortly anyhow.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Michael Paesold" <mpaesold(at)gmx(dot)at>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Patches" <pgsql-patches(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] Load Distributed Checkpoints, revised patch
Date: 2007-06-16 09:14:46
Message-ID: 1181985286.17734.72.camel@silverbirch.site
Lists: pgsql-hackers pgsql-patches

On Fri, 2007-06-15 at 13:11 +0200, Michael Paesold wrote:
> Heikki Linnakangas wrote:
> > Here's an updated WIP version of the LDC patch. It just spreads the
> > writes, which achieves the goal of smoothing the checkpoint I/O spikes.
> > I think sorting the writes etc. is interesting but falls into the
> > category of further development and should be pushed to 8.4.
>
> Why do you think so? Is it too much risk to adopt the sorted writes? The
> numbers shown by ITAGAKI Takahiro looked quite impressive, at least for
> large shared_buffers configurations. The reactions were rather
> positive, too.

Agreed.

Seems like a simple, isolated piece of code to include or not. If we
controlled it with a simple boolean parameter, that would allow testing
during beta - I agree it needs more testing.

If we find a way of automating it, cool. If it's flaky, strip it out
before we release.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-16 15:02:33
Message-ID: 11092.1182006153@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> checkpoint_maxpages would seem like a better name; we've already had
> those _maxpages settings for 3 releases, so changing that is not really
> an option (at so late a stage).

Sure it is. We've never promised stability of obscure tuning settings.
For something as commonly set from-an-application as, say, work_mem
(nee sort_mem), it's worth worrying about backward compatibility.
But not for things that in practice would only be set from
postgresql.conf. It's never been possible to just copy your old .conf
file without any thought when moving to a new release.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-16 16:46:12
Message-ID: 1182012372.17734.117.camel@silverbirch.site
Lists: pgsql-hackers pgsql-patches

On Sat, 2007-06-16 at 11:02 -0400, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > checkpoint_maxpages would seem like a better name; we've already had
> > those _maxpages settings for 3 releases, so changing that is not really
> > an option (at so late a stage).
>
> Sure it is.

Maybe, but we need a proposal before we agree to change anything.

If we can get rid of them, good, but if we can't I'd like to see a clear
posting of the parameters, their meaning and suggested names before we
commit this, please.

BTW, checkpoint_write_percent actually does what I wanted, but I wasn't
able to guess that from the name. checkpoint_duration_percent?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-17 07:51:35
Message-ID: 4674E807.3050805@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> On Fri, 2007-06-15 at 11:34 +0100, Heikki Linnakangas wrote:
>
>> - What units should we use for the new GUC variables? From an
>> implementation point of view, it would be simplest if
>> checkpoint_write_rate were given as pages/bgwriter_delay, similarly to
>> bgwriter_*_maxpages. I never liked those *_maxpages settings, though; a
>> more natural unit from the user's perspective would be KB/s.
>
> checkpoint_maxpages would seem like a better name; we've already had
> those _maxpages settings for 3 releases, so changing that is not really
> an option (at so late a stage).

As Tom pointed out, we don't promise compatibility of conf-files over
major releases. I wasn't actually thinking of changing any of the
existing parameters, just thinking about the best name and behavior for
the new ones.

> We don't really care about units because
> the way you use it is to nudge it up a little and see if that works
> etc..

Not necessarily. If it's given in KB/s, you might very well have an idea
of how much I/O your hardware is capable of, and set aside a fraction of
that for checkpoints.

> Can we avoid having another parameter? There must be some protection in
> there to check that a checkpoint lasts for no longer than
> checkpoint_timeout, so it makes most sense to vary the checkpoint in
> relation to that parameter.

Sure, that's what checkpoint_write_percent is for. checkpoint_rate can
be used to finish the checkpoint faster if there's not much work to do.
For example, if there are only 10 pages to flush in a checkpoint,
checkpoint_timeout is 30 minutes and checkpoint_write_percent = 50%, you
don't want to spread out those 10 writes over 15 minutes; that would be
just silly. checkpoint_rate sets the *minimum* rate used to write. If
writing at that minimum rate isn't enough to finish the checkpoint in
time, as defined by checkpoint interval * checkpoint_write_percent, we
write more aggressively.

I'm more interested in checkpoint_write_percent myself as well, but Greg
Smith said he wanted the checkpoint to use a constant I/O rate and let
the length of the checkpoint vary.
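That scheduling rule can be sketched as simple arithmetic. The function
below is illustrative only; its name and parameters are my invention,
not the patch's actual code:

```c
#include <assert.h>

/*
 * Illustrative pacing rule: write at least min_pages_per_delay per
 * bgwriter cycle, but more if that pace would miss the deadline implied
 * by checkpoint interval * checkpoint_write_percent.
 */
static double
pages_this_delay(int pages_left, double secs_until_deadline,
                 int bgwriter_delay_ms, double min_pages_per_delay)
{
    double  delays_left = secs_until_deadline * 1000.0 / bgwriter_delay_ms;
    /* With the deadline (nearly) reached, just flush everything left. */
    double  needed = (delays_left > 1.0) ? pages_left / delays_left
                                         : (double) pages_left;

    return (needed > min_pages_per_delay) ? needed : min_pages_per_delay;
}
```

With Heikki's example of 10 pages, a 30-minute timeout and
checkpoint_write_percent = 50%, the minimum rate dominates and the
handful of writes finishes in a couple of cycles rather than being
dribbled out over 15 minutes.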

>> - The signaling between RequestCheckpoint and bgwriter is a bit tricky.
>> Bgwriter now needs to deal with immediate checkpoint requests, like those
>> coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
>> from those triggered by checkpoint_segments. I'm afraid there might be
>> race conditions when a CHECKPOINT is issued at the same instant as
>> checkpoint_segments triggers one. What might happen then is that the
>> checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
>> command has to wait for that to finish, which might take a long time. I
>> have not been able to convince myself either that the race condition
>> exists or that it doesn't.
>
> Is there a mechanism for requesting immediate/non-immediate checkpoints?

No, CHECKPOINT requests an immediate one. Is there a use case for
CHECKPOINT LAZY?

> pg_start_backup() should be a normal checkpoint I think. No need for
> backup to be an intrusive process.

Good point. A spread-out checkpoint can take a long time to finish,
though. Is there a risk of running into a timeout or something if it
takes, say, 10 minutes for a call to pg_start_backup to finish?

>> - to coordinate the writes with with checkpoint_segments, we need to
>> read the WAL insertion location. To do that, we need to acquire the
>> WALInsertLock. That means that in the worst case, WALInsertLock is
>> acquired every bgwriter_delay when a checkpoint is in progress. I don't
>> think that's a problem, it's only held for a very short duration, but I
>> thought I'd mention it.
>
> I think that is a problem.

Why?

> Do we need to know it so exactly that we look
> at WALInsertLock? Maybe use info_lck to request the latest page, since
> that is less heavily contended and we need never wait across I/O.

Is there such a value available, that's protected by just info_lck? I
can't see one.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-17 08:39:04
Message-ID: 1182069544.6855.41.camel@silverbirch.site
Lists: pgsql-hackers pgsql-patches

On Sun, 2007-06-17 at 08:51 +0100, Heikki Linnakangas wrote:

> > We don't really care about units because
> > the way you use it is to nudge it up a little and see if that works
> > etc..
>
> Not necessarily. If it's given in KB/s, you might very well have an idea
> of how much I/O your hardware is capable of, and set aside a fraction of
> that for checkpoints.

I'm worried that people will think they can calculate the setting
without testing.

I guess with the right caveats in the docs about the need for testing to
ensure the values are suitable for your situation, I can accept KB/s.

> > Can we avoid having another parameter? There must be some protection in
> > there to check that a checkpoint lasts for no longer than
> > checkpoint_timeout, so it makes most sense to vary the checkpoint in
> > relation to that parameter.
>
> Sure, that's what checkpoint_write_percent is for.

Yeh, I didn't understand the name.

> checkpoint_rate can
> be used to finish the checkpoint faster if there's not much work to do.
> For example, if there are only 10 pages to flush in a checkpoint,
> checkpoint_timeout is 30 minutes and checkpoint_write_percent = 50%,
> you don't want to spread out those 10 writes over 15 minutes; that
> would be just silly. checkpoint_rate sets the *minimum* rate used to
> write. If writing at that minimum rate isn't enough to finish the
> checkpoint in time, as defined by checkpoint interval *
> checkpoint_write_percent, we write more aggressively.
>
> I'm more interested in checkpoint_write_percent myself as well, but
> Greg Smith said he wanted the checkpoint to use a constant I/O rate and
> let the length of the checkpoint vary.

Having both parameters is good.

I'm really impressed with the results on the response time graphs.

> >> - The signaling between RequestCheckpoint and bgwriter is a bit tricky.
> >> Bgwriter now needs to deal immediate checkpoint requests, like those
> >> coming from explicit CHECKPOINT or CREATE DATABASE commands, differently
> >> from those triggered by checkpoint_segments. I'm afraid there might be
> >> race conditions when a CHECKPOINT is issued at the same instant as
> >> checkpoint_segments triggers one. What might happen then is that the
> >> checkpoint is performed lazily, spreading the writes, and the CHECKPOINT
> >> command has to wait for that to finish which might take a long time. I
> >> have not been able to convince myself neither that the race condition
> >> exists or that it doesn't.
> >
> > Is there a mechanism for requesting immediate/non-immediate checkpoints?
>
> No, CHECKPOINT requests an immediate one. Is there a use case for
> CHECKPOINT LAZY?

I meant via the CreateCheckpoint API etc.

> > pg_start_backup() should be a normal checkpoint I think. No need for
> > backup to be an intrusive process.
>
> Good point. A spread-out checkpoint can take a long time to finish,
> though. Is there a risk of running into a timeout or something if it
> takes, say, 10 minutes for a call to pg_start_backup to finish?

That would be annoying, but the alternative is for backups to seriously
affect performance, which would defeat the object of the hot backup.
It's not like it's immediate right now, so we'd probably be moving from
2-3 mins to 10 mins in your example. Most people are expecting their
backups to take a long time anyway, so that's OK.

> > Do we need to know it so exactly that we look
> > at WALInsertLock? Maybe use info_lck to request the latest page, since
> > that is less heavily contended and we need never wait across I/O.
>
> Is there such a value available, that's protected by just info_lck? I
> can't see one.

XLogCtl->LogwrtRqst.Write

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-17 11:00:02
Message-ID: 46751432.2010300@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> On Sun, 2007-06-17 at 08:51 +0100, Heikki Linnakangas wrote:
>>> Do we need to know it so exactly that we look
>>> at WALInsertLock? Maybe use info_lck to request the latest page, since
>>> that is less heavily contended and we need never wait across I/O.
>> Is there such a value available, that's protected by just info_lck? I
>> can't see one.
>
> XLogCtl->LogwrtRqst.Write

That's the Write location, but checkpoint_segments is calculated against
the Insert location. In a normal OLTP scenario they would be close to
each other, but if you're doing a huge data load in a transaction
(restoring from a backup, for example), they could be really far apart.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-17 14:33:45
Message-ID: 1182090826.6855.48.camel@silverbirch.site
Lists: pgsql-hackers pgsql-patches

On Sun, 2007-06-17 at 12:00 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sun, 2007-06-17 at 08:51 +0100, Heikki Linnakangas wrote:
> >>> Do we need to know it so exactly that we look
> >>> at WALInsertLock? Maybe use info_lck to request the latest page, since
> >>> that is less heavily contended and we need never wait across I/O.
> >> Is there such a value available, that's protected by just info_lck? I
> >> can't see one.
> >
> > XLogCtl->LogwrtRqst.Write
>
> That's the Write location, but checkpoint_segments is calculated
> against the Insert location. In a normal OLTP scenario they would be
> close to each other, but if you're doing a huge data load in a
> transaction (restoring from a backup, for example), they could be
> really far apart.

XLogCtl->LogwrtRqst.Write is updated every time we insert an xlog record
that advances to a new page. It isn't exactly up to date, but it lags
behind by no more than a page.
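What Simon is suggesting could look roughly like the sketch below. The
struct is a stand-in, since the real XLogCtlData and info_lck are
private to xlog.c, and a mutex plays the spinlock's role here:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real XLogRecPtr */

/* Invented stand-in for the relevant slice of XLogCtlData. */
typedef struct
{
    pthread_mutex_t info_lck;        /* the real info_lck is a spinlock */
    XLogRecPtr      LogwrtRqstWrite; /* i.e. XLogCtl->LogwrtRqst.Write */
} XLogCtlSketch;

/*
 * Cheap estimate of the insert position for checkpoint pacing: read
 * LogwrtRqst.Write under info_lck instead of taking WALInsertLock.
 * Per Simon, it lags the true insert pointer by at most one page.
 */
static XLogRecPtr
get_approx_insert_ptr(XLogCtlSketch *ctl)
{
    XLogRecPtr  ptr;

    pthread_mutex_lock(&ctl->info_lck);
    ptr = ctl->LogwrtRqstWrite;
    pthread_mutex_unlock(&ctl->info_lck);
    return ptr;
}
```

A page of slack is harmless here, since the value only feeds the pacing
of writes against checkpoint_segments, not any correctness decision.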

LogwrtRqst and LogwrtResult may differ substantially in the situation
you mention.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-17 15:40:40
Message-ID: 467555F8.9040109@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Simon Riggs wrote:
> XLogCtl->LogwrtRqst.Write is updated every time we insert an xlog record
> that advances to a new page. It isn't exactly up to date, but it lags
> behind by no more than a page.

Oh, ok. That would work just fine then.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Michael Paesold <mpaesold(at)gmx(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-18 05:50:57
Message-ID: 20070618111319.6E47.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:

> Michael Paesold wrote:
> > Why do you think so? Is it too much risk to adapt the sorted writes? The
> > numbers shown by ITAGAKI Takahiro looked quite impressive, at least for
> > large shared_buffers configurations. The reactions where rather
> > positive, too.
>
> Well, it is a very recent idea, and it's not clear that it's a win under
> all circumstances. Adopting that would need more testing, and at this
> late stage I'd like to just wrap up what we have and come back to this
> idea for 8.4.

I'm planning it for 8.4. Through the discussion of LDC and the sorted
writes, we found that kernels are not so clever that we can leave all
I/O scheduling to them. At the very least, we should give them more hints.

Many methods for managing I/O ourselves have been proposed over the years:
read-ahead, aio, bulk extension of files, and so on. Sorted writes are one
of them. We should test them under various workloads and adopt what proves
to be good.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-18 23:44:52
Message-ID: BF27941E-B9BA-47EA-8310-0433CE61618A@decibel.org
Lists: pgsql-hackers pgsql-patches

On Jun 17, 2007, at 4:39 AM, Simon Riggs wrote:
>>> pg_start_backup() should be a normal checkpoint I think. No need for
>>> backup to be an intrusive process.
>>
>> Good point. A spread out checkpoint can take a long time to finish,
>> though. Is there risk for running into a timeout or something if it
>> takes say 10 minutes for a call to pg_start_backup to finish?
>
> That would be annoying, but the alternative is for backups to seriously
> affect performance, which would defeat the object of the HOT backup.
> It's not like it's immediate right now, so we'd probably be moving from
> 2-3 mins to 10 mins in your example. Most people are expecting their
> backups to take a long time anyway, so that's OK.

We should document it, though; otherwise I can see a bunch of
confused users wondering why pg_start_backup takes so long. Remember
that with longer checkpoints, the odds of them calling
pg_start_backup during one and having to wait are much greater.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jim Nasby <decibel(at)decibel(dot)org>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, revised patch
Date: 2007-06-19 12:04:24
Message-ID: 4677C648.6010401@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Jim Nasby wrote:
> On Jun 17, 2007, at 4:39 AM, Simon Riggs wrote:
>>>> pg_start_backup() should be a normal checkpoint I think. No need for
>>>> backup to be an intrusive process.
>>>
>>> Good point. A spread out checkpoint can take a long time to finish,
>>> though. Is there risk for running into a timeout or something if it
>>> takes say 10 minutes for a call to pg_start_backup to finish?
>>
>> That would be annoying, but the alternative is for backups to seriously
>> affect performance, which would defeat the object of the HOT backup.
>> It's not like it's immediate right now, so we'd probably be moving from
>> 2-3 mins to 10 mins in your example. Most people are expecting their
>> backups to take a long time anyway, so that's OK.
>
> We should document it, though; otherwise I can see a bunch of confused
> users wondering why pg_start_backup takes so long. Remember that with
> longer checkpoints, the odds of them calling pg_start_backup during one
> and having to wait are much greater.

If pg_start_backup initiates a non-immediate, smoothed checkpoint, what
about a checkpoint that's already in progress when pg_start_backup is
called? Should that be hurried, so we can start the backup sooner?
Probably not, which means we'll need yet another mode for
RequestCheckpoint: request a non-immediate checkpoint, but if there's a
checkpoint already running, don't rush it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com