Re: Load Distributed Checkpoints, final patch

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Load Distributed Checkpoints, final patch
Date: 2007-06-26 18:49:15
Message-ID: 46815FAB.7030205@enterprisedb.com
Lists: pgsql-patches

Here's the latest revision of Itagaki-san's Load Distributed Checkpoints patch:

* bgwriter all-scan is gone. We might or might not improve the LRU-sweep
later so that it can perform any duties the all-sweep might have had
besides reducing the impact of a checkpoint.

* one new GUC variable, called checkpoint_completion_target. Default is
0.5, which should be more than enough to smooth checkpoints on a system
that's not already overloaded. It's also small enough not to hurt
recovery times much on a system that's not already struggling to meet
its recovery time requirements. You can set it to 0 if you want the old
checkpoint behavior for some reason. Maximum is 0.9, to leave some
headroom for fsync and any other things that need to happen during a
checkpoint.

* The minimum rate we write at during a checkpoint is 1 page /
bgwriter_delay.

* Added a paragraph to the user manual to describe the feature. Also
updated the formula for the expected number of WAL segments; the new
formula is (2 + checkpoint_completion_target) * checkpoint_segments + 1.
I believe the comments in xlog.c regarding XLOGfileslop are still valid.

* The signaling in bgwriter.c is based on a spinlock. Tom advised against
using the spinlock when it's not strictly necessary, but IMHO it's easier
to understand this way. Feel free to revert that when committing if you
disagree.

* The algorithm for estimating progress wrt. checkpoint_segments is the
same as before (a rough sketch of the idea follows this list). Bursty WAL
activity will lead to bursty checkpoint activity, but I wanted to keep it
simple for now. In any case, the I/O rate will be smoother than without
the patch.

* There are some DEBUG elogs which we might want to replace with better
ones later, per the patch in the patch queue by Greg Smith. The ones
that are there now are useful for testing this feature, but are a bit
crude for DBAs to use.
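
To make the progress estimation and the completion target a bit more
concrete, here is a rough, self-contained sketch of the kind of check the
throttling does. The names, the example values and the exact arithmetic
are illustrative only, not the code in the patch:

#include <stdbool.h>

/* Stand-ins for the real GUC variables; the values are just examples. */
static double checkpoint_completion_target = 0.5;
static double checkpoint_timeout = 300.0;   /* seconds */
static double checkpoint_segments = 3.0;

/*
 * written_fraction is the fraction of dirty buffers already written for
 * this checkpoint; elapsed_secs and segments_used are measured since the
 * checkpoint started.  The writes are spread over the first
 * checkpoint_completion_target fraction of the checkpoint interval; if
 * we're behind on either the time estimate or the WAL estimate we keep
 * writing (always at least one page per bgwriter_delay).
 */
static bool
checkpoint_on_schedule(double written_fraction, double elapsed_secs,
                       double segments_used)
{
    double  target = checkpoint_completion_target;
    double  time_progress;
    double  xlog_progress;

    if (target <= 0)
        return false;           /* 0 = old behavior: write at full speed */

    time_progress = elapsed_secs / (checkpoint_timeout * target);
    xlog_progress = segments_used / (checkpoint_segments * target);

    return written_fraction >= time_progress &&
           written_fraction >= xlog_progress;
}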

Barring any objections from committer, I'm finished with this patch.

I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
Smith's suggestion just to see what happens, but I doubt that will
change my mind on the above decisions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
ldc-justwrites-6.patch text/x-diff 56.4 KB

From: Michael Glaesemann <grzm(at)seespotcode(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 19:08:46
Message-ID: D603BFC6-8E55-4BDE-9AFA-5B5196D5B310@seespotcode.net
Lists: pgsql-patches


On Jun 26, 2007, at 13:49 , Heikki Linnakangas wrote:

> Maximum is 0.9, to leave some headroom for fsync and any other
> things that need to happen during a checkpoint.

I think it might be more user-friendly to make the maximum 1 (meaning
as much smoothing as you could possibly get) and internally reserve a
certain amount for whatever headroom might be required. It's more
common for users to see a value range from 0 to 1 rather than 0 to 0.9.

Michael Glaesemann
grzm seespotcode net


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 19:10:57
Message-ID: 26583.1182885057@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Barring any objections from committer, I'm finished with this patch.

Sounds great, I'll start looking this over.

> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
> Smith's suggestion just to see what happens, but I doubt that will
> change my mind on the above decisions.

When do you expect to have those results?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 19:12:07
Message-ID: 46816507.1070408@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Barring any objections from committer, I'm finished with this patch.
>
> Sounds great, I'll start looking this over.
>
>> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
>> Smith's suggestion just to see what happens, but I doubt that will
>> change my mind on the above decisions.
>
> When do you expect to have those results?

In a few days. I'm doing long tests because the variability in the 1h
tests was very high.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Michael Glaesemann <grzm(at)seespotcode(dot)net>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 19:14:10
Message-ID: 46816582.7060900@enterprisedb.com
Lists: pgsql-patches

Michael Glaesemann wrote:
>
> On Jun 26, 2007, at 13:49 , Heikki Linnakangas wrote:
>
>> Maximum is 0.9, to leave some headroom for fsync and any other things
>> that need to happen during a checkpoint.
>
> I think it might be more user-friendly to make the maximum 1 (meaning as
> much smoothing as you could possibly get) and internally reserve a
> certain amount for whatever headroom might be required. It's more
> common for users to see a value range from 0 to 1 rather than 0 to 0.9.

It would then be counter-intuitive if you set it to 1.0, and see that
your checkpoints consistently take 90% of the checkpoint interval.

We could just allow any value up to 1.0, and note in the docs that you
should leave some headroom, unless you don't mind starting the next
checkpoint a bit late. That actually sounds pretty good.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Michael Glaesemann" <grzm(at)seespotcode(dot)net>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 19:54:49
Message-ID: 87sl8etrsm.fsf@oxford.xeocode.com
Lists: pgsql-patches


"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> We could just allow any value up to 1.0, and note in the docs that you should
> leave some headroom, unless you don't mind starting the next checkpoint a bit
> late. That actually sounds pretty good.

What exactly happens if a checkpoint takes so long that the next checkpoint
starts? Aside from it not actually helping, is there much reason to avoid this
situation? Have we ever actually tested it?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Michael Glaesemann <grzm(at)seespotcode(dot)net>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 20:00:54
Message-ID: 46817076.8030007@enterprisedb.com
Lists: pgsql-patches

Gregory Stark wrote:
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>
>> We could just allow any value up to 1.0, and note in the docs that you should
>> leave some headroom, unless you don't mind starting the next checkpoint a bit
>> late. That actually sounds pretty good.
>
> What exactly happens if a checkpoint takes so long that the next checkpoint
> starts? Aside from it not actually helping, is there much reason to avoid this
> situation?

Not really. We might run out of preallocated WAL segments, and will have
to create more. Recovery could be longer than expected since the real
checkpoint interval ends up being longer, but you can't make very
accurate recovery time estimations anyway.

> Have we ever actually tested it?

I haven't.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 20:35:36
Message-ID: Pine.GSO.4.64.0706261617400.24678@westnet.com
Lists: pgsql-patches

On Tue, 26 Jun 2007, Gregory Stark wrote:

> What exactly happens if a checkpoint takes so long that the next checkpoint
> starts? Aside from it not actually helping, is there much reason to avoid this
> situation? Have we ever actually tested it?

More segments get created, and because of how they are cleared at the
beginning this causes its own mini-I/O storm through the same buffered
write channel the checkpoint writes are going into (which may or may not
be the same way normal WAL writes go, depending on whether you're using
O_[D]SYNC WAL writes). I've seen some weird and intermittent breakdowns
from the contention that occurs when this happens, and it's certainly
something to be avoided.

To test it you could just use a big buffer cache, write like mad to it,
and make checkpoint_segments smaller than it should be for that workload.
It's easy enough to kill yourself exactly this way right now though, and
the fact that LDC gives you a parameter to aim this particular foot-gun
more precisely isn't a big deal IMHO as long as the documentation is
clear.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Michael Glaesemann <grzm(at)seespotcode(dot)net>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 20:44:01
Message-ID: 28277.1182890641@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> We could just allow any value up to 1.0, and note in the docs that you
> should leave some headroom, unless you don't mind starting the next
> checkpoint a bit late. That actually sounds pretty good.

Yeah, that sounds fine. There isn't actually any harm in starting a
checkpoint later than otherwise expected, is there? The worst
consequence I can think of is a backend having to take time to
manufacture a new xlog segment, because we didn't finish a checkpoint
in time to recycle old ones. This might be better handled by allowing
a bit more slop in the number of recycled-into-the-future xlog segments.

Come to think of it, shouldn't we be allowing some extra slop in the
number of future segments to account for xlog archiving delays, when
that's enabled?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Glaesemann <grzm(at)seespotcode(dot)net>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 20:49:29
Message-ID: 46817BD9.1090502@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> We could just allow any value up to 1.0, and note in the docs that you
>> should leave some headroom, unless you don't mind starting the next
>> checkpoint a bit late. That actually sounds pretty good.
>
> Yeah, that sounds fine. There isn't actually any harm in starting a
> checkpoint later than otherwise expected, is there? The worst
> consequence I can think of is a backend having to take time to
> manufacture a new xlog segment, because we didn't finish a checkpoint
> in time to recycle old ones. This might be better handled by allowing
> a bit more slop in the number of recycled-into-the-future xlog segments.
>
> Come to think of it, shouldn't we be allowing some extra slop in the
> number of future segments to account for xlog archiving delays, when
> that's enabled?

XLogFileSlop is currently 2 * checkpoint_segments + 1 since the last
checkpoint, which is just enough to accommodate a checkpoint that lasts
the full checkpoint interval. If we want to keep as much "slop" there as
before, then yes, that should be increased to (2 +
checkpoint_completion_target) * checkpoint_segments + 1, or just 3 *
checkpoint_segments if we want to keep it simple.
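
Spelling out the arithmetic as code, purely for illustration (this is not
the actual XLOGfileslop definition in xlog.c, and the function name is
made up):

/*
 * Future-segment slop, measured from the previous checkpoint's redo
 * location.  The old value is 2 * checkpoint_segments + 1; keeping the
 * same headroom under LDC would make it
 * (2 + checkpoint_completion_target) * checkpoint_segments + 1.
 */
static int
xlog_file_slop(int checkpoint_segments, double completion_target)
{
    return (int) ((2.0 + completion_target) * checkpoint_segments) + 1;
}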

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-26 22:25:12
Message-ID: Pine.GSO.4.64.0706261806590.10548@westnet.com
Lists: pgsql-patches

On Tue, 26 Jun 2007, Heikki Linnakangas wrote:

> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg Smith's
> suggestion just to see what happens, but I doubt that will change my mind on
> the above decisions.

I don't either; at worst I'd expect a small documentation update, perhaps
with some warnings based on what's discovered there. The form you've
added checkpoint_completion_target in is sufficient to address all the
serious concerns I had: I can turn it off, I can smooth just a bit without
increasing recovery time too much, or I can go all-out smooth.

Certainly no one should consider waiting for the tests I asked you about a
hurdle to getting this patch committed; slowing that down was never my
intention in bringing that up. I'm just curious to see if anything
scurries out of some of the darker corners in this area when they're
illuminated. I'd actually like to see this get committed relatively soon
because there are two interleaved merges stuck behind this one (the more
verbose logging patch and the LRU modifications).

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-28 00:19:48
Message-ID: 17558.1182989988@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Here's the latest revision of Itagaki-san's Load Distributed Checkpoints patch:

Applied with some minor revisions to make some of the internal APIs a
bit cleaner; mostly, it seemed like a good idea to replace all those
bool parameters with a flag-bits approach, so that you could have
something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
"false, true, true, false" ...

For the moment I removed all the debugging elog's in the patch.
We still have Greg Smith's checkpoint logging patch to look at
(which I suppose needs adjustment now), and that seems like the
appropriate venue to consider what to put in.

Also, the question of redesigning the bgwriter's LRU scan is
still open. I believe that's on Greg's plate, too.

One other closely connected item that might be worth looking at is the
code for creating new future xlog segments (PreallocXlogFiles). Greg
was griping upthread about xlog segment creation being a real
performance drag. I realized that as we currently have it set up, the
checkpoint code is next to useless for high-WAL-volume installations,
because it only considers making *one* future XLOG segment. Once you've
built up enough XLOG segments, the system isn't too bad about recycling
them, but there will be a nasty startup transient where foreground
processes have to stop and make the things. I wonder whether it would
help if we (a) have the bgwriter call PreallocXlogFiles during its
normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
it will make a future segment as soon as we start using the last
existing segment, instead of only when we're nearly done. This would at
least make it more likely that the bgwriter does the work instead of a
foreground process. I'm hesitant to go much further than that, because
I don't want to bloat the minimum disk footprint for low-volume
installations, but the minimum footprint is really 2 xlog files anyway...
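
To illustrate the threshold change in (b), something like the test below
is what I have in mind. This is only a sketch; the function and variable
names are made up, not code from xlog.c:

#include <stdbool.h>

/*
 * current_seg is the segment we're currently inserting into,
 * last_existing_seg the highest-numbered segment already present on disk
 * (preallocated or recycled).
 */
static bool
need_new_future_segment(unsigned int current_seg,
                        unsigned int last_existing_seg)
{
    /*
     * Instead of waiting until we're nearly done with the last existing
     * segment, ask for another one as soon as we've started using it, so
     * the bgwriter (calling this from its normal loop, per point (a))
     * usually gets there before a foreground backend has to.
     */
    return current_seg >= last_existing_seg;
}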

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-28 09:14:05
Message-ID: 46837BDD.2000705@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Here's the latest revision of Itagaki-san's Load Distributed Checkpoints patch:
>
> Applied with some minor revisions to make some of the internal APIs a
> bit cleaner; mostly, it seemed like a good idea to replace all those
> bool parameters with a flag-bits approach, so that you could have
> something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
> "false, true, true, false" ...

Thanks.

> For the moment I removed all the debugging elog's in the patch.
> We still have Greg Smith's checkpoint logging patch to look at
> (which I suppose needs adjustment now), and that seems like the
> appropriate venue to consider what to put in.

Ok, I'll look at that next.

> One other closely connected item that might be worth looking at is the
> code for creating new future xlog segments (PreallocXlogFiles). Greg
> was griping upthread about xlog segment creation being a real
> performance drag. I realized that as we currently have it set up, the
> checkpoint code is next to useless for high-WAL-volume installations,
> because it only considers making *one* future XLOG segment. Once you've
> built up enough XLOG segments, the system isn't too bad about recycling
> them, but there will be a nasty startup transient where foreground
> processes have to stop and make the things. I wonder whether it would
> help if we (a) have the bgwriter call PreallocXlogFiles during its
> normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
> it will make a future segment as soon as we start using the last
> existing segment, instead of only when we're nearly done. This would at
> least make it more likely that the bgwriter does the work instead of a
> foreground process. I'm hesitant to go much further than that, because
> I don't want to bloat the minimum disk footprint for low-volume
> installations, but the minimum footprint is really 2 xlog files anyway...

That seems like a good idea. It might also become a problem if you have
WAL archiving set up and the archiving falls behind so that existing log
files are not recycled fast enough.

The comment in PreallocXlogFiles is out of date:

> /*
> * Preallocate log files beyond the specified log endpoint, according to
> * the XLOGfile user parameter.
> */

As you pointed out, it only preallocates one log file. And there is no
XLOGfile mentioned anywhere else in the source tree.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-28 13:22:19
Message-ID: Pine.GSO.4.64.0706280857030.6275@westnet.com
Lists: pgsql-patches

On Wed, 27 Jun 2007, Tom Lane wrote:

> Also, the question of redesigning the bgwriter's LRU scan is
> still open. I believe that's on Greg's plate, too.

Greg's plate was temporarily fried after his house was hit by lightning
yesterday. I just got everything back on-line again, so no coding
progress, but I think I finished assimilating your "epiphany" during that
time. Now I realize that what you're suggesting is that under healthy
low-load conditions, the LRU scan really should be able to keep up right
behind the clock sweep point. Noting how far behind it is serves as a
measurement of how badly it's failing to match the rate at which
re-usable buffers are being dirtied, and the only question is how fast
and how far it should try to drive its cleaning point forward when that
happens.
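
A sketch of what I mean, with made-up names and no claim that this is how
the real scan is organized or how aggressively it should catch up:

/*
 * Treat the distance between the clock-sweep hand and the point the LRU
 * scan has cleaned up to as the measure of how far behind we are, and try
 * to close some fraction of that gap each bgwriter_delay cycle.
 */
static int
lru_buffers_to_clean(int sweep_pos, int cleaned_up_to, int nbuffers,
                     double gap_fraction)
{
    int     lag = (sweep_pos - cleaned_up_to + nbuffers) % nbuffers;
    int     target = (int) (lag * gap_fraction);

    return (target > 0) ? target : 1;   /* always do a little work */
}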

> Once you've built up enough XLOG segments, the system isn't too bad
> about recycling them, but there will be a nasty startup transient where
> foreground processes have to stop and make the things.

Exactly. I found it problematic in four situations:

1) Slow checkpoint doesn't finish in time and new segments are being
created while the checkpoint is also busy (this is the really bad one)

2) Archive logger stops doing anything (usually because the archive disk
is filled) and nothing gets recycled until that's fixed.

3) checkpoint_segments is changed, so performance is really sluggish
for a bit until all the segments are built back up again

4) You ran an early manual checkpoint, which doesn't seem to recycle as
many segments usefully

Any change that would be more proactive about creating segments in these
situations than the current code would be a benefit, even though these are
not common paths people encounter.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-06-28 13:30:12
Message-ID: 2797.1183037412@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> The comment in PreallocXlogFiles is out of date:

Yeah, I changed it yesterday ...

> As you pointed out, it only preallocates one log file. And there is no
> XLOGfile mentioned anywhere else in the source tree.

If memory serves, there once was a variable there, but we simplified it
out of existence for reasons no longer apparent. Possibly it'd be worth
trolling the CVS log and archives to find out why we did that.

Anyway, what I'm thinking at the moment is that it's not so much that
PreallocXlogFiles needs to do more work as that it needs to be called
more often. Right now we only do it once per checkpoint.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-07-02 09:36:31
Message-ID: 4688C71F.9040305@enterprisedb.com
Lists: pgsql-patches

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>>> I'm scheduling more DBT-2 tests at a high # of warehouses per Greg
>>> Smith's suggestion just to see what happens, but I doubt that will
>>> change my mind on the above decisions.
>>
>> When do you expect to have those results?
>
> In a few days. I'm doing long tests because the variability in the 1h
> tests was very high.

I ran two tests with 200 warehouses to see how LDC behaves on a badly
overloaded system; see tests imola-319 and imola-320. It seems to work
quite well. In fact the checkpoint spike is, relatively speaking, less
severe than with a smaller # of warehouses even in the baseline test run,
and LDC smooths it very nicely.

After those two tests, I noticed that I had full_page_writes=off in all
tests performed earlier :(. That throws off the confidence in those
results, so I ran more tests with full_page_writes on and off to compare
the effect. I also wanted to compare the effectiveness of the patch when
checkpoints are triggered by either checkpoint_timeout or
checkpoint_segments.

imola-326 through imola-330 are all configured so that checkpoints happen
roughly on a 50-minute interval. On imola-326, checkpoints are triggered
by checkpoint_segments, and on imola-327 they're triggered by
checkpoint_timeout. On imola-326, the write phase lasts ~7 minutes, and
on imola-327, it lasts ~10 minutes. Because of full_page_writes, a lot
more WAL is consumed right after starting the checkpoint, so we end up
being more aggressive than necessary at the beginning.

For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
minutes there, and the graphs look very smooth. That suggests that
spreading the writes over a longer time wouldn't make a difference, but
smoothing the rush at the beginning of checkpoint might. I'm going to
try the algorithm I posted, that uses the WAL consumption rate from
previous checkpoint interval in the calculations.

Imola-329 is the same as imola-328, but with an updated CVS source tree
instead of the older tree + patch. The purpose of this test was basically
to just verify that what was committed works the same as the patch.

Imola-330 is comparable with imola-327: checkpoints are triggered by
timeout and full_page_writes=on. But 330 was patched to call
PreallocXlogFiles in the bgwriter, per Tom's idea. According to the logs,
most WAL segments are created by the bgwriter in that test, and response
times look slightly better with the patch, though I'm not sure the
difference is statistically significant.

As before, the results are available at
http://community.enterprisedb.com/ldc/

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-07-02 23:44:54
Message-ID: 200707022344.l62Nis604943@momjian.us
Lists: pgsql-patches

Heikki Linnakangas wrote:
> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
> minutes there, and the graphs look very smooth. That suggests that
> spreading the writes over a longer time wouldn't make a difference, but
> smoothing the rush at the beginning of checkpoint might. I'm going to
> try the algorithm I posted, that uses the WAL consumption rate from
> previous checkpoint interval in the calculations.

One thing that concerns me is that checkpoint smoothing happening just
after the checkpoint is causing I/O at the same time that
full_page_writes is causing additional I/O. Ideally we would do the
smoothing toward the end of the checkpoint cycle, but I realize that has
problems of its own.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-07-03 03:26:39
Message-ID: 27757.1183433199@sss.pgh.pa.us
Lists: pgsql-patches

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Heikki Linnakangas wrote:
>> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
>> minutes there, and the graphs look very smooth. That suggests that
>> spreading the writes over a longer time wouldn't make a difference, but
>> smoothing the rush at the beginning of checkpoint might. I'm going to
>> try the algorithm I posted, that uses the WAL consumption rate from
>> previous checkpoint interval in the calculations.

> One thing that concerns me is that checkpoint smoothing happening just
> after the checkpoint is causing I/O at the same time that
> full_page_writes is causing additional I/O.

I'm tempted to just apply some sort of nonlinear correction to the
WAL-based progress measurement. Squaring it would be cheap but is
probably too extreme. Carrying over info from the previous cycle
doesn't seem like it would help much; rather, the point is exactly
that we *don't* want a constant write speed during the checkpoint.
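
For illustration only, not a proposal for the exact curve: the correction
could be a simple exponent on the raw fraction, where 1.0 is the current
linear behavior, 2.0 is the squaring mentioned above, and something in
between would be milder. The function name is made up.

#include <math.h>

/*
 * Map the raw WAL-consumption fraction (0..1 of the segment budget) onto
 * a corrected progress value.  Because full-page writes front-load WAL
 * consumption, discounting early consumption keeps the checkpoint from
 * rushing right after it starts.
 */
static double
corrected_xlog_progress(double raw_fraction, double exponent)
{
    return pow(raw_fraction, exponent);
}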

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-07-03 08:39:58
Message-ID: 468A0B5E.9060304@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> Heikki Linnakangas wrote:
>>> For comparison, imola-328 has full_page_writes=off. Checkpoints last ~9
>>> minutes there, and the graphs look very smooth. That suggests that
>>> spreading the writes over a longer time wouldn't make a difference, but
>>> smoothing the rush at the beginning of checkpoint might. I'm going to
>>> try the algorithm I posted, that uses the WAL consumption rate from
>>> previous checkpoint interval in the calculations.
>
>> One thing that concerns me is that checkpoint smoothing happening just
>> after the checkpoint is causing I/O at the same time that
>> full_page_writes is causing additional I/O.
>
> I'm tempted to just apply some sort of nonlinear correction to the
> WAL-based progress measurement. Squaring it would be cheap but is
> probably too extreme. Carrying over info from the previous cycle
> doesn't seem like it would help much; rather, the point is exactly
> that we *don't* want a constant write speed during the checkpoint.

While thinking about this, I made an observation on full_page_writes.
Currently, we perform a full page write whenever LSN < RedoRecPtr. If
we're clever, we can skip or defer some of the full page writes:

The rule is that when we replay, we need to always replay a full page
image before we apply any regular WAL records to the page. When we begin
a checkpoint, there are two possible outcomes: we crash before the new
checkpoint is finished, and we replay starting from the previous redo
ptr; or we finish the checkpoint successfully, and we replay starting
from the new redo ptr (or we don't crash and don't need to recover).

To be able to recover from the previous redo ptr, we don't need to write
a full page image if we have already written one since the previous redo
ptr.

To be able to recover from the new redo ptr, we don't need to write a
full page image, if we haven't flushed the page yet. It will be written
and fsync'd by the time the checkpoint finishes.

IOW, we can skip the full page image for a page if we have already taken
a full page image of it since the previous checkpoint, and we haven't yet
flushed it during the current checkpoint.
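
In code form the rule is roughly this; purely illustrative, the flag names
are made up, and tracking them correctly is exactly the bookkeeping I'm
still trying to work out:

#include <stdbool.h>

static bool
can_skip_full_page_image(bool fpi_taken_since_prev_redo,
                         bool flushed_during_current_ckpt)
{
    /*
     * Recovery from the previous redo ptr is covered because a full page
     * image already exists since then; recovery from the new redo ptr is
     * covered because the page hasn't been flushed yet, so it will be
     * written and fsync'd before the checkpoint completes.
     */
    return fpi_taken_since_prev_redo && !flushed_during_current_ckpt;
}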

This might reduce the overall WAL I/O a little bit, but more
importantly, it spreads the impact of taking full page images over the
checkpoint duration. That's a good thing on its own, but it also makes
it unnecessary to compensate for the full_page_writes rush in the
checkpoint smoothing.

I'm still trying to get my head around the bookkeeping required to get
that right; I think it's possible using the new BM_CHECKPOINT_NEEDED
flag and a new flag in the page header to mark pages for which we skipped
taking the full page image when they were last modified.

For 8.3, we should probably just do some simple compensation in the
checkpoint throttling code, if we want to do anything at all. But this
is something to think about in the future.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-07-03 10:04:55
Message-ID: 87y7hx4xbs.fsf@oxford.xeocode.com
Lists: pgsql-patches


"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> For 8.3, we should probably just do some simple compensation in the checkpoint
> throttling code, if we want to do anything at all. But this is something to
> think about in the future.

Just as a stress test, it might be interesting to run a quick TPC-C test
with very short checkpoint intervals, something like 30s, just to make
sure that the logic is all correct and unexpected things don't start
happening.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-07-03 14:02:57
Message-ID: 7002.1183471377@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> While thinking about this, I made an observation on full_page_writes.
> Currently, we perform a full page write whenever LSN < RedoRecPtr. If
> we're clever, we can skip or defer some of the full page writes:

I'm not convinced this is safe; in particular, ISTM that a PITR slave
following the WAL log is likely to be at risk if it tries to restart from
a checkpoint after which you've omitted some full-page images. There's
no guarantee it will have flushed pages at the same spots the master did.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2007-09-26 08:32:21
Message-ID: 200709260832.l8Q8WLG10662@momjian.us
Lists: pgsql-patches


This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > Here's the latest revision of Itagaki-san's Load Distributed Checkpoints patch:
>
> Applied with some minor revisions to make some of the internal APIs a
> bit cleaner; mostly, it seemed like a good idea to replace all those
> bool parameters with a flag-bits approach, so that you could have
> something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
> "false, true, true, false" ...
>
> For the moment I removed all the debugging elog's in the patch.
> We still have Greg Smith's checkpoint logging patch to look at
> (which I suppose needs adjustment now), and that seems like the
> appropriate venue to consider what to put in.
>
> Also, the question of redesigning the bgwriter's LRU scan is
> still open. I believe that's on Greg's plate, too.
>
> One other closely connected item that might be worth looking at is the
> code for creating new future xlog segments (PreallocXlogFiles). Greg
> was griping upthread about xlog segment creation being a real
> performance drag. I realized that as we currently have it set up, the
> checkpoint code is next to useless for high-WAL-volume installations,
> because it only considers making *one* future XLOG segment. Once you've
> built up enough XLOG segments, the system isn't too bad about recycling
> them, but there will be a nasty startup transient where foreground
> processes have to stop and make the things. I wonder whether it would
> help if we (a) have the bgwriter call PreallocXlogFiles during its
> normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
> it will make a future segment as soon as we start using the last
> existing segment, instead of only when we're nearly done. This would at
> least make it more likely that the bgwriter does the work instead of a
> foreground process. I'm hesitant to go much further than that, because
> I don't want to bloat the minimum disk footprint for low-volume
> installations, but the minimum footprint is really 2 xlog files anyway...
>
> regards, tom lane

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Load Distributed Checkpoints, final patch
Date: 2008-03-11 21:06:27
Message-ID: 200803112106.m2BL6Rp09789@momjian.us
Lists: pgsql-patches


Added to TODO:

* Test to see if calling PreallocXlogFiles() from the background writer
will help with WAL segment creation latency

http://archives.postgresql.org/pgsql-patches/2007-06/msg00340.php

---------------------------------------------------------------------------

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > Here's the latest revision of Itagaki-san's Load Distributed Checkpoints patch:
>
> Applied with some minor revisions to make some of the internal APIs a
> bit cleaner; mostly, it seemed like a good idea to replace all those
> bool parameters with a flag-bits approach, so that you could have
> something like "CHECKPOINT_FORCE | CHECKPOINT_WAIT" instead of
> "false, true, true, false" ...
>
> For the moment I removed all the debugging elog's in the patch.
> We still have Greg Smith's checkpoint logging patch to look at
> (which I suppose needs adjustment now), and that seems like the
> appropriate venue to consider what to put in.
>
> Also, the question of redesigning the bgwriter's LRU scan is
> still open. I believe that's on Greg's plate, too.
>
> One other closely connected item that might be worth looking at is the
> code for creating new future xlog segments (PreallocXlogFiles). Greg
> was griping upthread about xlog segment creation being a real
> performance drag. I realized that as we currently have it set up, the
> checkpoint code is next to useless for high-WAL-volume installations,
> because it only considers making *one* future XLOG segment. Once you've
> built up enough XLOG segments, the system isn't too bad about recycling
> them, but there will be a nasty startup transient where foreground
> processes have to stop and make the things. I wonder whether it would
> help if we (a) have the bgwriter call PreallocXlogFiles during its
> normal loop, and (b) back the slop in PreallocXlogFiles way off, so that
> it will make a future segment as soon as we start using the last
> existing segment, instead of only when we're nearly done. This would at
> least make it more likely that the bgwriter does the work instead of a
> foreground process. I'm hesitant to go much further than that, because
> I don't want to bloat the minimum disk footprint for low-volume
> installations, but the minimum footprint is really 2 xlog files anyway...
>
> regards, tom lane

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +