Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-06-26 08:37:32
Message-ID: 51CAA84C.7030901@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for comments!

>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
>>> Hmm, so the write patch doesn't do much, but the fsync patch makes the response
>>> times somewhat smoother. I'd suggest that we drop the write patch for now, and
>>> focus on the fsyncs.
The write patch is effective for TPS! I think that delaying the checkpoint writes
reduces the long fsync times and the heavy load in the fsync phase, because the
data goes out to the slow disk already during the write phase. Therefore, the
combination of the write patch and the fsync patch suits better than the write
patch alone. I think the amount of WAL written at the beginning of a checkpoint
can indicate the effect of the write patch, since full-page writes are
concentrated right after a checkpoint starts.

>>> What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold
>>> settings did you use with the fsync patch? It's disabled by default.
I used these parameters:
checkpointer_fsync_delay_ratio = 1
checkpointer_fsync_delay_threshold = 1000ms
In effect, that gives a long sleep after slow fsyncs.
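Roughly, the behavior I intended is like this simplified sketch (illustrative
names, not the actual patch code): sleep after an fsync in proportion to how
long it took, but only when it exceeded the threshold.

/* GUCs from my patch, shown as plain variables for this sketch */
static double	checkpointer_fsync_delay_ratio = 1.0;
static int		checkpointer_fsync_delay_threshold = 1000;	/* ms */

/* Simplified sketch of the proportional fsync delay: sleep in
 * proportion to how long the previous fsync took, but only when it
 * exceeded the threshold.  pg_usleep() takes microseconds. */
static void
CheckpointFsyncDelay(double fsync_msec)
{
	if (fsync_msec > checkpointer_fsync_delay_threshold)
		pg_usleep((long) (fsync_msec * checkpointer_fsync_delay_ratio * 1000.0));
}

With ratio = 1 and threshold = 1000ms, a 13-second fsync is followed by a
13-second sleep, while millisecond-scale fsyncs cause no sleep at all.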

The other relevant parameters were:
checkpoint_completion_target = 0.7
checkpoint_smooth_target = 0.3
checkpoint_smooth_margin = 0.5
checkpointer_write_delay = 200ms

>>> Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and the
>>> assumption that fsync phase is 10% of the total checkpoint duration. I suspect 100ms
>>> is too small to have much effect, but that happens to be what we have
>>> currently in
>>> CheckpointWriteDelay(). Could you test this patch along with yours? If you can test
>>> with different delays (e.g 100ms, 500ms and 1000ms) and different ratios between
>>> the write and fsync phase (e.g 0.5, 0.7, 0.9), to get an idea of how sensitive the
>>> test case is to those settings.
It seems an interesting algorithm! I will test it with the same settings and
study the essence of your patch.
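If I read your description right, the fsync phase would look roughly like this
(a sketch of my understanding, not your patch itself; the helper name is
illustrative, and IsCheckpointOnSchedule() is the scheduling check as in
checkpointer.c):

/* Sketch of a fixed delay between fsyncs.  The fsync phase is assumed
 * to be the last 10% of the checkpoint, so progress runs from 0.9 to
 * 1.0 as files are synced. */
static void
CheckpointFsyncDelayFixed(int files_synced, int files_total)
{
	double	progress = 0.9 + 0.1 * (double) files_synced / files_total;

	/* Sleep 100ms between fsyncs, but only while we are still on
	 * schedule; skip the sleep when the checkpoint is behind. */
	if (IsCheckpointOnSchedule(progress))
		pg_usleep(100000L);		/* 100 ms */
}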

(2013/06/26 5:28), Heikki Linnakangas wrote:
> On 25.06.2013 23:03, Robert Haas wrote:
>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com> wrote:
>>> I'm not sure it's a good idea to sleep proportionally to the time it took to
>>> complete the previous fsync. If you have a 1GB cache in the RAID controller,
>>> fsyncing a 1GB segment will fill it up. But since it fits in cache, it
>>> will return immediately. So we proceed fsyncing other files, until the cache
>>> is full and the fsync blocks. But once we fill up the cache, it's likely
>>> that we're hurting concurrent queries. ISTM it would be better to stay under
>>> that threshold, keeping the I/O system busy, but never fill up the cache
>>> completely.
>>
>> Isn't the behavior implemented by the patch a reasonable approximation
>> of just that? When the fsyncs start to get slow, that's when we start
>> to sleep. I'll grant that it would be better to sleep when the
>> fsyncs are *about* to get slow, rather than when they actually have
>> become slow, but we have no way to know that.
>
> Well, that's the point I was trying to make: you should sleep *before* the fsyncs
> get slow.
Actually, fsync time changes with the progress of the OS's background disk
writes, and we cannot know that progress before the fsyncs run. I think
Robert's argument is right. Please see the following log messages.

* fsync of a file whose pages had already been written out to disk
DEBUG: 00000: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
DEBUG: 00000: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
DEBUG: 00000: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
DEBUG: 00000: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
DEBUG: 00000: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
DEBUG: 00000: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec

* fsync of a file whose pages had mostly not been written out to disk yet
DEBUG: 00000: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec
DEBUG: 00000: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec
DEBUG: 00000: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec
DEBUG: 00000: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
DEBUG: 00000: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
DEBUG: 00000: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec

I think it is wasteful to sleep after every fsync, including the short ones.
Also, fsync times vary with the hardware (such as the RAID card and the kind
and number of disks) and with the OS, so it is difficult to choose a fixed
sleep time. My proposed method adapts better in these cases.

>> The only feedback we have on how bad things are is how long it took
>> the last fsync to complete, so I actually think that's a much better
>> way to go than any fixed sleep - which will often be unnecessarily
>> long on a well-behaved system, and which will often be far too short
>> on one that's having trouble. I'm inclined to think Kondo-san
>> has got it right.
>
> Quite possible, I really don't know. I'm inclined to first try the simplest thing
> possible, and only make it more complicated if that's not good enough.
> Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep between
> every fsync, unless you're behind the schedule, is even simpler. In particular,
> it's easier to tie that into the checkpoint scheduler - I'm not sure how you'd
> measure progress or determine how long to sleep unless you assume that every
> fsync is the same.
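To make the scheduling point concrete, here are the two progress estimators I
can think of for the fsync phase (both function bodies are my illustration,
not code from either patch):

/* Sketch only: estimating progress within the fsync phase.  The
 * simpler scheme assumes every fsync costs the same, so progress is
 * the fraction of files synced.  Weighting each file by the number of
 * pages written to it during the checkpoint is one conceivable
 * refinement, at the cost of more bookkeeping. */
static double
fsync_phase_progress(int files_synced, int files_total,
					 long pages_synced, long pages_total)
{
	if (pages_total > 0)
		return (double) pages_synced / pages_total;		/* size-weighted */
	return (double) files_synced / files_total;			/* equal-cost */
}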
I think what is important in the fsync phase is to finish in as short a time
as possible without freezing I/O, to keep the checkpoint on schedule, and to
be good for the executing transactions. I will try to improve the patch from
that point of view. By the way, a DBT-2 benchmark run takes a long time (maybe
four hours), so I hope you don't mind my late replies very much! :-)

Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
