Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-19 07:53:36
Message-ID: 51E8F080.4040506@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/07/19 0:41), Greg Smith wrote:
> On 7/18/13 11:04 AM, Robert Haas wrote:
>> On a system where fsync is sometimes very very slow, that
>> might result in the checkpoint overrunning its time budget - but SO
>> WHAT?
>
> Checkpoints provide a boundary on recovery time. That is their only purpose.
> You can always do better by postponing them, but you've now changed the agreement
> with the user about how long recovery might take.

These days, users who consider system availability important use a synchronous
replication cluster. And, as Robert says, users who cannot build a cluster
will simply not enable this feature through its GUC.

When IO becomes busy while fsync() is being called, my patch avoids putting
extra IO load on the system through fsync(). This is actually the same approach
as the OS writeback mechanism. I read the kernel source, fs/fs-writeback.c in
linux-2.6.32-358.0.1.el6, which is the latest RHEL 6.4 kernel. It appears that
wb_writeback() controls disk IO in the OS writeback path. Please see the source
code below. If the OS thinks IO is busy, it bails out and does not issue more
IO.

fs/fs-writeback.c @wb_writeback()
623         /*
624          * For background writeout, stop when we are below the
625          * background dirty threshold
626          */
627         if (work->for_background && !over_bground_thresh())
628                 break;
629
630         wbc.more_io = 0;
631         wbc.nr_to_write = MAX_WRITEBACK_PAGES;
632         wbc.pages_skipped = 0;
633
634         trace_wbc_writeback_start(&wbc, wb->bdi);
635         if (work->sb)
636                 __writeback_inodes_sb(work->sb, wb, &wbc);
637         else
638                 writeback_inodes_wb(wb, &wbc);
639         trace_wbc_writeback_written(&wbc, wb->bdi);
640         work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
641         wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
642
643         /*
644          * If we consumed everything, see if we have more
645          */
646         if (wbc.nr_to_write <= 0)
647                 continue;
648         /*
649          * Didn't write everything and we don't have more IO, bail
650          */
651         if (!wbc.more_io)
652                 break;
653         /*
654          * Did we write something? Try for more
655          */
656         if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
657                 continue;
658         /*
659          * Nothing written. Wait for some inode to
660          * become available for writeback. Otherwise
661          * we'll just busyloop.
662          */
663         spin_lock(&inode_lock);
664         if (!list_empty(&wb->b_more_io)) {
665                 inode = list_entry(wb->b_more_io.prev,
666                                    struct inode, i_list);
667                 trace_wbc_writeback_wait(&wbc, wb->bdi);
668                 inode_wait_for_writeback(inode);
669         }
670         spin_unlock(&inode_lock);
671 }
672
673 return wrote;

I would especially like you to look at lines 631, 651, and 656.
MAX_WRITEBACK_PAGES is 1024 (1024 * 4096 bytes = 4 MB). The OS writeback
scheduler does not write more than MAX_WRITEBACK_PAGES in one pass, because
writing more data than that at once would make IO busy. And if it cannot write
anything at all, the OS considers that IO performance needs time to recover.
This is the same logic as my patch.
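
To make the branch order in lines 646 to 662 easier to follow, here is a small
user-space sketch of the decision taken after each writeback pass. It is only
an illustration: the enum and the function name wb_next_action() are
hypothetical and do not exist in the kernel or in my patch. Only the order of
the checks and the value of MAX_WRITEBACK_PAGES come from the excerpt above.

```c
#include <assert.h>

#define MAX_WRITEBACK_PAGES 1024   /* value quoted in this message */

/* Hypothetical labels for the four outcomes of one writeback pass. */
enum wb_action {
    WB_RETRY_FULL,      /* whole quota consumed: loop again (line 646) */
    WB_BAIL,            /* quota left and no more IO queued: stop (line 651) */
    WB_RETRY_PARTIAL,   /* some pages written: try for more (line 656) */
    WB_WAIT_INODE       /* nothing written: wait for an inode (line 663 on) */
};

/*
 * nr_to_write: quota remaining after the pass (starts at MAX_WRITEBACK_PAGES)
 * more_io:     non-zero if inodes are still queued for writeback
 */
enum wb_action wb_next_action(long nr_to_write, int more_io)
{
    assert(nr_to_write <= MAX_WRITEBACK_PAGES);

    if (nr_to_write <= 0)
        return WB_RETRY_FULL;
    if (!more_io)
        return WB_BAIL;
    if (nr_to_write < MAX_WRITEBACK_PAGES)
        return WB_RETRY_PARTIAL;
    return WB_WAIT_INODE;
}
```

For example, wb_next_action(500, 0) gives WB_BAIL: part of the quota is left
and nothing more is queued, so the loop stops instead of pushing more IO.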

In addition, you have said that performance improves when checkpoint_timeout or
checkpoint_completion_target is set to a large value, but is that true in all
cases? Dirty buffers that are 30 seconds old or more are written out at
5-second intervals, so when there are many pauses between writes, the result
may be inefficient random writes. For example, in the worst case, when a 200 ms
sleep is inserted after every write, only 25 pages (200 KB) can be written in
5 seconds. I think that is an inefficient way to write. When
checkpoint_completion_target is actually enlarged, performance may fall in some
cases. I think this is because the final fsync becomes heavy as a result of the
writes having been issued slowly.
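
The worst-case figure above is simple arithmetic, sketched here for
reference. The sketch assumes one 8 KB page is written per sleep and that the
write itself takes no time. The function names and the PAGE_KB macro are
hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define PAGE_KB 8   /* assumed page size: 8 KB, the PostgreSQL default */

/* Pages written in interval_ms when each page is followed by sleep_ms. */
long pages_per_interval(long interval_ms, long sleep_ms)
{
    assert(sleep_ms > 0);
    return interval_ms / sleep_ms;
}

/* The same amount expressed in kilobytes. */
long kb_per_interval(long interval_ms, long sleep_ms)
{
    return pages_per_interval(interval_ms, sleep_ms) * PAGE_KB;
}
```

With a 5000 ms interval and a 200 ms sleep, pages_per_interval() returns 25 and
kb_per_interval() returns 200, which matches the numbers in the paragraph
above.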

I would like you to give me an itemized list of what would count as proof for
my patch, because the DBT-2 benchmark takes a long time: about 3 to 4 hours to
test a single setting. Of course, I think it is important to obtain your
consent.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
