Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Ants Aasma'" <ants(at)cybertec(dot)at>, "'Greg Smith'" <greg(at)2ndquadrant(dot)com>
Cc: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>, "'PostgreSQL-development'" <pgsql-hackers(at)postgresql(dot)org>, "'KONDO Mitsumasa'" <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-17 11:54:57
Message-ID: 00a201ce82e4$76387950$62a96bf0$@kapila@huawei.com
Lists: pgsql-hackers

On Tuesday, July 16, 2013 10:16 PM Ants Aasma wrote:
> On Jul 14, 2013 9:46 PM, "Greg Smith" <greg(at)2ndquadrant(dot)com> wrote:
> > I updated and re-reviewed that in 2011:
> > http://www.postgresql.org/message-id/4D31AE64.3000202@2ndquadrant.com
> > and commented on why I think the improvement was difficult to reproduce
> > back then. The improvement didn't follow for me either. It would take
> > a really amazing bit of data to get me to believe write sorting code is
> > worthwhile after that. On large systems capable of dirtying enough
> > blocks to cause a problem, the operating system and RAID controllers
> > are already sorting blocks. And *that* sorting is also considering
> > concurrent read requests, which are a lot more important to an
> > efficient schedule than anything the checkpoint process knows about.
> > The database doesn't have nearly enough information yet to compete
> > against OS level sorting.
>
> That reasoning makes no sense. OS level sorting can only see the
> writes in the time window between the PostgreSQL write and the data
> being forced to disk. Spread checkpoints sprinkle the writes out over
> a long period, and the general tuning advice is to heavily bound the
> amount of memory the OS is willing to keep dirty. This makes the
> probability of scheduling adjacent writes together quite low, the
> merging window being limited either by dirty_bytes or
> dirty_expire_centisecs. The
> checkpointer has the best long term overview of the situation here, OS
> scheduling only has the short term view of outstanding read and write
> requests. By sorting checkpoint writes it is much more likely that
> adjacent blocks are visible to OS writeback at the same time and will
> be issued together.

I think Oracle also uses a similar concept to make writes efficient, and
they also hold a patent on this technology, which you can find at the link
below:
http://www.google.com/patents/US7194589?dq=645987&hl=en&sa=X&ei=kn7mUZ-PIsWqrAe99oDgBw&sqi=2&pjf=1&ved=0CEcQ6AEwAw

Although Oracle has a different concept for performing checkpoint writes, I
thought I would share the above link with you so that we do not unknowingly
go down the wrong path.

AFAIK, instead of depending on OS buffers, they use direct I/O, and in fact
in the patent above they use a temporary buffer (Claim 3) to sort the
writes, which, as far as I can understand from reading the above thread, is
not the same idea.
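
To make the idea under discussion concrete, here is a toy sketch (in
Python, only for illustration) of sorting checkpoint writes by file and
block before issuing them. It is neither PostgreSQL code nor the method
described in the patent; DirtyBlock, sort_checkpoint_writes and
count_write_runs are hypothetical names invented for this example. It
models only adjacency in the write order, not the time window in which
the OS can merge requests.

```python
from typing import List, NamedTuple


class DirtyBlock(NamedTuple):
    """Hypothetical tag for a dirty buffer: which file, which block."""
    file_id: int
    block_no: int


def sort_checkpoint_writes(dirty: List[DirtyBlock]) -> List[DirtyBlock]:
    # Build a new list ordered by (file, block), so that neighbouring
    # blocks of the same file end up next to each other in write order.
    return sorted(dirty, key=lambda b: (b.file_id, b.block_no))


def count_write_runs(order: List[DirtyBlock]) -> int:
    # A "run" is a maximal stretch of writes hitting consecutive blocks
    # of one file, i.e. something a lower layer could merge into one I/O.
    runs = 0
    prev = None
    for blk in order:
        adjacent = (
            prev is not None
            and blk.file_id == prev.file_id
            and blk.block_no == prev.block_no + 1
        )
        if not adjacent:
            runs += 1
        prev = blk
    return runs


# Assumed example: buffers in the order a scan of shared memory finds them.
unsorted_order = [
    DirtyBlock(2, 7), DirtyBlock(1, 3), DirtyBlock(2, 8),
    DirtyBlock(1, 1), DirtyBlock(1, 2), DirtyBlock(2, 6),
]
sorted_order = sort_checkpoint_writes(unsorted_order)

print(count_write_runs(unsorted_order), count_write_runs(sorted_order))
```

In this made-up input the unsorted order contains five separate runs,
while the sorted order contains two, one per file. Whether such a gain
shows up on real hardware is exactly what the thread is debating.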

With Regards,
Amit Kapila.
