Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-23 02:53:14
Message-ID: 51EDF01A.4050006@2ndQuadrant.com
Lists: pgsql-hackers

On 7/22/13 4:52 AM, KONDO Mitsumasa wrote:
> The writeback source code I indicated is almost the same as in the
> community kernel (2.6.32.61). I also read Linux kernel 3.9.7, but this
> part is almost the same there.

The main source code difference comes from going back to the RedHat 5
kernel, which means 2.6.18. For many of these versions, you are right
that it is only the tuning parameters that were changed in newer versions.

Optimizing performance for the old RHEL5 kernel isn't the most important
thing, but it's helpful to know the things it does very badly.

> My fsync patch only sleeps after fsync returns success, and the maximum
> sleep time is set to 10 seconds. It does not make this problem worse.

It's easy to have hundreds of relations that are getting fsync calls
during a checkpoint. If you have 100 relations getting a 10 second
sleep each, you could potentially delay checkpoints by 17 minutes this
way. I regularly see systems where shared_buffers=8GB and there are 200
to 400 relation segments that need a sync during a checkpoint.
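
To make the worst case concrete, here is a back-of-envelope model of that
delay. The 10 second cap is what I understand your patch uses; the segment
counts are the examples I quoted above, and none of this is taken from your
actual code:

/*
 * Worst-case extra checkpoint time if the checkpointer sleeps after
 * every successful fsync, with the sleep capped at 10 seconds.  The
 * segment counts are the examples from this mail, not measurements.
 */
#include <stdio.h>

int
main(void)
{
    const double sleep_cap_sec = 10.0;          /* assumed per-file sleep cap */
    const int    segments[] = {100, 200, 400};  /* segments needing fsync */

    for (int i = 0; i < 3; i++)
    {
        double  extra = segments[i] * sleep_cap_sec;

        printf("%3d segments -> up to %.0f s (%.1f min) added per checkpoint\n",
               segments[i], extra, extra / 60.0);
    }
    return 0;
}

Compiled and run, that prints roughly 17, 33, and 67 minutes of added time.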

This is the biggest problem with your submission. Once you give up
following the checkpoint schedule carefully, it is very easy to end up
with large checkpoint deadline misses on production servers. If someone
thinks they are doing a checkpoint every 5 minutes, but your patch makes
them take 20 minutes instead, that is bad. They will not expect that a
crash might have to replay that much activity before the server is
useful again.

>> You also don't seem afraid of how exceeding the
>> checkpoint timeout is a very bad thing yet.
> I think it is important to understand why this problem was caused. We
> should try to find which program has the bug or problem.

The checkpointer process is the problem. There's no filesystem bug or
complicated issue involved in many of the bad cases. Here is a simple
example that shows how the toughest problem cases happen (the arithmetic
is also sketched in code below):

-64GB of RAM
-10% dirty_background_ratio = 6GB of dirty writes = 6144MB
-2MB/s random I/O when concurrent reads are heavy
-3072 seconds to clear the cache = 51 minutes
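
The same arithmetic as a tiny sketch, using only the example figures above
(the 2MB/s rate is illustrative, not a measured constant):

/*
 * How long the kernel takes to drain the dirty page cache at the
 * example rate above: 6144MB of dirty data at 2MB/s of random I/O.
 */
#include <stdio.h>

int
main(void)
{
    const double dirty_mb = 6144.0;          /* ~10% dirty_background_ratio */
    const double write_mb_per_sec = 2.0;     /* random I/O under heavy reads */
    double       drain_sec = dirty_mb / write_mb_per_sec;

    printf("%.0f MB / %.1f MB/s = %.0f s (~%.0f minutes)\n",
           dirty_mb, write_mb_per_sec, drain_sec, drain_sec / 60.0);
    return 0;
}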

That's how you get to an example like the one in my slides:

LOG: checkpoint complete: wrote 33282 buffers (3.2%); 0 transaction log
file(s) added, 60 removed, 129 recycled; write=228.848 s, sync=4628.879
s, total=4858.859 s

That checkpoint took 81 minutes, with 77 of them in the sync phase. It's
very hard to do better on these, and I don't expect any change to help this
a lot. But I don't want to see a change committed that makes this sort of
checkpoint 17 minutes longer when there are 100 relations involved either.

> My patch not only improves throughput but also achieves stable response
> times during the fsync phase of the checkpoint.

The main reason your patch improves latency and throughput is that it
makes checkpoints farther apart. That's why I drew you a graph showing
how the time between checkpoints lined up perfectly with TPS. If it was
only a small problem it would be worth considering, but I think it's
likely to end up with the >15 minute delays I've outlined here instead.

> And I surveyed the ext3 file system.

I wouldn't worry too much about the problems ext3 has. Like the old
RHEL5 kernel I was commenting about above, there are a lot of ext3
systems out there. But we can't do a lot about getting good performance
from them. It's only important to test that you're not making them a
lot worse with a change.

> My system block size is 4096, but 8192 or more seems better. It will
> decrease the number of inodes and give larger sequential areas on disk.

I normally increase read-ahead on Linux systems to get better sequential
disk throughput. Changing the block size might work better in some cases,
but not many people are willing to do that. Read-ahead is very easy to
change at any time.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
