Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-06-16 14:27:56
Message-ID: 51BDCB6C.1090507@vmware.com
Lists: pgsql-hackers

On 10.06.2013 13:51, KONDO Mitsumasa wrote:
> I have created a patch that improves the checkpoint IO scheduler for
> stable transaction responses.
>
> * Problem in checkpoint IO schedule in heavy transaction case
> Under a heavy transaction load, I think the PostgreSQL checkpoint
> scheduler has two problems, at the start and at the end of a checkpoint.
> The first problem is heavy IO at the start of each checkpoint. It is
> caused by full-page writes, which generate extra WAL IO for the first
> write to each page after the checkpoint. Therefore, at the start of a
> checkpoint, the WAL-based checkpoint scheduler wrongly judges that the
> checkpoint is behind schedule because of the full-page writes, even
> though it is not late. This causes bad transaction response. I think
> the WAL-based checkpoint scheduler does not behave properly at the
> start of a checkpoint.

Yeah, the checkpoint scheduling logic doesn't take into account the
heavy WAL activity caused by full page images. That's an interesting
phenomenon, but did you actually see that causing a problem in your
tests? I couldn't tell from the results you posted what the impact of
that was. Could you repeat the tests separately with the two separate
patches you posted later in this thread?
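
As a rough illustration of the effect under discussion, here is a simplified model of a WAL- and time-based "on schedule" check. It is only a sketch: the function and parameter names are made up for this example, and it is not the actual checkpointer code. It shows how a burst of WAL from full-page images can make a checkpoint look late even though little time has passed.

```python
def is_on_schedule(progress, completion_target, wal_fraction, time_fraction):
    """Return True if the checkpoint may nap before writing more buffers.

    progress          -- fraction of dirty buffers already written (0..1)
    completion_target -- fraction of the interval in which writes should finish
    wal_fraction      -- WAL consumed since checkpoint start / WAL budget
    time_fraction     -- time elapsed since checkpoint start / timeout
    (All names are hypothetical; this is a simplified model.)
    """
    scaled = progress * completion_target
    # The checkpoint counts as late if either yardstick (WAL volume or
    # wall-clock time) has advanced further than the scaled write progress.
    return scaled >= wal_fraction and scaled >= time_fraction


# Early in a checkpoint: 10% of buffers written, 4% of the interval used.
steady = is_on_schedule(0.10, 0.5, wal_fraction=0.04, time_fraction=0.04)

# Same moment, but full-page images have already eaten 20% of the WAL budget.
burst = is_on_schedule(0.10, 0.5, wal_fraction=0.20, time_fraction=0.04)

print(steady, burst)
```

In the second case the model reports the checkpoint as behind schedule purely because of WAL volume, so the writer would stop throttling and write faster.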

Rationalizing a bit, I could even argue to myself that it's a *good*
thing. At the beginning of a checkpoint, the OS write cache should be
relatively empty, as the checkpointer hasn't done any writes yet. So it
might make sense to write a burst of pages at the beginning, to
partially fill the write cache first, before starting to throttle. But
this is just handwaving - I have no idea what the effect is in real life.

Another thought is that rather than trying to compensate for that effect
in the checkpoint scheduler, could we avoid the sudden rush of full-page
images in the first place? The current rule for when to write a full
page image is conservative: you don't actually need to write a full page
image when you modify a buffer that's sitting in the buffer cache, if
that buffer hasn't been flushed to disk by the checkpointer yet, because
the checkpointer will write and fsync it later. I'm not sure how much it
would smooth out WAL write I/O, but it would be interesting to try.
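
To make the two rules concrete, here is a hedged sketch of them as predicates. The names (`page_lsn`, `redo_ptr`, `awaiting_checkpoint_write`) are illustrative assumptions, LSNs are modelled as plain integers, and the relaxed rule is only the idea floated above; it has not been checked for crash-safety.

```python
def needs_fpi_current(page_lsn, redo_ptr):
    # Conservative rule: the page has not been WAL-logged since the
    # checkpoint's redo point, so its first change gets a full image.
    return page_lsn <= redo_ptr


def needs_fpi_relaxed(page_lsn, redo_ptr, awaiting_checkpoint_write):
    # Idea from the text (an untested assumption): if the buffer is still
    # in the buffer cache waiting for the checkpointer to write and fsync
    # it, the full image could be skipped.
    return needs_fpi_current(page_lsn, redo_ptr) and not awaiting_checkpoint_write


# A page last touched before the redo point, still queued for the checkpointer.
print(needs_fpi_current(90, 100))
print(needs_fpi_relaxed(90, 100, awaiting_checkpoint_write=True))
```

Under the relaxed predicate, full-page images would only start appearing for a page once the checkpointer had flushed it, which would spread them over the duration of the checkpoint instead of clustering them at its start.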

> The second problem is the fsync freeze at the end of a checkpoint.
> Normally, checkpoint writes are executed in the background by the OS's
> IO scheduler. But when that does not work correctly, the fsync at the
> end of the checkpoint causes an IO freeze and slower transactions.
> Unexpectedly slow transactions cause monitoring errors in an HA cluster
> and degrade the user experience of the application service. It is an
> especially serious problem for database systems on cloud and virtual
> servers, which do not have much IO performance. However, the
> postgresql.conf parameters do not offer much of a solution. We would
> rather have fast transaction responses than a short checkpoint time.
> In fact the checkpoint time is short, and it is not a problem if it
> becomes a little longer. You may think that checkpoint_segments and
> checkpoint_timeout should be set to larger values; however, a large
> checkpoint_segments wastes file cache on data that is never read, and
> a large checkpoint_timeout leads to a long crash recovery.

A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because
Tom couldn't reproduce the numbers, and because sorting requires
allocating a large array, which has the risk of running out of memory,
which would be bad when you're trying to checkpoint.

Apart from the direct performance impact of that patch, sorting the
writes would allow us to interleave the fsyncs with the writes. You
would write out all buffers for relation A, then fsync it, then all
buffers for relation B, then fsync it, and so forth. That would
naturally spread out the fsyncs.
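
A minimal sketch of that ordering, assuming dirty buffers can be represented as (relation, block) pairs and that `write` and `fsync` are caller-supplied callbacks (both hypothetical stand-ins for the real I/O calls):

```python
from itertools import groupby
from operator import itemgetter


def checkpoint_sorted(dirty_buffers, write, fsync):
    """Write dirty buffers grouped by relation, fsyncing after each group.

    dirty_buffers -- iterable of (relation, block) pairs, in any order
    """
    # Sorting needs a copy of the whole list; this is the large
    # allocation mentioned above.
    ordered = sorted(dirty_buffers)
    for relation, group in groupby(ordered, key=itemgetter(0)):
        for _, block in group:
            write(relation, block)
        # All buffers of this relation are out, so fsync it right away
        # instead of saving every fsync for the end of the checkpoint.
        fsync(relation)


log = []
checkpoint_sorted(
    [("B", 7), ("A", 3), ("B", 1), ("A", 0)],
    write=lambda rel, blk: log.append(("write", rel, blk)),
    fsync=lambda rel: log.append(("fsync", rel)),
)
print(log)
```

The recorded operations show relation A being written and fsynced before any block of relation B is touched.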

If we don't mind scanning the buffer cache several times, we don't
necessarily even need to sort the writes for that. Just scan the buffer
cache for all buffers belonging to relation A, then fsync it. Then scan
the buffer cache again, for all buffers belonging to relation B, then
fsync that, and so forth.
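
The multi-scan variant could look roughly like this. Again a sketch under assumptions: the buffer pool is modelled as a list of (relation, block, is_dirty) tuples, and `write` and `fsync` are hypothetical callbacks.

```python
def checkpoint_multiscan(buffer_pool, write, fsync):
    """Write dirty buffers one relation at a time without sorting them.

    buffer_pool -- list of (relation, block, is_dirty) tuples
    """
    # One pass to learn which relations have dirty buffers. This set is
    # small compared with an array holding every dirty buffer.
    relations = sorted({rel for rel, _, dirty in buffer_pool if dirty})
    for target in relations:
        # Rescan the whole pool for this relation only.
        for rel, block, dirty in buffer_pool:
            if dirty and rel == target:
                write(rel, block)
        fsync(target)


ops = []
pool = [("B", 7, True), ("A", 3, True), ("B", 1, False), ("A", 0, True)]
checkpoint_multiscan(
    pool,
    write=lambda rel, blk: ops.append(("write", rel, blk)),
    fsync=lambda rel: ops.append(("fsync", rel)),
)
print(ops)
```

The trade-off is one full scan of the pool per relation in exchange for not allocating a sort array; blocks within a relation are written in buffer-pool order rather than block order.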

> The drawback of my patch is a longer checkpoint. Checkpoint time
> increased by about 10% - 20%. But it still completes on schedule within
> checkpoint_timeout. Please see checkpoint result (http://goo.gl/NsbC6).

For a fair comparison, you should increase the
checkpoint_completion_target of the unpatched test, so that the
checkpoints run for roughly the same amount of time with and without the
patch. Otherwise the benefit you're seeing could be just because of a
lazier checkpoint.
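
As a back-of-the-envelope way to pick that setting, the sketch below scales the target by the observed slowdown. It assumes the write phase lasts roughly in proportion to checkpoint_completion_target and uses 0.5 as the unpatched baseline; both are assumptions for illustration, and the helper name is made up.

```python
def matched_completion_target(base_target, slowdown):
    """Scale checkpoint_completion_target so the unpatched run takes about
    as long as the patched one. The setting cannot usefully exceed 1.0."""
    return min(1.0, base_target * slowdown)


# The patched checkpoints were reported to take about 10% - 20% longer.
low = matched_completion_target(0.5, 1.10)
high = matched_completion_target(0.5, 1.20)
print(round(low, 3), round(high, 3))
```

With those assumptions, the unpatched test would be rerun with a target somewhere between the two printed values.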

- Heikki
