Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-04 13:05:55
Message-ID: 20130704130555.GA1403@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-07-04 21:28:11 +0900, KONDO Mitsumasa wrote:
> >That would move all the vm and fsm forks to separate directories,
> >which would cut down the number of files in the main-fork directory
> >significantly. That might be worth doing independently of the issue
> >you're raising here. For large clusters, you'd even want one more
> >level to keep the directories from getting too big:
> >
> >base/${DBOID}/${FORK}/${X}/${RELFILENODE}
> >
> >...where ${X} is two hex digits, maybe just the low 8 bits of the
> >relfilenode number. But this would be not as good for small clusters
> >where you'd end up with oodles of little-tiny directories, and I'm not
> >sure it'd be practical to smoothly fail over from one system to the
> >other.
> That seems like a good idea! In general, users never look at the base
> directory, so it should be arranged for performance and to suit
> large databases.
>
> > Presumably the smaller segsize is better because we don't
> > completely stall the system by submitting up to 1GB of io at once. So,
> > if we were to do it in 32MB chunks and then do a final fsync()
> > afterwards we might get most of the benefits.
> Yes, I will try testing with './configure --with-segsize=0.03125' tonight
> and send you the results tomorrow.

I don't like going in this direction at all:
1) It breaks pg_upgrade, which means many of the bigger users won't be
able to migrate to this, and most packagers would carry the old
segsize around forever.
Even if we could get pg_upgrade to split files accordingly, link mode
would still be broken.
2) It drastically increases the number of file handles necessary and, by
extension, the number of open/close calls. Those aren't all
that cheap. It also increases metadata traffic, since mtime/atime are
kept for more files, and file creation is rather expensive since it
requires a metadata transaction at the filesystem level.
3) It breaks readahead since that usually only works within a single
file. I am pretty sure that this will significantly slow down
uncached sequential reads on larger tables.
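
To put a rough number on point 2, here is a minimal sketch of how the
file count grows as the segment size shrinks. The function name
segment_count is made up for illustration and is not PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of segment files needed to hold rel_bytes bytes when each
 * segment holds at most seg_bytes bytes (hypothetical helper).
 */
static uint64_t
segment_count(uint64_t rel_bytes, uint64_t seg_bytes)
{
    assert(seg_bytes > 0);
    return (rel_bytes + seg_bytes - 1) / seg_bytes;   /* round up */
}
```

With this arithmetic, a 1TB relation needs 1024 files at the default 1GB
segsize but 32768 files at 32MB, i.e. 32 times as many files to open,
close and track.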

> (2013/07/03 22:39), Andres Freund wrote:> On 2013-07-03 17:18:29 +0900
> > Hm. I wonder how much of this could be gained by doing a
> > sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
> > the original checkpoint-pass through the buffers or when fsyncing the
> > files.
> The sync_file_range system call is interesting, but it is supported only
> by Linux kernel 2.6.22 or later. For PostgreSQL, Robert's idea suits
> better, since it does not depend on the kind of OS.

Well. But it can be implemented without breaking things... Even if we
don't have sync_file_range() we can cope by simply doing fsync()s more
frequently. For every open file, keep track of the number of buffers
dirtied, and every 32MB or so issue an fdatasync()/fsync().
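
Roughly, the bookkeeping could look like the sketch below. The struct
and function names are hypothetical, it counts bytes rather than
buffers, and it uses plain fsync() for portability (fdatasync() could be
substituted where available). It is not how the checkpointer is
currently structured.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* "32MB or so"; the exact threshold is an assumption */
#define FLUSH_THRESHOLD_BYTES ((size_t) 32 * 1024 * 1024)

/* Hypothetical per-file state, one per open file being checkpointed. */
typedef struct FlushTracker
{
    int     fd;         /* open file descriptor */
    size_t  unflushed;  /* bytes written since the last successful sync */
    int     flushes;    /* number of syncs issued so far */
} FlushTracker;

/*
 * Record that nbytes were written to t->fd and sync once the threshold
 * is reached. Returns 0 on success, -1 if fsync() failed (errno is set
 * and the unflushed counter is left untouched).
 */
static int
tracker_note_write(FlushTracker *t, size_t nbytes)
{
    assert(t != NULL);

    t->unflushed += nbytes;
    if (t->unflushed < FLUSH_THRESHOLD_BYTES)
        return 0;               /* not enough dirty data yet */

    if (fsync(t->fd) != 0)
        return -1;

    t->unflushed = 0;
    t->flushes++;
    return 0;
}
```

The caller would still issue one final fsync() per file at the end of
the checkpoint; the point is only that no single sync has to push out
more than about 32MB.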

> I think the best way to write buffers in a checkpoint is sorted by the
> buffer's FD and block number, with a small segsize setting and suitable
> sleep times for each. That would give a genuinely sorted checkpoint with
> sequential disk writes!

That would make regular fdatasync()ing even easier.
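
For illustration only, sorting by file and then by block number might be
sketched as follows. DirtyBuf and its fields are invented for this
example and do not correspond to actual buffer-manager structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical description of one dirty buffer to be written. */
typedef struct DirtyBuf
{
    uint32_t file_id;   /* stand-in for the segment file / FD */
    uint32_t block_no;  /* block number within that file */
} DirtyBuf;

/* qsort comparator: order by file first, then by block number. */
static int
dirty_buf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    if (x->file_id != y->file_id)
        return x->file_id < y->file_id ? -1 : 1;
    if (x->block_no != y->block_no)
        return x->block_no < y->block_no ? -1 : 1;
    return 0;
}

/* Sort the write list so each file's blocks are contiguous and ascending. */
static void
sort_dirty_bufs(DirtyBuf *bufs, size_t n)
{
    assert(bufs != NULL || n == 0);
    if (n > 1)
        qsort(bufs, n, sizeof(DirtyBuf), dirty_buf_cmp);
}
```

Once the list is ordered like this, all writes to one file are issued
back to back, so the periodic sync for that file can be issued right
after its run of blocks.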

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
