Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: james(at)mansionfamily(dot)plus(dot)com
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-14 22:46:41
Message-ID: 51E32A51.7080309@2ndQuadrant.com
Lists: pgsql-hackers

On 7/14/13 5:28 PM, james wrote:
> Some random seeks during sync can't be helped, but if they are done when
> we aren't waiting for sync completion then they are in effect free.

That happens sometimes, but if you measure you'll find this doesn't
actually occur usefully in the situation everyone dislikes. In a
write-heavy environment where the database doesn't fit in RAM, backends and/or
the background writer are constantly writing data out to the OS. WAL is
going out constantly as well, and in many cases that's competing for the
disks too. The most popular blocks in the database get high usage
counts and they never leave shared_buffers except at checkpoint time.
That's easy to prove to yourself with pg_buffercache.
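
To illustrate why popular blocks stay resident, here is a toy clock-sweep
model. It is only a sketch: ClockPool, its fields, and the workload are
hypothetical names made up for this example, not PostgreSQL code. The one
detail taken from PostgreSQL is that usage counts are capped at 5. A block
that is touched between every cold read keeps a nonzero usage count, so the
sweep never picks it as a victim.

```python
MAX_USAGE = 5  # PostgreSQL caps buffer usage counts at 5


class ClockPool:
    """Toy clock-sweep buffer pool (hypothetical, not PostgreSQL code)."""

    def __init__(self, size):
        self.blocks = [None] * size   # block id held by each slot
        self.usage = [0] * size       # usage count of each slot
        self.hand = 0                 # next slot the sweep will look at
        self.evicted = []             # block ids evicted so far, in order

    def access(self, block):
        # Hit: bump the usage count, up to the cap.
        if block in self.blocks:
            slot = self.blocks.index(block)
            self.usage[slot] = min(self.usage[slot] + 1, MAX_USAGE)
            return
        # Miss: sweep, decrementing usage counts, until a slot is
        # empty or has a usage count of zero.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.blocks)
            if self.blocks[slot] is None or self.usage[slot] == 0:
                if self.blocks[slot] is not None:
                    self.evicted.append(self.blocks[slot])
                self.blocks[slot] = block
                self.usage[slot] = 1
                return
            self.usage[slot] -= 1


# One hot block touched between every access to a never-repeated cold block.
pool = ClockPool(size=4)
for cold in range(100):
    pool.access("hot")
    pool.access(cold)

print("hot" in pool.blocks, "hot" in pool.evicted)
```

In this model the hot block is still in the pool at the end and never shows
up in the eviction list, while the cold blocks cycle through the other slots.
On a real server, pg_buffercache is the way to check this.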

And once the write cache fills, every I/O operation is now competing.
There is nothing happening for free. You're stealing I/O from something
else any time you force a write out. The optimal throughput path for
checkpoints turns out to be delaying every single bit of I/O as long as
possible, in favor of the [backend|bgwriter] writes and WAL. Whenever
you delay a buffer write, you increase the chance that someone else
will dirty the same block again first, so that a single physical write
covers both changes. And the buffers being
written by the checkpointer are, on average, the most popular ones in
the database. Writing any of them to disk pre-emptively has high odds
of writing the same block more than once per checkpoint. And that
easy-to-measure waste--it shows up as more writes/transaction in
pg_stat_bgwriter--hurts throughput more than any reduction in seek
overhead you might otherwise get from early writes. The big gain isn't
chasing after cheap seeks. The best path is the one that decreases the
total volume of writes.
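
The write-volume argument can be sketched with a toy count. Everything here
is an assumption for illustration: count_writes, flush_every, and the event
list are hypothetical, and the model ignores seeks, WAL, and the OS cache
entirely. It only counts how many physical block writes one checkpoint
cycle needs when dirty blocks are flushed early versus once at the end.

```python
def count_writes(dirty_events, flush_every):
    """Count physical block writes for one checkpoint cycle.

    dirty_events: block ids in the order they are dirtied.
    flush_every:  write out all currently dirty blocks after every
                  N dirtying events; whatever is still dirty at the
                  end of the cycle is written once more.
    """
    dirty = set()
    writes = 0
    for i, block in enumerate(dirty_events, start=1):
        dirty.add(block)
        if i % flush_every == 0:
            writes += len(dirty)   # early flush of everything dirty
            dirty.clear()
    writes += len(dirty)           # final flush at checkpoint end
    return writes


# Block 0 is popular and gets dirtied again and again; 1..4 are touched once.
events = [0, 1, 0, 2, 0, 3, 0, 4]

eager = count_writes(events, flush_every=2)           # write early and often
lazy = count_writes(events, flush_every=len(events))  # delay to the end

print(eager, lazy)
```

With this event list the eager policy writes block 0 four times and the lazy
one writes it once, so the totals are 8 writes versus 5 for the same set of
changes.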

We played this game with the background writer work for 8.3. The main
reason the committed version improved on the original design is that it
completely eliminated doing work on popular buffers in advance.
Everything happens at the last possible time, which is the optimal
throughput situation. The 8.1/8.2 BGW used to try and write things out
before they were strictly necessary, in hopes that that I/O would be
free. But it rarely was, while there was always a cost to forcing them
to disk early. And that cost is highest when you're talking about the
higher usage blocks the checkpointer tends to write. When in doubt,
always delay the write in hopes it will be written to again and you'll
save work.

> So it occurs to me that perhaps we can watch for patterns where we have
> groups of adjacent writes that might stream, and when they form we might
> schedule them...

Stop here. I mentioned something upthread that is worth repeating.

The checkpointer doesn't know what concurrent reads are happening. We
can't even easily make it know, not without adding a whole new source of
IPC and locking contention among clients.

Whatever scheduling decision the checkpointer might make with its
limited knowledge of system I/O is going to be poor. You might find a
100% write benchmark that it helps, but those are not representative of
the real world. In any mixed read/write case, the operating system is
likely to do better. That's why things like sorting blocks sometimes
seem to help someone, somewhere, with one workload, but then aren't
repeatable.

We can decide to trade throughput for latency by nudging the OS to deal
with its queued writes more regularly. That will result in more total
writes, which is the reason throughput drops.

But the idea that PostgreSQL is going to do a better global job of I/O
scheduling, that road is a hard one to walk. It's only going to happen
if we pull all of the I/O into the database *and* do a better job on the
entire process than the existing OS kernel does. That sort of dream, of
outperforming the filesystem, is very difficult to realize. There's
a good reason that companies like Oracle stopped pushing so hard on
recommending raw partitions.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
