Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-01-15 15:31:05
Message-ID: 4D31BDB9.9010602@2ndquadrant.com
Lists: pgsql-hackers

Robert Haas wrote:
> I'll believe it when I see it. How about this:
>
> a 1
> a 2
> sync a
> b 1
> b 2
> sync b
> c 1
> c 2
> sync c
>
> Or maybe some variant, where we become willing to fsync a file a
> certain number of seconds after writing the last block, or when all
> the writes are done, whichever comes first.

That's going to give worse performance than the current code in some
cases. The goal of what's in there now is that you get a sequence like
this:

a1
b1
a2
[Filesystem writes a1]
b2
[Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so
large that it becomes lazier about writing things. The fundamental
idea--push writes out some time before the sync, in hopes the filesystem
will get to them before then--is not unsound. On some systems,
doing the sync more aggressively than that will be a regression. This
approach just breaks down in some cases, and those cases are happening
more now because their likelihood scales with total RAM. I don't want
to screw the people with smaller systems, who may be getting
considerable benefit from the existing sequence. Today's little
systems--which are very similar to the high-end ones the spread
checkpoint stuff was developed on during 8.3--do get some benefit from
it as far as I know.
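
To make the difference between the two orderings concrete, here is a toy
model. It is only a sketch: simulate(), writeback_lag and the event tuples
are names invented for this illustration, and real kernel writeback is far
less predictable than "flush anything older than N events".

```python
from collections import deque

def simulate(events, writeback_lag=2):
    """Toy model of OS write caching (an assumption, not kernel behavior).

    events: list of ("write", file) or ("sync", file) tuples.
    A dirty block is flushed in the background once `writeback_lag`
    later events have passed. A sync must flush whatever is still
    dirty for its file. Returns {file: blocks the sync had to write}.
    """
    dirty = deque()   # (file, index of the event that wrote it)
    forced = {}
    for i, (op, name) in enumerate(events):
        if op == "write":
            dirty.append((name, i))
        else:
            # The sync pays for every block of this file still in cache.
            forced[name] = sum(1 for f, _ in dirty if f == name)
            dirty = deque(d for d in dirty if d[0] != name)
        # Background writeback: oldest blocks go first.
        while dirty and i - dirty[0][1] >= writeback_lag:
            dirty.popleft()
    return forced

# Existing ordering: writes interleaved, syncs deferred to the end.
interleaved = [("write", "a"), ("write", "b"), ("write", "a"),
               ("write", "b"), ("sync", "a"), ("sync", "b")]

# Proposed ordering: write a file, then sync it right away.
per_file = [("write", "a"), ("write", "a"), ("sync", "a"),
            ("write", "b"), ("write", "b"), ("sync", "b"),
            ("write", "c"), ("write", "c"), ("sync", "c")]

print(simulate(interleaved))                    # {'a': 1, 'b': 1}
print(simulate(per_file))                       # {'a': 2, 'b': 2, 'c': 2}
# A big, lazy cache erases the advantage of the existing ordering:
print(simulate(interleaved, writeback_lag=10))  # {'a': 2, 'b': 2}
```

With a short lag, each deferred sync only has to write the last block; with
a long lag (the large-RAM case), every sync pays for everything.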

Anyway, now that the ability to get logging on all this stuff went in
during the last CF, it's way easier to just set up a random system to run
tests in this area than it used to be. Whatever testing does happen
should include, say, a 2GB laptop with a single hard drive in it. I
think that's the low end of what is reasonable to consider as a
target for tweaking write performance, given the hardware 9.1 is likely
to be deployed on.

> How does the checkpoint target give you any time to sync them? Unless
> you squeeze the writes together more tightly, but that seems sketchy.
>

Obviously the checkpoint target idea needs to be shuffled around some
too. I was thinking of making the new default 0.8, and having it split
the time in half for write and sync. That will make the write phase
close to the speed people are seeing now, at the default of 0.5, while
giving some window for spread sync too. The exact way to redistribute
that around I'm not so concerned about yet. When I get to where that's
the most uncertain thing left I'll benchmark the TPS vs. latency
trade-off and see what happens. If the rest of the code is good enough
but this just needs to be tweaked, that's a perfect thing to get beta
feedback to finalize.
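
For a feel of the numbers, here is the arithmetic as a small sketch. It
assumes checkpoints are driven by the default checkpoint_timeout of 5
minutes and that the target is split evenly; phase_budgets() and its
sync_share parameter are made up for this illustration, and the real split
is exactly the part that is not settled yet.

```python
def phase_budgets(checkpoint_timeout_s, completion_target, sync_share):
    """Return (write_seconds, sync_seconds) for one checkpoint.

    sync_share is the hypothetical fraction of the spread window
    reserved for the sync phase; 0.0 models today's behavior, where
    the whole window goes to writes.
    """
    window = checkpoint_timeout_s * completion_target
    sync_s = window * sync_share
    return window - sync_s, sync_s

# Today: target 0.5, nothing reserved for sync.
current = phase_budgets(300, 0.5, 0.0)    # 150 s of writes, 0 s of sync

# Idea under discussion: target 0.8, split in half.
proposed = phase_budgets(300, 0.8, 0.5)   # about 120 s each

print(current, proposed)
```

Under those assumptions the write phase shrinks from 150 seconds to about
120, which is the "close to the speed people are seeing now" part, and the
sync phase gets a comparable window of its own.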

> Well you don't have to put it in shared memory on account of any of
> that. You can just hang it on a global variable.
>

Hmm. Because it's so similar to other things being allocated in shared
memory, I just automatically pushed it over to there. But you're right;
it doesn't need to be that complicated. Nobody is touching it but the
background writer.

> If we can find something that's a modest improvement on the
> status quo and we can be confident in quickly, good, but I'd rather
> have 9.1 go out the door on time without fully fixing this than delay
> the release.
>

I'm not somebody who needs to be convinced of that. There are two
near-commit-quality pieces of this out there now:

1) Keep some BGW cleaning and fsync absorption going while sync is
happening, rather than starting it and ignoring everything else until
it's done.

2) Compact fsync requests when the queue fills.

If that's all we can get for 9.1, it will still be a major improvement.
I realize I only have a very short period of time to complete a major
integration breakthrough on the pieces floating around before the goal
here has to drop to something less ambitious. I head to the West Coast
for a week on the 23rd; I'll be forced to throw in the towel at that
point if I can't get the better ideas we have in pieces here all
assembled well by then.
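
As a rough picture of what item 2 above means, here is a sketch. It is not
the patch: compact_requests(), enqueue() and the string request names are
invented for illustration, and the assumption is simply that duplicate
requests for the same file can be collapsed into one.

```python
def compact_requests(queue):
    """Drop duplicate fsync requests, keeping the first of each in order."""
    return list(dict.fromkeys(queue))

def enqueue(queue, request, capacity):
    """Add a request to a bounded queue, compacting only when it is full.

    Returns False if the queue is still full after compaction, in which
    case the caller has to deal with the fsync some other way.
    """
    if len(queue) >= capacity:
        queue[:] = compact_requests(queue)
    if len(queue) >= capacity:
        return False
    queue.append(request)
    return True

# A full queue dominated by repeats frees up room once compacted.
q = ["rel_a.1", "rel_b.1", "rel_a.1", "rel_a.1"]
print(enqueue(q, "rel_c.1", capacity=4), q)
# True ['rel_a.1', 'rel_b.1', 'rel_c.1']
```

Compacting only when the queue is full keeps the common path cheap; the
deduplication cost is paid only when the alternative is failing to queue
the request at all.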

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
