Re: Spread checkpoint sync

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-01-15 14:25:40
Message-ID: 4D31AE64.3000202@2ndquadrant.com
Lists: pgsql-hackers

Robert Haas wrote:
> Idea #2: At the beginning of a checkpoint when we scan all the
> buffers, count the number of buffers that need to be synced for each
> relation. Use the same hashtable that we use for tracking pending
> fsync requests. Then, interleave the writes and the fsyncs...
>
> Idea #3: Stick with the idea of a fixed delay between fsyncs, but
> compute how many fsyncs you think you're ultimately going to need at
> the start of the checkpoint, and back up the target completion time by
> 3 s per fsync from the get-go, so that the checkpoint still finishes
> on schedule.
>

What I've been working on is something halfway between these two ideas.
I have a patch, and it doesn't work right yet because I just broke it,
but since I have some faint hope this will all come together any minute
now, I'm going to share it before someone announces a deadline has passed
or something. (whistling) I'm going to add this messy thing and the
patch you submitted upthread to the CF list; I'll review yours, I'll
either fix the remaining problem in this one myself or rewrite it to one
of your ideas, and then it's on to a round of benchmarking.

Once upon a time we got a patch from Itagaki Takahiro whose purpose was
to sort writes before sending them out:

http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't work repeatably for everyone because of the now
well-understood ext3 issues; I never replicated that speedup at the time,
for example. And this was before the spread checkpoint code went into 8.3.
The hope was that sorting wasn't really going to be necessary after that anyway.

Back to today...instead of something complicated, it struck me that if I
just had a count of exactly how many files were involved in each
checkpoint, that would be helpful. I could keep the idea of a fixed
delay between fsyncs, but just auto-tune that delay amount based on the
count. And how do you count the number of unique things in a list?
Well, you can always sort them. I thought that if the sorted writes
patch got back to functional again, it could serve two purposes. It
would group all of the writes for a file together, and if you did the
syncs in the same sorted order they would have the maximum odds of
discovering the data was already written. So rather than this possible
order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

which has very low odds of the sync on "a" finishing quickly, we'd get
this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been
written before the associated sync comes along.

Also, I could just traverse the sorted list with some simple logic to
count the number of unique files, and then set the delay between fsync
writes based on that. In the above, once the list is sorted, it's easy
to see how many times the table name changes on a linear scan of the
sorted data. 3 files, so if the checkpoint target gives me, say, a
minute of time to sync them, I can delay 20 seconds between each. Simple
math, and exactly the sort of tuning I used to get reasonable behavior on
the busy production system this all started on. There's some unresolved
trickiness in the segment-driven checkpoint case, but one thing at a time.

So I fixed the bitrot on the old sorted-writes patch, which was fun since
it predates the 8.3 changes. It seemed to work. I then moved the
structure it uses to hold the list of buffers to write, the thing that's
sorted, into shared memory. It has a predictable maximum size, relying
on palloc in the middle of the checkpoint code seems unwise, and there's
some potential gain from not reallocating it every time through.

Somewhere along the way, it started doing this instead of what I wanted:

BadArgument("!(((header->context) != ((void *)0) && (((((Node*)((header->context)))->type) == T_AllocSetContext))))", File: "mcxt.c", Line: 589)

(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used
up my window of time wandering down.

There's good bits in the patch I submitted for the last CF and in the
patch you wrote earlier this week. This unfinished patch may be a
valuable idea to fit in there too once I fix it, or maybe it's
fundamentally flawed and one of the other ideas you suggested (or I have
sitting on the potential design list) will work better. There's a patch
integration problem that needs to be solved here, but I think almost all
the individual pieces are available. I'd hate to see this fail to get
integrated now just for lack of time, considering the problem is so
serious when you run into it.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment Content-Type Size
new-sorted-writes-v1.patch text/x-patch 8.2 KB
