Quick Links

Re: Instrument checkpoint sync calls

Lists:	pgsql-hackers

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Instrument checkpoint sync calls
Date:	2010-11-14 21:37:58
Message-ID:	4CE056B6.4050901@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Attached patch adds some logging for each individual fsync call made
during a checkpoint, along with a summary at the end. You need to have
the following to see all of the detail:

log_checkpoints=on
log_min_messages=debug1

And here's a sample:

LOG: checkpoint starting: immediate force wait
DEBUG: checkpoint sync: file=1 time=1.946000 msec
DEBUG: checkpoint sync: file=2 time=0.666000 msec
DEBUG: checkpoint sync: file=3 time=0.004000 msec
LOG: checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec
LOG: checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s,
total=0.003 s

I think that it's reasonable for the sort of people who turn
log_checkpoints on to also get the sync summary line, thus it being
logged at LOG level. The detail on individual syncs might go to DEBUG2
or lower instead of DEBUG1 where I put it, that part I don't have a
strong opinion on. It's at DEBUG1 to make testing the patch without a
gigantic amount of log data also coming in easier.

Right now the code is written such that all the calls that grab timing
information are wrapped around "ifdef DEBUG_FSYNC", which is a variable
set to 1 that could be a compile-time option like DEBUG_DEADLOCK, to
allow turning this code path off at build time. I personally think that
if you're already making an fsync call and have log_checkpoints on, the
additional overhead of also timing that fsync is minimal even on
platforms where timing is slow (I don't have such a system to test that
assumption however). And I've seen enough value in troubleshooting
nasty checkpoint sync problems using this patch to feel it's worth
having even if it does add some overhead.

I'm a little concerned about log_checkpoints changing on me in the
middle of the execution of a checkpoint, which would cause some problems
here. Not sure if that's something I should explicitly code for, given
that all I think it will do is screw up one of the timing results. It
does seem a risk from the last minute self-review I just did of the code.

I'll give a sample program that stresses the system, generating slow
timing results and other types of bad behavior, along with the next
patch I submit here shortly.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

Attachment	Content-Type	Size
log-sync-v2.patch	text/x-patch	2.5 KB

From:	Magnus Hagander <magnus(at)hagander(dot)net>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-14 22:08:19
Message-ID:	AANLkTi=prse2HvBnaj5Amzf9zZq7izNPs0APwx1MP-5z@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Nov 14, 2010 at 22:37, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Attached patch adds some logging for each individual fsync call made during
> a checkpoint, along with a summary at the end. You need to have the
> following to see all of the detail:
>
> log_checkpoints=on
> log_min_messages=debug1
>
> And here's a sample:
>
> LOG: checkpoint starting: immediate force wait
> DEBUG: checkpoint sync: file=1 time=1.946000 msec
> DEBUG: checkpoint sync: file=2 time=0.666000 msec
> DEBUG: checkpoint sync: file=3 time=0.004000 msec
> LOG: checkpoint sync: files=3 longest=1.946000 msec average=0.872000 msec
> LOG: checkpoint complete: wrote 3 buffers (0.1%); 0 transaction log file(s)
> added, 0 removed, 0 recycled; write=0.000 s, sync=0.002 s, total=0.003 s
>
> I think that it's reasonable for the sort of people who turn log_checkpoints
> on to also get the sync summary line, thus it being logged at LOG level.

In that case, wouldn't it make more sense to add a couple of more
fields to the existing line? Easier to post-process the logfile that
way...

> The detail on individual syncs might go to DEBUG2 or lower instead of
> DEBUG1 where I put it, that part I don't have a strong opinion on. It's at
> DEBUG1 to make testing the patch without a gigantic amount of log data also
> coming in easier.
>
> Right now the code is written such that all the calls that grab timing
> information are wrapped around "ifdef DEBUG_FSYNC", which is a variable set
> to 1 that could be a compile-time option like DEBUG_DEADLOCK, to allow
> turning this code path off at build time. I personally think that if you're
> already making an fsync call and have log_checkpoints on, the additional
> overhead of also timing that fsync is minimal even on platforms where timing
> is slow (I don't have such a system to test that assumption however). And
> I've seen enough value in troubleshooting nasty checkpoint sync problems
> using this patch to feel it's worth having even if it does add some
> overhead.

It sounds like it should be - but that should be possible to measure, no?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Magnus Hagander <magnus(at)hagander(dot)net>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-15 00:04:56
Message-ID:	4CE07928.3010507@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Magnus Hagander wrote:
>> I think that it's reasonable for the sort of people who turn log_checkpoints
>> on to also get the sync summary line, thus it being logged at LOG level.
>>
>
> In that case, wouldn't it make more sense to add a couple of more
> fields to the existing line? Easier to post-process the logfile that
> way...
>

It might. One trade-off is that if you're looking at the sync write
detail, the summary comes out in a similar form. And it was easy to put
in here--I'd have to return some new data out of the sync phase call in
order for that to show up in the main log. If there's general buy-in on
the idea, I could do all of that.

>> I personally think that if you're
>> already making an fsync call and have log_checkpoints on, the additional
>> overhead of also timing that fsync is minimal even on platforms where timing
>> is slow (I don't have such a system to test that assumption however)...
> It sounds like it should be - but that should be possible to measure, no?
>

What I was alluding to is that I know gettimeofday executes fast on my
Linux system here, so even if I did measure the overhead and showed it's
near zero that doesn't mean it will be so on every platform. The "how
long does it take to find out the current time on every supported
PostgreSQL platform?" question is one I'd like to have an answer to, but
it's hard to collect properly. All I know is that I don't have any
system where it's slow to properly test again here.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-15 19:30:53
Message-ID:	AANLkTinuwsWgbKgdZ_YvN7O0=GxAtEEVLXXyce6kJubL@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Nov 14, 2010 at 7:04 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> It might. One trade-off is that if you're looking at the sync write detail,
> the summary comes out in a similar form. And it was easy to put in
> here--I'd have to return some new data out of the sync phase call in order
> for that to show up in the main log. If there's general buy-in on the idea,
> I could do all of that.

+1 for that approach.

> What I was alluding to is that I know gettimeofday executes fast on my Linux
> system here, so even if I did measure the overhead and showed it's near zero
> that doesn't mean it will be so on every platform. The "how long does it
> take to find out the current time on every supported PostgreSQL platform?"
> question is one I'd like to have an answer to, but it's hard to collect
> properly. All I know is that I don't have any system where it's slow to
> properly test again here.

I would be very surprised if we can find a system where gettimeofday()
takes a significant amount of time compared with fsync(). It might be
(probably is) too expensive to stick into code paths that are heavily
CPU-bounded, but surely the cost here is going to be dwarfed by the
fsync(), no? Unless maybe there's no I/O to be done anyway, but that
case doesn't seem important to optimize for. I think making this code
dependent on ifdef DEBUG_FSYNC is going to make in order of magnitude
less useful, so I'd really rather not do that unless we have some hard
evidence that there is a real problem to worry about here. Making it
conditional on log_checkpoints seems entirely sufficient to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Greg Smith <greg(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-15 19:48:21
Message-ID:	5308.1289850501@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I would be very surprised if we can find a system where gettimeofday()
> takes a significant amount of time compared with fsync(). It might be
> (probably is) too expensive to stick into code paths that are heavily
> CPU-bounded, but surely the cost here is going to be dwarfed by the
> fsync(), no? Unless maybe there's no I/O to be done anyway, but that
> case doesn't seem important to optimize for.

I'm not sure I buy that --- the whole point of spread checkpoints is
that we hope the I/O happens before we actually call fsync.

> Making it
> conditional on log_checkpoints seems entirely sufficient to me.

But I'll agree with that. If you're turning on log_checkpoints then
you've given the system permission to indulge in extra overhead for
monitoring.

regards, tom lane

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-15 20:09:31
Message-ID:	4CE1937B.60603@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

So my task list is:

0) Rebase against the HEAD that just code related to this touched today

1) Assume that log_checkpoints is sufficient control over whether the
timing overhead added is worth collecting, and therefore remove the
half-baked idea of also wrapping with a compile-time option.

2) Have the sync summary returned upwards, so it can be put onto the
same line as the rest of the rest of the log_checkpoint info.

All seems reasonable to me. Will rev a new patch by tomorrow.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Smith <greg(at)2ndquadrant(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-15 20:19:00
Message-ID:	AANLkTi=g_sNqDcEMM6Y45ZasJHOgNXWAz-zSqzoYqcQH@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 15, 2010 at 2:48 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I would be very surprised if we can find a system where gettimeofday()
>> takes a significant amount of time compared with fsync(). It might be
>> (probably is) too expensive to stick into code paths that are heavily
>> CPU-bounded, but surely the cost here is going to be dwarfed by the
>> fsync(), no? Unless maybe there's no I/O to be done anyway, but that
>> case doesn't seem important to optimize for.
>
> I'm not sure I buy that --- the whole point of spread checkpoints is
> that we hope the I/O happens before we actually call fsync.

Possibly. Even then, it's not guaranteed, because it might well be
that fsync() has to do more grovelling through system data structures
to determine that there is in fact no work to be done than what
gettimeofday() has to do to read off a counter somewhere...

>> Making it
>> conditional on log_checkpoints seems entirely sufficient to me.
>
> But I'll agree with that. If you're turning on log_checkpoints then
> you've given the system permission to indulge in extra overhead for
> monitoring.

...but if we agree on the way forward then the details don't matter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-22 01:46:29
Message-ID:	AANLkTinccLNKDF8mCrfR59ky1BO32PmNL2MeumUC7Z4k@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 15, 2010 at 3:09 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> So my task list is:
>
> 0) Rebase against the HEAD that just code related to this touched today
>
> 1) Assume that log_checkpoints is sufficient control over whether the timing
> overhead added is worth collecting, and therefore remove the half-baked idea
> of also wrapping with a compile-time option.
>
> 2) Have the sync summary returned upwards, so it can be put onto the same
> line as the rest of the rest of the log_checkpoint info.
>
> All seems reasonable to me. Will rev a new patch by tomorrow.

Did this get eaten by the email goblin, or you're still working on it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-22 19:27:30
Message-ID:	AANLkTimOw48vy3kznwjQQmGWzCDQp=mVGOO=ZvLJzEy2@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 15, 2010 at 12:09 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> So my task list is:
>
> 0) Rebase against the HEAD that just code related to this touched today
>
> 1) Assume that log_checkpoints is sufficient control over whether the timing
> overhead added is worth collecting, and therefore remove the half-baked idea
> of also wrapping with a compile-time option.
>
> 2) Have the sync summary returned upwards, so it can be put onto the same
> line as the rest of the rest of the log_checkpoint info.
>
> All seems reasonable to me. Will rev a new patch by tomorrow.

For the individual file sync times emitted under debug1, it would be
very handy if the file being synced was identified, for example
"relation base/16384/16523". Rather than being numbered sequentially
within a given checkpoint.

Cheers,

Jeff

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-24 06:02:11
Message-ID:	4CECAA63.1070908@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas wrote:
> Did this get eaten by the email goblin, or you're still working on it?
>

Fell behind due to an unfortunately timed bit of pneumonia. Hurray for
the health benefits of cross country flights. Will fix this up, rebase
my other patch, and head toward some more review/'Fest cleanup now that
I'm feeling better.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-24 11:56:33
Message-ID:	AANLkTinu4LN+7q7SUSTLech+mbVtSTSZOacjFraSps-+@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Nov 24, 2010 at 1:02 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Robert Haas wrote:
>> Did this get eaten by the email goblin, or you're still working on it?
>
> Fell behind due to an unfortunately timed bit of pneumonia. Hurray for the
> health benefits of cross country flights. Will fix this up, rebase my other
> patch, and head toward some more review/'Fest cleanup now that I'm feeling
> better.

Ouch. Fringe benefits of consulting. Thanks for the update.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-26 17:25:44
Message-ID:	4CEFED98.50501@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Janes wrote:
> For the individual file sync times emitted under debug1, it would be
> very handy if the file being synced was identified, for example
> "relation base/16384/16523".

I was in the middle of looking into adding that already, so consider
that part of the target for the next update I'm working on.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-30 16:38:13
Message-ID:	4CF52875.9040806@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Janes wrote:
> For the individual file sync times emitted under debug1, it would be
> very handy if the file being synced was identified, for example
> "relation base/16384/16523". Rather than being numbered sequentially
> within a given checkpoint.
>

I was numbering them sequentially so that it's straightforward to graph
the sync times in an external analysis tool, but the relation data is
helpful too. New patch reflecting all upthread suggestions is
attached. The output looks like this now at DEBUG1:

LOG: checkpoint starting: xlog
DEBUG: checkpoint sync: number=1 file=base/16424/11645
time=11589.549000 msec
DEBUG: checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
DEBUG: checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
DEBUG: checkpoint sync: number=6 file=base/16424/16425_fsm
time=2.921000 msec
DEBUG: checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
DEBUG: checkpoint sync: number=8 file=base/16424/16428_fsm
time=1.654000 msec
DEBUG: checkpoint sync: number=9 file=base/16424/16442 time=7.920000 msec
DEBUG: checkpoint sync: number=10 file=base/16424/16428_vm
time=2.613000 msec
DEBUG: checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
DEBUG: checkpoint sync: number=12 file=base/16424/16437_fsm
time=2.638000 msec
DEBUG: checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
DEBUG: checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
DEBUG: checkpoint sync: number=15 file=base/16424/16437_vm
time=8.686000 msec
DEBUG: checkpoint sync: number=16 file=base/16424/16425_vm
time=5.984000 msec
LOG: checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s,
total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s

I kept the units for the DEBUG level ones in msec because that's a
better scale for the common really short syncs during good behavior.
But the summary info in seconds now appears at the end of the existing
"checkpoint complete" message, so only one line to parse for those
looking to analyze the gross checkpoint data. That looks to work well
enough for finding situations like the big ext3 spikes. You can easily
see one in this example by the fact that "longest=11.589 s" is almost
the entirety of "sync=11.715 s". That's the really key thing there's
currently no visibility into, that's made obvious with this patch.

This might be ready for some proper review now. I know there's at least
one blatant bug still in here I haven't found yet, related to how the
averages are computed. I saw this once:

LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log
file(s) added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s,
total=0.001 s; sync files=0, longest=0.000 s,
average=-9223372036854775808.-2147483 s

After an immediate checkpoint, so at least one path not quite right yet.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD

Attachment	Content-Type	Size
log-sync-v3.patch	text/x-patch	5.6 KB

From:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-30 18:31:29
Message-ID:	AANLkTinz+D9=+aKf=4iBY3QiE++_rTQTKSpseHJCYOtA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

2010/11/30 Greg Smith <greg(at)2ndquadrant(dot)com>:
> Jeff Janes wrote:
>>
>> For the individual file sync times emitted under debug1, it would be
>> very handy if the file being synced was identified, for example
>> "relation base/16384/16523". Rather than being numbered sequentially
>> within a given checkpoint.
>>
>
> I was numbering them sequentially so that it's straightforward to graph the
> sync times in an external analysis tool, but the relation data is helpful
> too. New patch reflecting all upthread suggestions is attached. The output
> looks like this now at DEBUG1:
>
> LOG: checkpoint starting: xlog
> DEBUG: checkpoint sync: number=1 file=base/16424/11645 time=11589.549000
> msec
> DEBUG: checkpoint sync: number=2 file=base/16424/16438 time=16.148000 msec
> DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=53.530000 msec
> DEBUG: checkpoint sync: number=4 file=base/16424/16447 time=10.214000 msec
> DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=1.499000 msec
> DEBUG: checkpoint sync: number=6 file=base/16424/16425_fsm time=2.921000
> msec
> DEBUG: checkpoint sync: number=7 file=base/16424/16437.1 time=4.237000 msec
> DEBUG: checkpoint sync: number=8 file=base/16424/16428_fsm time=1.654000
> msec
> DEBUG: checkpoint sync: number=9 file=base/16424/16442 time=7.920000 msec
> DEBUG: checkpoint sync: number=10 file=base/16424/16428_vm time=2.613000
> msec
> DEBUG: checkpoint sync: number=11 file=base/16424/11618 time=1.468000 msec
> DEBUG: checkpoint sync: number=12 file=base/16424/16437_fsm time=2.638000
> msec
> DEBUG: checkpoint sync: number=13 file=base/16424/16428 time=2.883000 msec
> DEBUG: checkpoint sync: number=14 file=base/16424/16425 time=3.369000 msec
> DEBUG: checkpoint sync: number=15 file=base/16424/16437_vm time=8.686000
> msec
> DEBUG: checkpoint sync: number=16 file=base/16424/16425_vm time=5.984000
> msec
> LOG: checkpoint complete: wrote 2074 buffers (50.6%); 0 transaction log
> file(s) added, 0 removed, 3 recycled; write=0.617 s, sync=11.715 s,
> total=22.167 s; sync files=16, longest=11.589 s, average=0.724 s
>
> I kept the units for the DEBUG level ones in msec because that's a better
> scale for the common really short syncs during good behavior. But the
> summary info in seconds now appears at the end of the existing "checkpoint
> complete" message, so only one line to parse for those looking to analyze
> the gross checkpoint data. That looks to work well enough for finding
> situations like the big ext3 spikes. You can easily see one in this example
> by the fact that "longest=11.589 s" is almost the entirety of "sync=11.715
> s". That's the really key thing there's currently no visibility into,
> that's made obvious with this patch.

wonderfull.

>
> This might be ready for some proper review now. I know there's at least one
> blatant bug still in here I haven't found yet, related to how the averages
> are computed. I saw this once:
>
> LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
> sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s
>
> After an immediate checkpoint, so at least one path not quite right yet.
>
> --
> Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-11-30 20:15:55
Message-ID:	AANLkTinxHedsbMCq4fQT1Lx-iBed8QewmskpujWmJ4yA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

Hi Greg,

Thanks for the update.

> This might be ready for some proper review now. I know there's at least one
> blatant bug still in here I haven't found yet, related to how the averages
> are computed.

Yes, the blatant bug:

average_sync_time = CheckpointStats.ckpt_longest_sync /
CheckpointStats.ckpt_sync_rels;

That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.

> I saw this once:
>
> LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
> added, 0 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.001 s;
> sync files=0, longest=0.000 s, average=-9223372036854775808.-2147483 s
>
> After an immediate checkpoint, so at least one path not quite right yet.

Not clear what the right thing to do here is. I guess we should
special case the div by zero to yield zero for the average?

The patch is in unified diff rather than context diff. Doesn't bother
me, but the wiki on doing a review says...

Cheers,

Jeff

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-03 01:16:39
Message-ID:	AANLkTi=4A34JXoZmnC04pm0CtedPRB0JNoSgJi47z_gZ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Nov 30, 2010 at 12:15 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Nov 30, 2010 at 8:38 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>
>
> Hi Greg,
>
> Thanks for the update.
>
>
>
>> This might be ready for some proper review now. I know there's at least one
>> blatant bug still in here I haven't found yet, related to how the averages
>> are computed.
>
> Yes, the blatant bug:
>
> average_sync_time = CheckpointStats.ckpt_longest_sync /
> CheckpointStats.ckpt_sync_rels;
>
> That should clearly be ckpt_agg_sync_time, not ckpt_longest_sync.

I've attached a tiny patch to apply over yours, to deal with this and
with the case where no files are synced.

Combining this instrumentation patch with the backend sync one already
committed, the net result under log_min_messages=debug1is somewhat
undesirable in that I can now see the individual sync times for the
syncs done by the checkpoint writer, but I do not get times for the
syncs done by the backend (I only get informed of their existence).

I don't know what I would propose to fix this. Having the reportage
of sync time of backend syncs be controlled by log_checkpoints seems
somewhat perverse, but the only alternative I see is to have
log_min_messages=debug1 always report times for the backend syncs. Or
to have them go unreported altogether.

Cheers,

Jeff

Attachment	Content-Type	Size
patch	application/octet-stream	1.1 KB

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-05 21:23:36
Message-ID:	4CFC02D8.70302@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Janes wrote:
> I've attached a tiny patch to apply over yours, to deal with this and
> with the case where no files are synced.
>

Thanks for that. That obvious error eluded me because in most of the
patch update testing I was doing (on ext3), the longest sync was always
about the same length as the total sync time.

Attached patch (in correct diff form this time!) collects up all
changes. That includes elimination of a potential race condition if
someone changes log_checkpoints while a long sync phase is executing. I
don't know whether that can happen, and it obviously won't give accurate
stats going back to the beginning of the checkpoint in that case, but
it tries to defend aginst producing complete garbage if that value
changes out from under it.

This is the first version of this patch I feel fairly good about; no
open concerns left on my side. Jeff, since you're now the de-facto
credited reviewer of this one by virtue of suggesting bug fixes, could
you take this update out for a spin too?

> Combining this instrumentation patch with the backend sync one already
> committed, the net result under log_min_messages=debug1is somewhat
> undesirable in that I can now see the individual sync times for the
> syncs done by the checkpoint writer, but I do not get times for the
> syncs done by the backend (I only get informed of their existence).
>

On a properly working system, backend syncs shouldn't be happening. So
if you see them, I think the question shouldn't be "how long are they
taking?", it's "how do I get rid of them?" From that perspective,
knowing of their existence is sufficient to suggest the necessary tuning
changes, such as dropping bgwriter_delay.

When you get into a situation where they are showing up, a whole lot of
them can happen in a very brief period; enough that I'd actually be
concerned about the added timing overhead, something I normally don't
care about very much. That's the main reason I didn't bother
instrumenting those too--the situations where they happen are ones
already running low on resources.

Big writes for things that can only be written out at checkpoint time,
on the other hand, are unavoidable in the current design. Providing
detail on them is going to be relevant unless there's a major
refactoring of how those happen.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us

Attachment	Content-Type	Size
log-sync-v4.patch	text/x-patch	7.0 KB

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-07 17:09:59
Message-ID:	AANLkTikTRcnUgN1AhzRwDXTqO=9-MtDQLMzmXG-nw0kH@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Dec 5, 2010 at 1:23 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Jeff Janes wrote:
>>
>> I've attached a tiny patch to apply over yours, to deal with this and
>> with the case where no files are synced.
>>
>
> Thanks for that. That obvious error eluded me because in most of the patch
> update testing I was doing (on ext3), the longest sync was always about the
> same length as the total sync time.
>
> Attached patch (in correct diff form this time!) collects up all changes.
> That includes elimination of a potential race condition if someone changes
> log_checkpoints while a long sync phase is executing. I don't know whether
> that can happen, and it obviously won't give accurate stats going back to
> the beginning of the checkpoint in that case, but it tries to defend aginst
> producing complete garbage if that value changes out from under it.

It looks to me that the checkpoint writer only reloads settings
between checkpoints, so log_checkpoints cannot change under it.
But if it can, the behavior of the patch is about as reasonable as one
can expect, I think.

>
> This is the first version of this patch I feel fairly good about; no open
> concerns left on my side. Jeff, since you're now the de-facto credited
> reviewer of this one by virtue of suggesting bug fixes, could you take this
> update out for a spin too?

It applies and builds cleanly, and passes make check with -enable-cassert.
The details of log_checkpoints output are not documented, so no
document patches are needed.
I cannot think of any reasonable way to add a check into make check
framework, so I don't think one is needed.
It does what it says and what we want, and I don't see any dangers.

I haven't noticed any slow-downs. It is listed under performance, but
it is intended to diagnose, not fix, performance problems.

It seems to meet the coding style. It uses existing macros, so should
be as portable as the existing PG code is.

Comments seem sufficient. The use of INSTR_TIME_SET_ZERO as a flag
for detecting changing log_checkpoints maybe could use a comment, but
it doesn't take much thought to figure out that that is what is going
on.

I could not make it crash.

I'm marking it as ready for committer.

>
>> Combining this instrumentation patch with the backend sync one already
>> committed, the net result under log_min_messages=debug1is somewhat
>> undesirable in that I can now see the individual sync times for the
>> syncs done by the checkpoint writer, but I do not get times for the
>> syncs done by the backend (I only get informed of their existence).
>>
>
> On a properly working system, backend syncs shouldn't be happening. So if
> you see them, I think the question shouldn't be "how long are they taking?",
> it's "how do I get rid of them?" From that perspective, knowing of their
> existence is sufficient to suggest the necessary tuning changes, such as
> dropping bgwriter_delay.

OK. I was trying to figure out if the syncs were entangled (all ended
at about the same time, regardless of when they started). But I can
see that that is basically a hacker thing to do, not production, so I
can add in my own logging at the end of the backend sync for my
purposes.

>
> When you get into a situation where they are showing up, a whole lot of them
> can happen in a very brief period;

In my test cases, the syncs that the backends were doing were almost
always to the same file that the checkpoint writer was already choking
on (so they are entangled simply by virtue of that). So very quickly
all the backends hit the same wall and thunked to a halt. This is
probably a feature of trying to use pgbench as the basis to get a very
artificial model. I need to find a way to get dozens or hundreds of
files going. Maybe pgbench could be hacked so that each connection
used a different schema.

Cheers,

Jeff

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-10 02:26:36
Message-ID:	4D018FDC.4020902@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jeff Janes wrote:
> In my test cases, the syncs that the backends were doing were almost
> always to the same file that the checkpoint writer was already choking
> on (so they are entangled simply by virtue of that). So very quickly
> all the backends hit the same wall and thunked to a halt. This is
> probably a feature of trying to use pgbench as the basis to get a very
> artificial model.

Yes--pgbench has some problems like you describe, ones that are a bit
different than the way I've seen fsync writes get in each other's way in
the production systems I've looked at. That's good if you really want
to provoke this behavior, which is one reason why I've used as an
example for my patches so far (the other being that it's already
available in everyone's installation). But it's tough to get it to act
more like a real-world system, which don't have quite so many localized
updates, without cranking the scale way up. And that then tends to
aggravate other problems too.

The 8.3 checkpoint spreading work also got some useful results using the
dbt-2 benchmark. I'm at the point where I think I need to return to
that test program for what I'm doing now. I'd encourage you to try that
out too if you get a chance.

Thanks for the feedback and the review. I hope you appreciate now why I
suggested you wait for the stuff I was submitting before getting back
into the sorted checkpoint topic again. That should be a lot easier to
make sense of with this instrumentation in place.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-13 03:20:52
Message-ID:	AANLkTin7f2xKCx7W4zp_-ADoOy9Yp9dpmua0BEB_Ljae@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Dec 5, 2010 at 4:23 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Jeff Janes wrote:
>>
>> I've attached a tiny patch to apply over yours, to deal with this and
>> with the case where no files are synced.
>>
>
> Thanks for that. That obvious error eluded me because in most of the patch
> update testing I was doing (on ext3), the longest sync was always about the
> same length as the total sync time.
>
> Attached patch (in correct diff form this time!) collects up all changes.
> That includes elimination of a potential race condition if someone changes
> log_checkpoints while a long sync phase is executing. I don't know whether
> that can happen, and it obviously won't give accurate stats going back to
> the beginning of the checkpoint in that case, but it tries to defend aginst
> producing complete garbage if that value changes out from under it.
>
> This is the first version of this patch I feel fairly good about; no open
> concerns left on my side. Jeff, since you're now the de-facto credited
> reviewer of this one by virtue of suggesting bug fixes, could you take this
> update out for a spin too?

I took a look at this and it looks generally good, but I'm wondering
why md.c is converting the results from an exact value to a floating
point, only to have xlog.c turn around and convert back to an integer.
I think it could just return milliseconds directly, or if you're
worried about a checkpoint that takes more than 24 days to complete,
seconds and microseconds. Or am I missing something?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-13 08:19:46
Message-ID:	4D05D722.5040205@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas wrote:
> I'm wondering why md.c is converting the results from an exact value to a floating
> point, only to have xlog.c turn around and convert back to an integer.
> I think it could just return milliseconds directly, or if you're
> worried about a checkpoint that takes more than 24 days to complete,
> seconds and microseconds.

md.c is printing the value as a float, so I converted early to a double
and then percolated it upward from there. More an artifact of how the
code grew from its initial form than an intentional decision. I see your
point that making elapsed, total_elapsed, ckpt_agg_sync_time, and
ckpt_longest_sync all the same type of integer that
INSTR_TIME_GET_MICROSEC returns would reduce the potential for rounding
abberations. I could do another rev of the patch tomorrow with that
change if you'd prefer it that way. I don't have a strong opinion about
that implementation detail; given that xlog.c is printing a less
fine-grained time anyway (seconds with 3 digits vs. msec with 3 digits)
it seems unlikely to run into a real problem here.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-13 12:56:24
Message-ID:	AANLkTik_AoKf_oXMkrkwr8NHoavL4mkkzzv1t+dG9Nf8@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Dec 13, 2010 at 3:19 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Robert Haas wrote:
>>
>> I'm wondering why md.c is converting the results from an exact value to a
>> floating
>> point, only to have xlog.c turn around and convert back to an integer.
>> I think it could just return milliseconds directly, or if you're
>> worried about a checkpoint that takes more than 24 days to complete,
>> seconds and microseconds.
>
> md.c is printing the value as a float, so I converted early to a double and
> then percolated it upward from there. More an artifact of how the code grew
> from its initial form than an intentional decision. I see your point that
> making elapsed, total_elapsed, ckpt_agg_sync_time, and ckpt_longest_sync all
> the same type of integer that INSTR_TIME_GET_MICROSEC returns would reduce
> the potential for rounding abberations. I could do another rev of the patch
> tomorrow with that change if you'd prefer it that way. I don't have a
> strong opinion about that implementation detail; given that xlog.c is
> printing a less fine-grained time anyway (seconds with 3 digits vs. msec
> with 3 digits) it seems unlikely to run into a real problem here.

Yeah, it'd probably be OK anyway, but I think it doesn't really cost
us anything to avoid the unnecessary conversion steps, so count me as
a vote for doing it that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-14 12:22:16
Message-ID:	4D076178.1070109@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas wrote:
> I took a look at this and it looks generally good, but I'm wondering
> why md.c is converting the results from an exact value to a floating
> point, only to have xlog.c turn around and convert back to an integer.
> I think it could just return milliseconds directly, or if you're
> worried about a checkpoint that takes more than 24 days to complete,
> seconds and microseconds.

Attached patch now does something like this, except without the
roll-over concern. INSTR_TIME_GET_MICROSEC returns a uint64 value. I
just made that the storage format for all these values until they're
converted for display. Test output:

LOG: checkpoint starting: xlog
DEBUG: checkpoint sync: number=1 file=base/16385/16480 time=10422.859 msec
DEBUG: checkpoint sync: number=2 file=base/16385/16475_vm time=2896.614
msec
DEBUG: checkpoint sync: number=3 file=base/16385/16475.1 time=57.836 msec
DEBUG: checkpoint sync: number=4 file=base/16385/16466 time=20.080 msec
DEBUG: checkpoint sync: number=5 file=base/16385/16463 time=74.926 msec
DEBUG: checkpoint sync: number=6 file=base/16385/16482 time=74.263 msec
DEBUG: checkpoint sync: number=7 file=base/16385/16475_fsm time=7.062 msec
DEBUG: checkpoint sync: number=8 file=base/16385/16475 time=35.164 msec
LOG: checkpoint complete: wrote 2143 buffers (52.3%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=1.213 s, sync=13.589 s,
total=24.744 s; sync files=8, longest=10.422 s, average=1.698 s

This shows the hard truncation used, so 10422.859 msec becomes 10.422
s. I don't think allowing up to 0.999ms of error there is a problem
given the expected scale. But since none of the precision is lost until
the end, that could be changed with only touching the final display
formatting conversion of the value. Following your general idea
further, why throw away any resolution inside of md.c; let xlog.c decide
how to show it.

Note that I also fixed the DEBUG level lines to only show their actual
precision. Before that was printing 6 digits to the right of the
decimal point each time, the last three of which were always 0.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Attachment	Content-Type	Size
log-sync-v5.patch	text/x-patch	7.0 KB

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-14 14:29:05
Message-ID:	1292336412-sup-4774@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I gave this patch a look and it seems pretty good to me, except that I'm
uncomfortable with the idea of mdsync filling in the details for
CheckpointStats fields directly. Would it work to pass a struct (say
SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
mdsync, have this function fill it, and return it back so that
CheckPointBuffers copies the data from this struct into CheckpointStats?

Another minor nitpick: inside the block when you call FileSync, why
check for log_checkpoints at all? Seems to me that just checking for
zero of sync_start should be enough. Alternatively, seems simpler to
just have a local var with the value of log_checkpoints at the start of
mdsync and use that throughout the function. (Surely if someone turns
off log_checkpoints in the middle of a checkpoint, it's not really a
problem that we collect and report stats during that checkpoint.)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Greg Smith <greg(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-14 14:34:55
Message-ID:	AANLkTi=F+dSkugxK=9aq_zLM1biYPXzZCL5BVvMbpnFz@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Dec 14, 2010 at 9:29 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> I gave this patch a look and it seems pretty good to me, except

Err, woops. I just committed this as-is. Sorry.

> that I'm
> uncomfortable with the idea of mdsync filling in the details for
> CheckpointStats fields directly. Would it work to pass a struct (say
> SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
> mdsync, have this function fill it, and return it back so that
> CheckPointBuffers copies the data from this struct into CheckpointStats?
>
> Another minor nitpick: inside the block when you call FileSync, why
> check for log_checkpoints at all? Seems to me that just checking for
> zero of sync_start should be enough. Alternatively, seems simpler to
> just have a local var with the value of log_checkpoints at the start of
> mdsync and use that throughout the function. (Surely if someone turns
> off log_checkpoints in the middle of a checkpoint, it's not really a
> problem that we collect and report stats during that checkpoint.)

Neither of these things bothers me, but we can certainly discuss...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Greg Smith <greg(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-14 16:47:11
Message-ID:	1292345149-sup-4256@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Excerpts from Robert Haas's message of mar dic 14 11:34:55 -0300 2010:
> On Tue, Dec 14, 2010 at 9:29 AM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> > I gave this patch a look and it seems pretty good to me, except
>
> Err, woops. I just committed this as-is. Sorry.

I noticed :-)

> > that I'm
> > uncomfortable with the idea of mdsync filling in the details for
> > CheckpointStats fields directly. Would it work to pass a struct (say
> > SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
> > mdsync, have this function fill it, and return it back so that
> > CheckPointBuffers copies the data from this struct into CheckpointStats?
> >
> > Another minor nitpick: inside the block when you call FileSync, why
> > check for log_checkpoints at all? Seems to me that just checking for
> > zero of sync_start should be enough. Alternatively, seems simpler to
> > just have a local var with the value of log_checkpoints at the start of
> > mdsync and use that throughout the function. (Surely if someone turns
> > off log_checkpoints in the middle of a checkpoint, it's not really a
> > problem that we collect and report stats during that checkpoint.)
>
> Neither of these things bothers me, but we can certainly discuss...

Well, the second one was just about simplifying it, so never mind that.
But referring to CheckpointStats in md.c seems to me to be a violation
of modularity that ought to be fixed.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Greg Smith <greg(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-14 18:49:43
Message-ID:	AANLkTikWDQcr4MWU0zJbTjW_hoiiWAKNhYRu9APWT21z@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Dec 14, 2010 at 11:47 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
>> > that I'm
>> > uncomfortable with the idea of mdsync filling in the details for
>> > CheckpointStats fields directly. Would it work to pass a struct (say
>> > SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
>> > mdsync, have this function fill it, and return it back so that
>> > CheckPointBuffers copies the data from this struct into CheckpointStats?
>
> But referring to CheckpointStats in md.c seems to me to be a violation
> of modularity that ought to be fixed.

Hmm. I guess I can't get worked up about it. We could do what you
propose, but I'm not sure what purpose it would serve. It's not as if
mdsync() can possibly serve any other purpose other than to be the
guts of a checkpoint. It seems to me that if we really wanted to get
this disentangled from the checkpoint stats stuff we'd also need to
think about moving that elog(DEBUG1) I added out of the function
somehow, but I think that would just be a lot of notional complexity
for no particular benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-15 14:22:55
Message-ID:	4D08CF3F.3070004@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alvaro Herrera wrote:
> I gave this patch a look and it seems pretty good to me, except that I'm
> uncomfortable with the idea of mdsync filling in the details for
> CheckpointStats fields directly. Would it work to pass a struct (say
> SmgrSyncStats) from CheckPointBuffers to smgrsync and from there to
> mdsync, have this function fill it, and return it back so that
> CheckPointBuffers copies the data from this struct into CheckpointStats?
>

That was originally how I planned to write this bit of code. When I
realized that the CheckpointStats structure was already visible there
and stuffed with details that ultimately go into the same output line at
the end, it just didn't seem worth the extra code complexity. The
abstraction layer around md.c was not exactly airtight before I poked
that extra little hole in there, and I was aiming via the principal of a
smaller diff usually being the better patch .

If you feel strongly that the result led to a bad abstraction violation,
I'll submit a patch to refactor it to pass a structure instead before
the next CF. I appreciate your concern, I'm just not sure it's worth
spending time on. What I'd really like to do is refactor out major
parts of the leaky md/smgr layers altogether instead, but that's
obviously a bigger project.

> Another minor nitpick: inside the block when you call FileSync, why
> check for log_checkpoints at all? Seems to me that just checking for
> zero of sync_start should be enough. Alternatively, seems simpler to
> just have a local var with the value of log_checkpoints at the start of
> mdsync and use that throughout the function. (Surely if someone turns
> off log_checkpoints in the middle of a checkpoint, it's not really a
> problem that we collect and report stats during that checkpoint.)
>

And now you're just getting picky! This is a useful observation though,
and I'll try to include that fix along with the next general checkpoint
overhaul patch I submit. Doesn't seem worth going through the trouble
of committing that minor rework on its own, I'll slip it into the next
useful thing that touches this area I do. Thanks for the hint, this
would work better than what I did.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-15 14:53:02
Message-ID:	AANLkTimfCQYq4QdKi_ocsJWu_UW8LTnrEd7MGVrk=pBA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Dec 15, 2010 at 9:22 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> patch I submit. Doesn't seem worth going through the trouble of committing
> that minor rework on its own, I'll slip it into the next useful thing that
> touches this area I do. Thanks for the hint, this would work better than
> what I did.

Well, if I'm the one committing it, I'll pull that part out again and
commit it separately. Not sure if that affects your calculus, but I
much prefer patches that don't try to do ancillary things along the
way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Greg Smith <greg(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-16 11:48:17
Message-ID:	4D09FC81.5080700@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas wrote:
> On Wed, Dec 15, 2010 at 9:22 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>
>> patch I submit. Doesn't seem worth going through the trouble of committing
>> that minor rework on its own, I'll slip it into the next useful thing that
>> touches this area I do. Thanks for the hint, this would work better than
>> what I did.
>>
>
> Well, if I'm the one committing it, I'll pull that part out again and
> commit it separately. Not sure if that affects your calculus, but I
> much prefer patches that don't try to do ancillary things along the
> way.
>

I meant that I'd bundle it into the block of time I spend on that, and
likely submit with something else that touches the same area. Obviously
the correction patch would be better on its own when being handed over
to a committer.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Greg Smith <greg(at)2ndquadrant(dot)com>
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Instrument checkpoint sync calls
Date:	2010-12-16 13:55:59
Message-ID:	AANLkTiknFMzcMYqfHDxfCpYWkbTrr9mT+0yxt=FYA3Bc@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 16, 2010 at 6:48 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> I meant that I'd bundle it into the block of time I spend on that, and
> likely submit with something else that touches the same area. Obviously the
> correction patch would be better on its own when being handed over to a
> committer.

Oh, that's very different. Never mind!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company