Re: postgresql latency & bgwriter not doing its job

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: postgresql latency & bgwriter not doing its job
Date: 2014-08-25 20:23:40
Message-ID: alpine.DEB.2.10.1408251900211.11151@sto
Lists: pgsql-hackers


Hello pgdevs,

I've been playing with pg for some time now to try to reduce the maximum
latency of simple requests, to have a responsive server under small to
medium load.

On an old computer with a software RAID5 HDD attached, pgbench
simple-update script run for some time (scale 100, fillfactor 95)

pgbench -M prepared -N -c 2 -T 500 -P 1 ...

gives 300 tps. However, this performance is really 1000+ tps for a few
seconds followed by 16 seconds at about 0 tps for the checkpoint-induced
IO storm. The server is totally unresponsive 75% of the time. That's
bandwidth optimization for you. Hmmm... why not.

Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
max):

pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...

The same thing more or less happens in a delayed fashion... You get 50 tps
for some time, followed by sections of 15 seconds at 0 tps for the
checkpoint when the segments are full... the server is unresponsive about
10% of the time (one in ten transactions is late by more than 200 ms).

This is not satisfying; pg should be able to handle that load easily.

The culprit I found is "bgwriter", which is basically doing nothing to
prevent the coming checkpoint IO storm, even though there would be ample
time to write the accumulating dirty pages so that the checkpoint would
find a clean field and pass in a blink. Indeed, at the end of the
500-second throttled test, "pg_stat_bgwriter" says:

buffers_checkpoint = 19046
buffers_clean = 2995

This suggests that bgwriter took it upon itself to write only about 6 pages
per second, where at least 50 would have been needed for the load and could
have been handled by the hardware without any problem.
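
For reference, these counters can be read directly from the stats view,
e.g.:

   SELECT buffers_checkpoint, buffers_clean, buffers_backend
     FROM pg_stat_bgwriter;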

I have not found any means to force bgwriter to send writes when it can.
(Well, I have: create a process which sends "CHECKPOINT" every 0.2
seconds... it works more or less, but this is not my point:-)

The bgwriter control parameters allow controlling the maximum number of
pages (bgwriter_lru_maxpages) written per round (bgwriter_delay), and a
multiplier (bgwriter_lru_multiplier) which drives a heuristic estimate of
how many pages will be needed, so as to make them available. This may be
nice in some settings, but it is not adapted to the write-oriented OLTP
load tested with pgbench.
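
For reference, the defaults are (if I read the documentation correctly):

   bgwriter_delay = 200ms
   bgwriter_lru_maxpages = 100
   bgwriter_lru_multiplier = 2.0

so even at full tilt it would write at most 500 pages per second, and the
multiplier heuristic usually keeps it far below that.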

The problem is that with an update load on a database that fits in memory
there is not much need for "new" pages, even though pages are being
dirtied (about 50 per second), so it seems that the heuristic decides
not to write much. The net result of all this cleverness is that when the
checkpoint arrives, several thousand pages have to be written and the
server is offline for some time.

ISTM that bgwriter lacks at least some "min pages" setting so that it
could be induced to free this many pages when it can. That would be a start.

A better feature would be for it to adapt itself to take advantage of the
available IOPS, depending on the load induced by other tasks (vacuum,
queries...), in a preventive manner, so as to avoid delaying what can be
done right now under a small load, and thus avoid later IO storms. This
would suggest a setting to declare the expected IOPS capability of the
underlying hardware, just as some settings already declare the expected
available memory.

Note that this preventive approach could also improve the "bandwidth"
measure: currently, when pgbench is running at maximum speed before the
checkpoint storm, nothing but WAL is written to disk, although some dirty
pages could probably also be written. When the checkpoint arrives, fewer
pages would need to be written, so the storm would be shorter.

Any thoughts on this latency issue? Am I missing something?

--
Fabien.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-25 22:45:58
Message-ID: 53FBBCA6.90102@agliodbs.com
Lists: pgsql-hackers

On 08/25/2014 01:23 PM, Fabien COELHO wrote:
>
> Hello pgdevs,
>
> I've been playing with pg for some time now to try to reduce the maximum
> latency of simple requests, to have a responsive server under small to
> medium load.
>
> On an old computer with a software RAID5 HDD attached, pgbench simple
> update script run for a some time (scale 100, fillfactor 95)
>
> pgbench -M prepared -N -c 2 -T 500 -P 1 ...
>
> gives 300 tps. However this performance is really +1000 tps for a few
> seconds followed by 16 seconds at about 0 tps for the checkpoint induced
> IO storm. The server is totally unresponsive 75% of the time. That's
> bandwidth optimization for you. Hmmm... why not.

So I think that you're confusing the roles of bgwriter vs. spread
checkpoint. What you're experiencing above is pretty common for
nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
no matter how fast the disks are), or for attempts to do spread
checkpoint on filesystems which don't support it (e.g. Ext3, HFS+). In
either case, what's happening is that the *OS* is freezing all logical
and physical IO while it works to write out all of RAM, which makes me
suspect you're using Ext3 or HFS+.

Making the bgwriter more aggressive adds a significant risk of writing
the same pages multiple times between checkpoints, so it's not a simple fix.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-25 22:55:34
Message-ID: 20140825225534.GA21544@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2014-08-25 22:23:40 +0200, Fabien COELHO wrote:
> seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
> storm. The server is totally unresponsive 75% of the time. That's bandwidth
> optimization for you. Hmmm... why not.
>
> Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
> max):
>
> pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
>
> The same thing more or less happens in a delayed fashion... You get 50 tps
> for some time, followed by sections of 15 seconds at 0 tps for the
> checkpoint when the segments are full... the server is unresponsive about
> 10% of the time (one in ten transaction is late by more than 200 ms).

That's ext4 I guess? Did you check whether xfs yields a, err, more
predictable performance?

> It is not satisfying, pg should be able to handle that load easily.
>
> The culprit I found is "bgwriter", which is basically doing nothing to
> prevent the coming checkpoint IO storm, even though there would be ample
> time to write the accumulating dirty pages so that checkpoint would find a
> clean field and pass in a blink.

While I agree that the current bgwriter implementation is far from good,
note that this isn't the bgwriter's job. Its goal is to prevent backends
from having to write out buffers themselves, i.e. to ensure that there are
clean victim buffers when shared_buffers < working set.

Note that it would *not* be a good idea to make the bgwriter write out
everything, as much as possible - that'd turn sequential write io into
random write io.

Greetings,

Andres Freund


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 02:09:36
Message-ID: CAMkU=1xd4LoLVTiySiTxzoBOoGXMx50XO2C88_FbkMcNMPx43w@mail.gmail.com
Lists: pgsql-hackers

On Monday, August 25, 2014, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:

>
>
> The culprit I found is "bgwriter", which is basically doing nothing to
> prevent the coming checkpoint IO storm, even though there would be ample
> time to write the accumulating dirty pages so that checkpoint would find a
> clean field and pass in a blink. Indeed, at the end of the 500 seconds
> throttled test, "pg_stat_bgwriter" says:
>

Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?
You don't want your steady state stats polluted by the bulk load.
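
(That is, something like

   SELECT pg_stat_reset_shared('bgwriter');

issued right after the initialization step.)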

>
> buffers_checkpoint = 19046
> buffers_clean = 2995
>

Out of curiosity, what does buffers_backend show?

In any event, this almost certainly is a red herring. Whichever of the
three ways is used to write out the buffers, it is the checkpointer
that is responsible for fsyncing them, and that is where your drama is
almost certainly occurring. Writing out via one path rather than another
isn't going to change things, unless you change the fsync.

Also, are you familiar with checkpoint_completion_target, and what is it
set to?

Cheers,

Jeff


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 04:05:43
Message-ID: CAA4eK1+xiW-VoMaAFbRRR7pCNP8txguXWCVp70Yg3m5XRmuVbQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 26, 2014 at 1:53 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
>
>
> Hello pgdevs,
>
> I've been playing with pg for some time now to try to reduce the maximum
> latency of simple requests, to have a responsive server under small to
> medium load.
>
> On an old computer with a software RAID5 HDD attached, pgbench simple
> update script run for a some time (scale 100, fillfactor 95)
>
> pgbench -M prepared -N -c 2 -T 500 -P 1 ...
>
> gives 300 tps. However this performance is really +1000 tps for a few
> seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
> storm. The server is totally unresponsive 75% of the time. That's bandwidth
> optimization for you. Hmmm... why not.
>
> Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
> max):
>
> pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
>
> The same thing more or less happens in a delayed fashion... You get 50
> tps for some time, followed by sections of 15 seconds at 0 tps for the
> checkpoint when the segments are full... the server is unresponsive about
> 10% of the time (one in ten transaction is late by more than 200 ms).

I think another thing to know here is why exactly the checkpoint
storm causes tps to drop so steeply. One reason could be
that backends might need to write more WAL due to full_page_writes,
another could be contention around the buffer content lock.

To dig more into the reason, the same tests can be tried
with full_page_writes = off and/or
synchronous_commit = off to see if WAL writes are causing
tps to go down.

Similarly for checkpoints, use checkpoint_completion_target to
spread the checkpoint writes, as suggested by Jeff, to see
if that can mitigate the problem.
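
For instance, as an illustration, these are the settings to flip one at a
time in postgresql.conf for such diagnostic runs:

   full_page_writes = off                # isolate full-page-write WAL volume
   synchronous_commit = off              # isolate commit-time WAL flushes
   checkpoint_completion_target = 0.9    # spread the checkpoint writes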

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 06:12:48
Message-ID: alpine.DEB.2.10.1408260733240.4394@sto
Lists: pgsql-hackers


Hello Josh,

> So I think that you're confusing the roles of bgwriter vs. spread
> checkpoint. What you're experiencing above is pretty common for
> nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
> no matter how fast the disks are), or for attempts to do spread
> checkpoint on filesystems which don't support it (e.g. Ext3, HFS+). In
> either case, what's happening is that the *OS* is freezing all logical
> and physical IO while it works to write out all of RAM, which makes me
> suspect you're using Ext3 or HFS+.

I'm using ext4 on Debian wheezy with PostgreSQL 9.4b2.

I agree that the OS may be able to help, but this aspect does not
currently work for me at all out of the box. The "all of RAM" is really a
few thousand 8 kB pages written randomly, a few dozen MB.

Also, if pg needs advanced OS tweaking to handle a small load, ISTM that
it fails at simplicity:-(

As for checkpoint spreading, raising checkpoint_completion_target to 0.9
degrades the situation (20% of transactions are more than 200 ms late
instead of 10%, and bgwriter wrote less than 1 page per second, on a 500 s
run). So maybe there is a bug here somewhere.

> Making the bgwriter more aggressive adds a significant risk of writing
> the same pages multiple times between checkpoints, so it's not a simple fix.

Hmmm... This must be balanced with the risk of being offline. Not all
people are interested in throughput at the price of latency, so there
could be settings that help latency, even at the price of reducing
throughput (average tps). After that, it is the administrator choice to
set pg for higher throughput or lower latency.

Note that writing some "least recently used" page multiple times does not
seem to be any issue at all for me under small/medium load, especially as
the system has nothing else to do: if you have nothing else to do, there
is no cost in writing a page, even if you may have to write it again some
time later, and it helps prevent dirty page accumulation. So it seems to
me that pg can help; it is not only/merely an OS issue.

--
Fabien.


From: Fabien COELHO <fabien(dot)coelho(at)mines-paristech(dot)fr>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 06:25:58
Message-ID: alpine.DEB.2.10.1408260818500.4394@sto
Lists: pgsql-hackers


Hello Jeff,

>> The culprit I found is "bgwriter", which is basically doing nothing to
>> prevent the coming checkpoint IO storm, even though there would be ample
>> time to write the accumulating dirty pages so that checkpoint would find a
>> clean field and pass in a blink. Indeed, at the end of the 500 seconds
>> throttled test, "pg_stat_bgwriter" says:
>
> Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?

Yes, I did.

> You don't want your steady state stats polluted by the bulk load.

Sure!

>> buffers_checkpoint = 19046
>> buffers_clean = 2995
>
> Out of curiosity, what does buffers_backend show?

buffers_backend = 157

> In any event, this almost certainly is a red herring.

Possibly! It is pretty easy to reproduce... can anyone try?

> Whichever of the three ways are being used to write out the buffers, it
> is the checkpointer that is responsible for fsyncing them, and that is
> where your drama is almost certainly occurring. Writing out with one
> path rather than a different isn't going to change things, unless you
> change the fsync.

Well, ISTM that the OS does not need to wait for fsync to start writing
pages if it has received one minute of buffer writes at 50 writes per
second; some scheduler should start handling the flow somewhere... So if
bgwriter were to write the buffers it would help, but maybe there is a
better way.

> Also, are you familiar with checkpoint_completion_target, and what is it
> set to?

The default 0.5. Moving to 0.9 seems to worsen the situation.

--
Fabien.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 06:41:27
Message-ID: CAMkU=1w-Jwn3=C1JKpmU9RCvdu-jG6pxnLj5AOovkPozXqH04g@mail.gmail.com
Lists: pgsql-hackers

On Monday, August 25, 2014, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:

>
> I have not found any mean to force bgwriter to send writes when it can.
> (Well, I have: create a process which sends "CHECKPOINT" every 0.2
> seconds... it works more or less, but this is not my point:-)
>

There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes. It is currently a
hard-coded value, but it should be trivial to turn it into an experimental
GUC that you could use to test your hypothesis.

Cheers,

Jeff


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 06:43:01
Message-ID: alpine.DEB.2.10.1408260839470.4394@sto
Lists: pgsql-hackers


[oops, wrong from, resent...]

Hello Jeff,

>> The culprit I found is "bgwriter", which is basically doing nothing to
>> prevent the coming checkpoint IO storm, even though there would be ample
>> time to write the accumulating dirty pages so that checkpoint would find a
>> clean field and pass in a blink. Indeed, at the end of the 500 seconds
>> throttled test, "pg_stat_bgwriter" says:
>
> Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?

Yes, I did.

> You don't want your steady state stats polluted by the bulk load.

Sure!

>> buffers_checkpoint = 19046
>> buffers_clean = 2995
>
> Out of curiosity, what does buffers_backend show?

buffers_backend = 157

> In any event, this almost certainly is a red herring.

Possibly. It is pretty easy to reproduce, though.

> Whichever of the three ways are being used to write out the buffers, it
> is the checkpointer that is responsible for fsyncing them, and that is
> where your drama is almost certainly occurring. Writing out with one
> path rather than a different isn't going to change things, unless you
> change the fsync.

Well, I agree partially. ISTM that the OS does not need to wait for fsync
to start writing pages if it is receiving one minute of buffer writes at
50 writes per second; I would have thought that some scheduler should
start handling the flow before fsync... So I thought that if bgwriter were
to write the buffers it would help, but maybe there is a better way.

> Also, are you familiar with checkpoint_completion_target, and what is it
> set to?

The default 0.5. Moving to 0.9 seems to worsen the situation.

--
Fabien.


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 07:23:40
Message-ID: alpine.DEB.2.10.1408260845110.4394@sto
Lists: pgsql-hackers


Hello Amit,

> I think another thing to know here is why exactly checkpoint
> storm is causing tps to drop so steeply.

Yep. Actually it is not strictly 0, but a "few" tps that I rounded to 0.

progress: 63.0 s, 47.0 tps, lat 2.810 ms stddev 5.194, lag 0.354 ms
progress: 64.1 s, 11.9 tps, lat 81.481 ms stddev 218.026, lag 74.172 ms
progress: 65.2 s, 1.9 tps, lat 950.455 ms stddev 125.602, lag 1421.203 ms
progress: 66.1 s, 4.5 tps, lat 604.790 ms stddev 440.073, lag 2418.128 ms
progress: 67.2 s, 6.0 tps, lat 322.440 ms stddev 68.276, lag 3146.302 ms
progress: 68.0 s, 2.4 tps, lat 759.509 ms stddev 62.141, lag 4229.239 ms
progress: 69.4 s, 3.6 tps, lat 440.335 ms stddev 369.207, lag 4842.371 ms

Transactions are 4.8 seconds behind schedule at this point.

> One reason could be that backends might need to write more WAL due
> Full_Page_Writes, another could be contention around buffer
> content_lock. To dig more about the reason, the same tests can be tried
> by making Full_Page_Writes = off and/or synchronous_commit = off to see
> if WAL writes are causing tps to go down.

Given the small flow of updates, I do not think that there should be any
reason to get that big a write contention between WAL & checkpoint.

I tried with "full_page_writes = off" for 500 seconds: same overall
behavior, 8.5% of transactions are stuck (instead of 10%). However, in
detail pg_stat_bgwriter is quite different:

buffers_checkpoint = 13906
buffers_clean = 20748
buffers_backend = 472

That seems to suggest that bgwriter did some stuff for once, but it did
not change the result much in the end. This would imply that my suggestion
to make bgwriter write more would not fix the problem alone.

With "synchronous_commit = off", the situation is much improved, with only
0.3% of transactions stuck. Not a surprise. However, I would not recommend
that as a solution:-)

Currently, the only way I was able to "solve" the issue while still
writing to disk is to send "CHECKPOINT" every 0.2s, as if I had set
"checkpoint_timeout = 0.2s" (although this is not currently allowed).

> Similarly for checkpoints, use checkpoint_completion_target to
> spread the checkpoint_writes as suggested by Jeff as well to see
> if that can mitigate the problem.

I had already tried that, and retried after Jeff's suggestion. It does not
seem to mitigate anything, on the contrary.

--
Fabien.


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 08:02:59
Message-ID: alpine.DEB.2.10.1408260929260.4394@sto
Lists: pgsql-hackers


Hello again,

>> I have not found any mean to force bgwriter to send writes when it can.
>> (Well, I have: create a process which sends "CHECKPOINT" every 0.2
>> seconds... it works more or less, but this is not my point:-)
>
> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
> circle the buffer pool at least once every 2 minutes. It is currently
> fixed, but it should be trivial to turn it into an experimental guc that
> you could use to test your hypothesis.

I recompiled with the variable coldly set to 1000 instead of 120000. The
situation is slightly degraded (15% of transactions were more than 200 ms
late). However it seems that bgwriter did not write many more pages:

buffers_checkpoint = 26065
buffers_clean = 5263
buffers_backend = 367

Or I may have a problem interpreting pg_stat_bgwriter.

It seems that changing this value is not enough to persuade bgwriter to
write more pages. Or I may have done something wrong, but I do not know
what.

--
Fabien.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 08:18:18
Message-ID: 20140826081818.GF21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-26 08:12:48 +0200, Fabien COELHO wrote:
> As for checkpoint spreading, raising checkpoint_completion_target to 0.9
> degrades the situation (20% of transactions are more than 200 ms late
> instead of 10%, bgwriter wrote less that 1 page per second, on on 500s run).
> So maybe there is a bug here somewhere.

What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?

Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by "time" and not by "xlog".
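
For instance:

   log_checkpoints = on    # in postgresql.conf, then "pg_ctl reload"
                           # (or SELECT pg_reload_conf();)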

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 08:25:29
Message-ID: alpine.DEB.2.10.1408261012210.4394@sto
Lists: pgsql-hackers


Hello Andres,

>> checkpoint when the segments are full... the server is unresponsive about
>> 10% of the time (one in ten transaction is late by more than 200 ms).
>
> That's ext4 I guess?

Yes!

> Did you check whether xfs yields a, err, more predictable performance?

No. I cannot test that easily without reinstalling the box. I did some
quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in
the very same conditions. Maybe I could try again.

> [...] Note that it would *not* be a good idea to make the bgwriter write
> out everything, as much as possible - that'd turn sequential write io
> into random write io.

Hmmm. I'm not sure it would necessarily be the case; it depends on how
bgwriter would choose the pages to write. If they are chosen randomly then
indeed that could be bad. If there is a big sequential write, should not
the backend do the write directly anyway? ISTM that currently a checkpoint
is mostly random writes anyway, at least with the OLTP write load of
pgbench. I'm just trying to be able to start them earlier so that they can
be completed quickly.

So although bgwriter is not the solution, ISTM that pg has no reason to
wait for minutes before starting to write dirty pages if it has nothing
else to do. If the OS does some retention later and cannot spread the
load, as Josh suggests, this could also be a problem, but currently the OS
seems not to have much to write (other than WAL) until the checkpoint.

--
Fabien.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 08:34:46
Message-ID: 20140826083446.GG21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
> >Did you check whether xfs yields a, err, more predictable performance?
>
> No. I cannot test that easily without reinstalling the box. I did some quick
> tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very
> same conditions. Maybe I could try again.

After Robert and I went to LSF/MM this spring I sent a test program for
precisely this problem, and while it could *crash* machines when using
ext4, xfs yielded much more predictable performance. There's a problem
with prioritization of write vs. read IO that's apparently FS dependent.

> >[...] Note that it would *not* be a good idea to make the bgwriter write
> >out everything, as much as possible - that'd turn sequential write io into
> >random write io.
>
> Hmmm. I'm not sure it would be necessary the case, it depends on how
> bgwriter would choose the pages to write? If they are chosen randomly then
> indeed that could be bad.

They essentially have to be random to fulfill the bgwriter's role of
reducing the likelihood of a backend having to write out a buffer itself.
Consider how the clock sweep algorithm (not that I am happy with it) works.
When looking for a new victim buffer all backends scan the buffer cache in
one continuous cycle. If they find a buffer with a usagecount==0 they'll
use that one and throw away its contents. Otherwise they reduce the
usagecount by 1 and move on. What the bgwriter *tries* to do is to write
out buffers with usagecount==0 that are dirty and will soon be visited
in the clock cycle, to avoid having the backends do that.

> If there is a big sequential write, should not the
> backend do the write directly anyway? ISTM that currently checkpoint is
> mostly random writes anyway, at least with the OLTP write load of pgbench.
> I'm just trying to be able to start them ealier so that they can be
> completed quickly.

If the IO scheduling worked - which it really doesn't in many cases -
there'd really be no need to make it finish fast. I think you should try
to tune spread checkpoints to have less impact, not make bgwriter do
something it's not written for.

> So although bgwriter is not the solution, ISTM that pg has no reason to wait
> for minutes before starting to write dirty pages, if it has nothing else to
> do.

That precisely *IS* a spread checkpoint.

> If the OS does some retention later and cannot spread the load, as Josh
> suggest, this could also be a problem, but currently the OS seems not to
> have much to write (but WAL) till the checkpoint.

The actual problem is that the writes by the checkpointer - done in the
background - aren't flushed out of the OS's page cache eagerly enough.
Then, when the final phase of the checkpoint comes, where relation files
need to be fsynced, some filesystems essentially stall while trying to
write out lots and lots of dirty buffers.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 08:49:31
Message-ID: alpine.DEB.2.10.1408261026060.4394@sto
Lists: pgsql-hackers


> What are the other settings here? checkpoint_segments,
> checkpoint_timeout, wal_buffers?

They simply are the defaults:

checkpoint_segments = 3
checkpoint_timeout = 5min
wal_buffers = -1

I did some tests with checkpoint_segments = 1: the problem is just more
frequent but shorter. I also reduced wal_segsize down to 1MB, which made
it even more frequent but much shorter, so the overall result was an
improvement, with 3% to 5% of transactions lost instead of 10-14%, if I
recall correctly. I have found no solution on this path.

> Could you show the output of log_checkpoints during that run? Checkpoint
> spreading only works halfway efficiently if all checkpoints are
> triggered by "time" and not by "xlog".

I do 500 second tests, so there could be at most 2 timeout-triggered
checkpoints. Given the write load, it takes about 2 minutes to fill the 3
16 MB segments (8 kB * 50 tps (there is one page modified per transaction)
* 120 s ~ 48 MB), so checkpoints are triggered by xlog. The maths are
consistent with the logs (not sure which proves which, though:-):

LOG: received SIGHUP, reloading configuration files
LOG: parameter "log_checkpoints" changed to "on"
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 5713 buffers (34.9%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=51.449 s, sync=4.857 s,
total=56.485 s; sync files=12, longest=2.160 s, average=0.404 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6235 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.500 s, sync=5.102 s,
total=58.670 s; sync files=8, longest=2.689 s, average=0.637 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6250 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.888 s, sync=4.504 s,
total=58.495 s; sync files=8, longest=2.627 s, average=0.563 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6148 buffers (37.5%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=53.313 s, sync=6.437 s,
total=59.834 s; sync files=8, longest=3.680 s, average=0.804 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6240 buffers (38.1%); 0 transaction log
file(s) added, 0 removed, 3 recycled; write=149.008 s, sync=5.448 s,
total=154.566 s; sync files=9, longest=3.788 s, average=0.605 s

Note that my current effective solution is to do as if
"checkpoint_timeout = 0.2s": it works fine if I do my own spreading.

--
Fabien.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 08:57:05
Message-ID: 20140826085705.GH21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-26 10:49:31 +0200, Fabien COELHO wrote:
>
> >What are the other settings here? checkpoint_segments,
> >checkpoint_timeout, wal_buffers?
>
> They simply are the defaults:
>
> checkpoint_segments = 3
> checkpoint_timeout = 5min
> wal_buffers = -1
>
> I did some test checkpoint_segments = 1, the problem is just more frequent
> but shorter. I also reduced wal_segsize down to 1MB, which also made it even
> more frequent but much shorter, so the overall result was an improvement
> with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly.
> I have found no solution on this path.

Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoint_segments setting? To
achieve high performance you likely will have to make checkpoint_timeout
*longer* and increase checkpoint_segments until *all* checkpoints are
started because of "time".
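
As a purely illustrative starting point (to be sized so that no checkpoint
is ever triggered by "xlog"):

   checkpoint_segments = 64
   checkpoint_timeout = 30min
   checkpoint_completion_target = 0.9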

There are three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.

b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reductions of *more* than a factor of 4.

c) If checkpoints are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 09:34:36
Message-ID: alpine.DEB.2.10.1408261108450.7535@sto
Lists: pgsql-hackers


> Uh. I'm not surprised you're facing utterly horrible performance with
> this. Did you try using a *large* checkpoints_segments setting? To
> achieve high performance

I do not seek "high performance" per se, I seek "lower maximum latency".

I think that the current settings and parameters are designed for high
throughput, but do not allow controlling the latency even with a small
load.

> you likely will have to make checkpoint_timeout *longer* and increase
> checkpoint_segments until *all* checkpoints are started because of
> "time".

Well, as I want to test a *small* load in a *reasonable* time, I did
not enlarge the number of segments, otherwise it would take ages.

If I put a "checkpoint_timeout = 1min" and "checkpoint_completion_target =
0.9" so that the checkpoints are triggered by the timeout,

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
...

The result is basically the same (well, 18% of transactions lost, but the
results do not seem to be stable from one run to the next), only there are
more checkpoints.

I fail to understand how multiplying both the segments and the time would
solve the latency problem. If I set 30 segments then it takes 20 minutes
to fill them, and if I put the timeout to 15min then I'll have to wait for
15 minutes to test.

> There's three reasons:
> a) if checkpoint_timeout + completion_target is large and the checkpoint
> isn't executed prematurely, most of the dirty data has been written out
> by the kernel's background flush processes.

Why would they be written by the kernel if bgwriter has not sent them??

> b) The amount of WAL written with less frequent checkpoints is often
> *significantly* lower because fewer full page writes need to be
> done. I've seen production reduction of *more* than a factor of 4.

Sure, I understand that, but ISTM that this test does not exercise this
issue: the load is small, so the full page writes do not matter much.

> c) If checkpoint's are infrequent enough, the penalty of them causing
> problems, especially if not using ext4, plays less of a role overall.

I think that what you suggest would only delay the issue, not solve it.

I'll try to run a long test.

--
Fabien.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 09:58:00
Message-ID: 20140826095800.GK21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:
>
> >Uh. I'm not surprised you're facing utterly horrible performance with
> >this. Did you try using a *large* checkpoints_segments setting? To
> >achieve high performance
>
> I do not seek "high performance" per se, I seek "lower maximum latency".

So?

> I think that the current settings and parameters are designed for high
> throughput, but do not allow to control the latency even with a small load.

The way you're setting them is tuned for 'basically no write
activity'.

> >you likely will have to make checkpoint_timeout *longer* and increase
> >checkpoint_segments until *all* checkpoints are started because of "time".
>
> Well, as I want to test a *small* load in a *reasonable* time, so I did not
> enlarge the number of segments, otherwise it would take ages.

Well, that way you're testing something basically meaningless. That's
not helpful either.

> If I put a "checkpoint_timeout = 1min" and "checkpoint_completion_target =
> 0.9" so that the checkpoints are triggered by the timeout,
>
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
> file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
> total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
> ...
>
> The result is basically the same (well 18% transactions lost, but the result
> do not seem to be stable one run after the other), only there are more
> checkpoints.

With these settings you're fsyncing the entire data directory once a
minute. Nearly entirely from the OS's buffer cache, because the OS's
writeback logic didn't have time to kick in.

> I fail to understand how multiplying both the segments and time would solve
> the latency problem. If I set 30 segments than it takes 20 minutes to fill
> them, and if I put timeout to 15min then I'll have to wait for 15 minutes to
> test.

a) The kernel's writeback logic only kicks in with delay. b) The amount
of writes you're doing with short checkpoint intervals is overall
significantly higher than with longer intervals. That obviously has an
impact on latency as well as throughput. c) The time it takes for
segments to be filled is mostly irrelevant. The phase that's very likely
causing troubles for you is the fsyncs issued at the end of a
checkpoint.
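
On Linux that delay comes from the vm.dirty_* knobs: with common defaults,
background writeback only starts once dirty data is about 30 s old
(dirty_expire_centisecs) or exceeds dirty_background_ratio of RAM. They can
be inspected with something like:

   sysctl vm.dirty_background_ratio vm.dirty_ratio \
          vm.dirty_expire_centisecs vm.dirty_writeback_centisecs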

> >There's three reasons:
> >a) if checkpoint_timeout + completion_target is large and the checkpoint
> >isn't executed prematurely, most of the dirty data has been written out
> >by the kernel's background flush processes.
>
> Why would they be written by the kernel if bgwriter has not sent them??

I think you're misunderstanding how spread checkpoints work. When the
checkpointer process starts a spread checkpoint it first writes all
buffers to the kernel in a paced manner. That pace is determined by
checkpoint_completion_target and checkpoint_timeout. Once all buffers
that are old enough to need to be checkpointed have been written out, the
checkpointer fsync()s all the on-disk files. That part is *NOT*
paced. Then it can go on to remove old WAL files.

The latency problem is almost guaranteed to be created by the fsync()s
mentioned above. When they're executed the kernel starts flushing out a
lot of dirty buffers at once - creating very deep IO queues, which makes
it take a long time to process synchronous additions (WAL flushes, reads)
to that queue.

> >c) If checkpoint's are infrequent enough, the penalty of them causing
> >problems, especially if not using ext4, plays less of a role overall.
>
> I think that what you suggest would only delay the issue, not solve it.

The amount of dirty data that needs to be flushed is essentially
bounded. If you have stalls of roughly the same magnitude (say within a
factor of two), the smaller one once a minute and the larger one once an
hour, obviously the once-an-hour one will give better latency for many,
many more transactions.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 04:04:05
Message-ID: CAA4eK1LO86acdj6z8Xg0OkLXe-dHqCVGzsfjhUyJ1tTXUFgvkw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 26, 2014 at 12:53 PM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
>
> Given the small flow of updates, I do not think that there should be
> reason to get that big a write contention between WAL & checkpoint.
>
> If tried with "full_page_write = off" for 500 seconds: same overall
> behavior, 8.5% of transactions are stuck (instead of 10%). However in
> details pg_stat_bgwriter is quite different:
>
> buffers_checkpoint = 13906
> buffers_clean = 20748
> buffers_backend = 472
>
> That seems to suggest that bgwriter did some stuff for once, but that did
> not change much the result in the end. This would imply that my suggestion
> to make bgwriter write more would not fix the problem alone.

I think the reason could be that in most cases bgwriter
passes the sync responsibility to the checkpointer rather than
doing it by itself, which causes the IO storm during checkpoint;
another thing is that it will not even proceed to write
a dirty buffer unless the refcount and usage_count of the
buffer are zero. I see there is some merit in your point, which
is to make bgwriter more useful than its current form.
I can see 3 top-level points to think about, to see whether improvement
in any of them can improve the current situation:
a. Scanning of the buffer pool to find the dirty buffers that can
be flushed.
b. Deciding on what the criteria are to flush a buffer.
c. Sync of buffers.

> With "synchronous_commit = off", the situation is much improved, with
> only 0.3% of transactions stuck. Not a surprise. However, I would not
> recommand that as a solution:-)

Yeah, actually it was just to test what the actual problem is.
I will also not recommend such a solution; however it gives
us at least the indication that, due to the IO in checkpoint, backends
are not even able to flush comparatively small WAL data.
How about keeping WAL (pg_xlog) on a separate filesystem,
maybe via creating a symlink or, if possible, with the -X option of initdb?
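
For illustration (paths are placeholders), either at cluster creation time:

   initdb -D /path/to/data -X /other/disk/pg_xlog

or for an existing, stopped cluster:

   mv /path/to/data/pg_xlog /other/disk/pg_xlog
   ln -s /other/disk/pg_xlog /path/to/data/pg_xlog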

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 07:32:16
Message-ID: alpine.DEB.2.10.1408261739280.8876@sto
Lists: pgsql-hackers


Hello Andres,

> [...]
> I think you're misunderstanding how spread checkpoints work.

Yep, definitely:-) On the other hand I thought I was seeking something
"simple", namely correct latency under small load, which I would expect
out of the box.

What you describe is reasonable, and is more or less what I was hoping
for, although I thought that bgwriter was involved from the start and
checkpoint would only do what is needed in the end. My mistake.

> When the checkpointer process starts a spread checkpoint it first writes
> all buffers to the kernel in a paced manner.
> That pace is determined by checkpoint_completion_target and
> checkpoint_timeout.

This pacing does not seem to work, even at a slow pace.

> If you have a stall of roughly the same magnitude (say a factor
> of two different), the smaller once a minute, the larger once an
> hour. Obviously the once-an-hour one will have a better latency in many,
> many more transactions.

I do not believe that delaying writing to disk as much as possible is a
viable strategy for handling a small load. However, to show my good will, I
have tried to follow your advice: I've launched a 5000 second test with 50
segments, 30 min timeout, 0.9 completion target, at 25 tps, which is less
than 1/10 of the maximum throughput.

There are only two time-triggered checkpoints:

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 48725 buffers (47.6%);
1 transaction log file(s) added, 0 removed, 0 recycled;
write=1619.750 s, sync=27.675 s, total=1647.932 s;
sync files=14, longest=27.593 s, average=1.976 s

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 22533 buffers (22.0%);
0 transaction log file(s) added, 0 removed, 23 recycled;
write=826.919 s, sync=9.989 s, total=837.023 s;
sync files=8, longest=6.742 s, average=1.248 s

For the first one, 48725 buffers is 380MB. 1800 * 0.9 = 1620 seconds to
complete, so it means 30 buffer writes per second... should be ok. However
sync costs 27 seconds nevertheless, and the server was more or less
offline for about 30 seconds flat. For the second one, 180 MB to write, 10
seconds offline. For some reason the target time is reduced. I have also
tried with the "deadline" IO scheduler, which makes more sense than the
default "cfq", but the result was similar. Not sure how software RAID
interacts with IO scheduling, though.

Overall result: over the 5000s test, I have lost (i.e. more than 200ms
behind schedule) more than 2.5% of transactions (1/40). Due to the
unfinished cycle, the long term average is probably about 3%. Although it
is better than 10%, it is not good. I would expect/hope for something
pretty close to 0, even with ext4 on Linux, for a dedicated host which has
nothing else to do but handle two dozen transactions per second.

Current conclusion: I have not found any way to improve the situation to
"good" with parameters from the configuration. Currently a small load
results in periodic offline time, which can be delayed but not avoided. The
delaying tactic results in less frequent but longer downtime. I prefer
frequent but very short downtime instead.

I really think that something is amiss. Maybe pg does not handle pacing as
it should.

For the record, a 25 tps bench with a "small" config (default 3 segments,
5min timeout, 0.5 completion target) and with, running in parallel:

while true ; do echo "CHECKPOINT;"; sleep 0.2s; done | psql

results in "losing" only 0.01% of transactions (12 transactions out of
125893 were behind by more than 200ms in 5000 seconds). Although you may
think it stupid, from my point of view it shows that it is possible to
coerce pg to behave.

With respect to the current status:

(1) the ability to put checkpoint_timeout to values smaller than 30s could
help, although obviously there would be other consequences. But the
ability to avoid periodic offline time looks like a desirable objective.

(2) I still think that a parameter to force bgwriter to write more stuff
could help, but this is not tested.

(3) Any other effective idea to configure for responsiveness is welcome!

If someone wants to repeat these tests, it is easy and only takes a few
minutes:

sh> createdb test
sh> pgbench -i -s 100 -F 95 test
sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out

Note: the -L option to limit latency is a submitted patch. Without it,
unresponsiveness shows up as increasing lag time.

--
Fabien.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 08:30:26
Message-ID: 20140827083026.GB21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-27 09:32:16 +0200, Fabien COELHO wrote:
>
> Hello Andres,
>
> >[...]
> >I think you're misunderstanding how spread checkpoints work.
>
> Yep, definitely:-) On the other hand I though I was seeking something
> "simple", namely correct latency under small load, that I would expect out
> of the box.

Yea. The current situation *sucks*. Both from the utterly borked
behaviour of ext4 and other filesystems and the lack of workaround from postgres.

> >When the checkpointer process starts a spread checkpoint it first writes
> >all buffers to the kernel in a paced manner.
> >That pace is determined by checkpoint_completion_target and
> >checkpoint_timeout.
>
> This pacing does not seem to work, even at slow pace.

It definitely does in some cases. What's your evidence the pacing
doesn't work? Afaik it's the fsync that causes the problem, not the
writes themselves.

> >If you have a stall of roughly the same magnitude (say a factor
> >of two different), the smaller once a minute, the larger once an
> >hour. Obviously the once-an-hour one will have a better latency in many,
> >many more transactions.
>
> I do not believe in delaying as much as possible writing do disk to handle a
> small load as a viable strategy. However, to show my good will, I have
> tried to follow your advices: I've launched a 5000 seconds test with 50
> segments, 30 min timeout, 0.9 completion target, at 25 tps, which is less
> than 1/10 of the maximum throughput.
>
> There are only two time-triggered checkpoints:
>
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 48725 buffers (47.6%);
> 1 transaction log file(s) added, 0 removed, 0 recycled;
> write=1619.750 s, sync=27.675 s, total=1647.932 s;
> sync files=14, longest=27.593 s, average=1.976 s
>
> LOG: checkpoint starting: time
> LOG: checkpoint complete: wrote 22533 buffers (22.0%);
> 0 transaction log file(s) added, 0 removed, 23 recycled;
> write=826.919 s, sync=9.989 s, total=837.023 s;
> sync files=8, longest=6.742 s, average=1.248 s

The write pacing itself doesn't seem to be bad. The bad thing is the
'sync' times here. Those are *NOT* paced, and the kernel probably has
delayed flushing out much of the writes...

> (1) the ability to put checkpoint_timeout to values smaller than 30s could
> help, although obviously there would be other consequences. But the ability
> to avoid periodic offline time looks like a desirable objective.

I'd rather not do that. It's an utterly horrible hack to go this route.

> (2) I still think that a parameter to force bgwriter to write more stuff
> could help, but this is not tested.

It's going to be random writes. That's not going to be helpful.

> (3) Any other effective idea to configure for responsiveness is welcome!

I've a couple of ideas how to improve the situation, but so far I've not
had the time to investigate them properly. Would you be willing to test
a couple of simple patches?

Did you test xfs already?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 09:05:52
Message-ID: alpine.DEB.2.10.1408271035050.8876@sto
Lists: pgsql-hackers


> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
> that causes the problem, not the the writes themselves.

Hmmm. My (poor) understanding is that fsync would work fine if everything
was already written beforehand:-) that is, it would have nothing to do but
check that all is already written. If there is remaining write work, it
starts doing it "now", with the disastrous effects I'm complaining about.

When I say "pacing does not work", I mean that things were not written
out to disk by the OS; it does not mean that pg did not ask for it.

However it does not make much sense for an OS scheduler to wait several
minutes with tens of thousands of pages to write and do nothing about
it... So I'm wondering.

> [...]
>> (1) the ability to put checkpoint_timeout to values smaller than 30s could
>> help, although obviously there would be other consequences. But the ability
>> to avoid periodic offline time looks like a desirable objective.
>
> I'd rather not do that. It's a utterly horrible hack to go this write.

Hmmm. It does solve the issue, though:-) It would be the administrator's
choice. It is better than nothing, which is the current status.

>> (2) I still think that a parameter to force bgwriter to write more stuff
>> could help, but this is not tested.
>
> It's going to be random writes. That's not going to be helpful.

The -N small OLTP load on a large (GB) table *is* random writes anyway,
whether they occur at checkpoint or at any other time. Random writes are
fine in this case: the load is small, so there should be no problem.

>> (3) Any other effective idea to configure for responsiveness is
>> welcome!
>
> I've a couple of ideas how to improve the situation, but so far I've not
> had the time to investigate them properly. Would you be willing to test
> a couple of simple patches?

I can test a couple of patches. I already tried one on someone's advice
(making bgwriter scan the whole pool in 1s instead of 120s), without any
positive effect.

> Did you test xfs already?

No. I cannot without reinstalling, which I cannot do on a remote host, and
I will probably not have time to do it when I have physical access.
Only one partition on the host. My mistake. Will not do it again. Shame on
me.

If someone out there has an XFS setup, it is very easy to test and only
takes a couple of minutes, really. It takes less time to do it than to
write a mail about it afterwards:-)

I have tested FreeBSD/UFS with similar results: a few periodic offlines.
The UFS journaled file system is probably not ideal for database work, but
then again the load is small; it should be able to cope without going
offline.

--
Fabien.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 09:14:46
Message-ID: 20140827091446.GC21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
> I can test a couple of patches. I already did one on someone advice (make
> bgwriter round all stuff in 1s instead of 120s, without positive effect.

I've quickly cobbled together the attached patch (which at least doesn't
seem to crash & burn). It tries to trigger pages being flushed out
during the paced phase of checkpoints instead of the fsync phase. The
sync_on_checkpoint_flush setting can be used to enable/disable that
behaviour.

I'd be interested to hear whether that improves your latency numbers. I
unfortunately don't have more time to spend on this right now :(.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 09:19:22
Message-ID: 20140827091922.GD21544@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
> On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
> > I can test a couple of patches. I already did one on someone advice (make
> > bgwriter round all stuff in 1s instead of 120s, without positive effect.
>
> I've quickly cobbled together the attached patch (which at least doesn't
> seem to crash & burn). It tries to trigger pages being flushed out
> during the paced phase of checkpoints instead of the fsync phase. The
> sync_on_checkpoint_flush can be used to enable/disable that behaviour.
>
> I'd be interested to hear whether that improves your latency numbers. I
> unfortunately don't have more time to spend on this right now :(.

And actually attached. Note that it's linux only...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
force-flush-during-buffer-sync.diff text/x-diff 10.4 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 10:28:43
Message-ID: 20140827102843.GE21544@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 11:19:22 +0200, Andres Freund wrote:
> On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
> > On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
> > > I can test a couple of patches. I already did one on someone advice (make
> > > bgwriter round all stuff in 1s instead of 120s, without positive effect.
> >
> > I've quickly cobbled together the attached patch (which at least doesn't
> > seem to crash & burn). It tries to trigger pages being flushed out
> > during the paced phase of checkpoints instead of the fsync phase. The
> > sync_on_checkpoint_flush can be used to enable/disable that behaviour.
> >
> > I'd be interested to hear whether that improves your latency numbers. I
> > unfortunately don't have more time to spend on this right now :(.
>
> And actually attached. Note that it's linux only...

I got curious and ran a quick test:

config:
log_checkpoints=on
checkpoint_timeout=1min
checkpoint_completion_target=0.95
checkpoint_segments=100
synchronous_commit=on
fsync=on
huge_pages=on
max_connections=200
shared_buffers=6GB
wal_level=hot_standby

off:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 20189
latency average: 23.136 ms
latency stddev: 59.044 ms
rate limit schedule lag: avg 4.599 (max 199.975) ms
number of skipped transactions: 1345 (6.246 %)
tps = 167.664290 (including connections establishing)
tps = 167.675679 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 12754 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=56.928 s, sync=3.639 s, total=60.749 s; sync files=20, longest=2.741 s, average=0.181 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 12269 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=20.701 s, sync=8.568 s, total=29.444 s; sync files=10, longest=3.568 s, average=0.856 s

on:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 21327
latency average: 20.735 ms
latency stddev: 14.643 ms
rate limit schedule lag: avg 4.965 (max 185.003) ms
number of skipped transactions: 1 (0.005 %)
tps = 177.214391 (including connections establishing)
tps = 177.225476 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 12217 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=57.022 s, sync=0.203 s, total=57.377 s; sync files=19, longest=0.033 s, average=0.010 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 13185 buffers (1.7%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=56.628 s, sync=0.019 s, total=56.803 s; sync files=11, longest=0.017 s, average=0.001 s

That machine is far from idle right now, so the noise is pretty
high. But rather nice initial results.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 12:29:05
Message-ID: alpine.DEB.2.10.1408271413290.8876@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello Amit,

> I see there is some merit in your point which is to make bgwriter more
> useful than its current form. I could see 3 top level points to think
> about whether improvement in any of those can improve the current
> situation:

> a. Scanning of buffer pool to find the dirty buffers that can
> be flushed.

Yep, that could be more aggressive, or the aggressiveness could be made
into an adjustable parameter. There is a comment about that in the source.
However, I tested setting the round to 1s instead of 120s, and it had no
positive effect on my test.

> b. Deciding on what is criteria to flush a buffer

Indeed.

For instance, when the IO load is low, there is no reason not to send out
some buffers; it is a free lunch even if some of them must be written again
later. If the load is high, that is another matter. So it would mean being
able to decide whether the current IO load is low or not, and it could
differ from one disk to another... Hmmm, not that simple.

> c. Sync of buffers

Yes. That is more or less the effect of my rough approach of calling
CHECKPOINT every 0.2 seconds. Andres Freund just implemented a quick
Linux-specific patch which does that within checkpoint pacing, and which
greatly improves the situation, although it could still be a little
better.

> [separate xlog test]

I do not have a setup available for that right now. Not sure this would be
an issue for low loads.

--
Fabien.


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 13:10:49
Message-ID: CAGTBQpaNm6WTDzJNxNeXv9SKn5g6-8DSQ0MV-TWBW-6_8On2zw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
>> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>> that causes the problem, not the the writes themselves.
>
>
> Hmmm. My (poor) understanding is that fsync would work fine if everything
> was already written beforehand:-) that is it has nothing to do but assess
> that all is already written. If there is remaining write work, it starts
> doing it "now" with the disastrous effects I'm complaining about.
>
> When I say "pacing does not work", I mean that things where not written out
> to disk by the OS, it does not mean that pg did not ask for it.
>
> However it does not make much sense for an OS scheduler to wait several
> minutes with tens of thousands of pages to write and do nothing about it...
> So I'm wondering.

Maybe what's needed, is to slightly tweak checkpoint logic to give the
kernel some time to flush buffers.

Correct me if I'm wrong, but the checkpointer does the sync right
after the reads. Of course there will be about 30s-worth of
accumulated writes (it's the default amount of time the kernel holds
on to dirty buffers).

Perhaps it should be delayed a small time, say 30s, to let the kernel
do the writing on its own.


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 13:12:09
Message-ID: CAGTBQpY=nctyXz7gQZRwrkiCzJ_yfUgkAPGzKm+hpjvJ+5NqtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 27, 2014 at 10:10 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
>>> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>>> that causes the problem, not the the writes themselves.
>>
>>
>> Hmmm. My (poor) understanding is that fsync would work fine if everything
>> was already written beforehand:-) that is it has nothing to do but assess
>> that all is already written. If there is remaining write work, it starts
>> doing it "now" with the disastrous effects I'm complaining about.
>>
>> When I say "pacing does not work", I mean that things where not written out
>> to disk by the OS, it does not mean that pg did not ask for it.
>>
>> However it does not make much sense for an OS scheduler to wait several
>> minutes with tens of thousands of pages to write and do nothing about it...
>> So I'm wondering.
>
> Maybe what's needed, is to slightly tweak checkpoint logic to give the
> kernel some time to flush buffers.
>
> Correct me if I'm wrong, but the checkpointer does the sync right
> after the reads. Of course there will be about 30s-worth of
> accumulated writes (it's the default amount of time the kernel holds
> on to dirty buffers).
>
> Perhaps it should be delayed a small time, say 30s, to let the kernel
> do the writing on its own.

Errata: just after the writes :-p


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 13:15:07
Message-ID: 20140827131507.GH21544@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
> >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
> >> that causes the problem, not the the writes themselves.
> >
> >
> > Hmmm. My (poor) understanding is that fsync would work fine if everything
> > was already written beforehand:-) that is it has nothing to do but assess
> > that all is already written. If there is remaining write work, it starts
> > doing it "now" with the disastrous effects I'm complaining about.
> >
> > When I say "pacing does not work", I mean that things where not written out
> > to disk by the OS, it does not mean that pg did not ask for it.
> >
> > However it does not make much sense for an OS scheduler to wait several
> > minutes with tens of thousands of pages to write and do nothing about it...
> > So I'm wondering.
>
> Maybe what's needed, is to slightly tweak checkpoint logic to give the
> kernel some time to flush buffers.
>
> Correct me if I'm wrong, but the checkpointer does the sync right
> after the reads. Of course there will be about 30s-worth of
> accumulated writes (it's the default amount of time the kernel holds
> on to dirty buffers).
>
> Perhaps it should be delayed a small time, say 30s, to let the kernel
> do the writing on its own.

The kernel *starts* to write out pages after 30s, it doesn't finish
doing so. So I don't think that's going to work.

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is more likely to be successful.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 13:17:06
Message-ID: CAGTBQpav49A0Hq9y=DZ2LV+LoJgXvN2tcLnXG-ATK+O0VzSuUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 27, 2014 at 10:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
>> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
>> >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>> >> that causes the problem, not the the writes themselves.
>> >
>> >
>> > Hmmm. My (poor) understanding is that fsync would work fine if everything
>> > was already written beforehand:-) that is it has nothing to do but assess
>> > that all is already written. If there is remaining write work, it starts
>> > doing it "now" with the disastrous effects I'm complaining about.
>> >
>> > When I say "pacing does not work", I mean that things where not written out
>> > to disk by the OS, it does not mean that pg did not ask for it.
>> >
>> > However it does not make much sense for an OS scheduler to wait several
>> > minutes with tens of thousands of pages to write and do nothing about it...
>> > So I'm wondering.
>>
>> Maybe what's needed, is to slightly tweak checkpoint logic to give the
>> kernel some time to flush buffers.
>>
>> Correct me if I'm wrong, but the checkpointer does the sync right
>> after the reads. Of course there will be about 30s-worth of
>> accumulated writes (it's the default amount of time the kernel holds
>> on to dirty buffers).
>>
>> Perhaps it should be delayed a small time, say 30s, to let the kernel
>> do the writing on its own.
>
> The kernel *starts* to write out pages after 30s, it doesn't finish
> doing so. So I don't think that's going to work.
>
> I think a somewhat smarter version of the explicit flushes in the
> hack^Wpatch I posted nearby is going to more likely to be successful.

That path is "dangerous" (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 13:20:45
Message-ID: 20140827132045.GI21544@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> > I think a somewhat smarter version of the explicit flushes in the
> > hack^Wpatch I posted nearby is going to more likely to be successful.
>
>
> That path is "dangerous" (as in, may not work as intended) if the
> filesystem doesn't properly understand range flushes (ehem, like
> ext3).

The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
guaranteeing durability. And - afaik - it is not implemented in a file
system specific manner. It just initiates writeback for individual pages;
it doesn't cause barriers, journal flushes or anything else to be issued.
That's still done by the fsync() later.

The big disadvantage is that it's an OS-specific solution, but I don't
think we're going to find anything that isn't in this area.
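
Roughly, as a standalone illustration (a toy sketch, not the attached
patch; the file name and block counts are made up), the pattern looks
like this: push each write to the kernel as it is issued, and let only
the final fsync() provide durability:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ  8192
#define NBLOCKS 1024

int main(void)
{
    char block[BLCKSZ];
    int  fd = open("relation_segment.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0) { perror("open"); return 1; }
    memset(block, 'x', sizeof(block));

    for (int i = 0; i < NBLOCKS; i++)
    {
        if (pwrite(fd, block, BLCKSZ, (off_t) i * BLCKSZ) != BLCKSZ)
        { perror("pwrite"); return 1; }

        /* Ask the kernel to start writing this page back now; no barrier
         * or journal flush is issued, so this is comparatively cheap. */
        if (sync_file_range(fd, (off_t) i * BLCKSZ, BLCKSZ,
                            SYNC_FILE_RANGE_WRITE) != 0)
        { perror("sync_file_range"); return 1; }
    }

    /* Durability still comes from the final fsync(), which should now
     * find little left to write. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    close(fd);
    return 0;
}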

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 13:50:02
Message-ID: CAGTBQpbQK8i+U67eu3=amo=czpHG6tPX+RVRpg-sDRE6G3cycQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 27, 2014 at 10:20 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>> > I think a somewhat smarter version of the explicit flushes in the
>> > hack^Wpatch I posted nearby is going to more likely to be successful.
>>
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> guaranteeing durability. And - afaik - not implemented in a file system
> specific manner. It just initiates writeback for individual pages. It
> doesn't cause barrier, journal flushes or anything to be issued. That's
> still done by the fsync() later.
>
> The big disadvantage is that it's a OS specific solution, but I don't
> think we're going to find anything that isn't in this area.

I guess it should work then.


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 14:32:19
Message-ID: CAC_2qU9ePZEh=T5Jm8HonrwT2BZ-j-7FQvXv6dV4Zm6yr1OqPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:

>
> Hello Andres,
>
> [...]
>>
>> I think you're misunderstanding how spread checkpoints work.
>>
>
> Yep, definitely:-) On the other hand I though I was seeking something
> "simple", namely correct latency under small load, that I would expect out
> of the box.
>
> What you describe is reasonable, and is more or less what I was hoping
> for, although I thought that bgwriter was involved from the start and
> checkpoint would only do what is needed in the end. My mistake.
>
>
If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?

a.

--
Aidan Van Dyk
aidan(at)highrise(dot)ca
http://www.highrise.ca/
Create like a god, command like a king, work like a slave.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 14:35:04
Message-ID: 20140827143504.GJ21544@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 10:32:19 -0400, Aidan Van Dyk wrote:
> On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
>
> >
> > Hello Andres,
> >
> > [...]
> >>
> >> I think you're misunderstanding how spread checkpoints work.
> >>
> >
> > Yep, definitely:-) On the other hand I though I was seeking something
> > "simple", namely correct latency under small load, that I would expect out
> > of the box.
> >
> > What you describe is reasonable, and is more or less what I was hoping
> > for, although I thought that bgwriter was involved from the start and
> > checkpoint would only do what is needed in the end. My mistake.
> >
> >
> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

Well. For one, that's an OS-specific global tunable requiring root to be
adjustable. For another, we actually do want some buffering: if a backend
writes out a buffer's data itself (which happens very frequently) it
*should* get buffered... So I don't think a process-independent tunable is
going to do the trick.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 14:50:24
Message-ID: alpine.DEB.2.10.1408271634540.8876@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello,

> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

I can try, when I have finished with the current round of testing.

Note that, as Andres put it, it currently "sucks". Having to tinker with
Linux kernel parameters just to handle a small load without being offline
10% of the time does not look very good, so even if it works, ISTM that a
pg-side solution is desirable.

--
Fabien.


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 16:23:04
Message-ID: 53FE05E8.2010606@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/27/2014 04:20 PM, Andres Freund wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>>> I think a somewhat smarter version of the explicit flushes in the
>>> hack^Wpatch I posted nearby is going to more likely to be successful.
>>
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> guaranteeing durability. And - afaik - not implemented in a file system
> specific manner. It just initiates writeback for individual pages. It
> doesn't cause barrier, journal flushes or anything to be issued. That's
> still done by the fsync() later.
>
> The big disadvantage is that it's a OS specific solution, but I don't
> think we're going to find anything that isn't in this area.

I've been thinking for a long time that we should interleave the writes
and the fsyncs. That still forces up to 1GB of dirty buffers to disk at
once, causing a spike, but at least not more than that. Also, the
scheduling of a spread checkpoint is currently a bit bogus; we don't
take into account the time needed for the fsync phase.

A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and
write them out in order
(http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp).
The performance impact of that was inconclusive, but one thing that it
allows nicely is to interleave the fsyncs, so that you write all the
buffers for one file, then fsync it, then the next file and so on. IIRC the
biggest worry with that patch was that sorting the buffers requires a
fairly large amount of memory, and making a large allocation in the
checkpointer might cause an out-of-memory, which would be bad.

I don't think anyone's seriously worked on this area since. If the
impact on responsiveness or performance is significant, I'm pretty sure
the OOM problem could be alleviated somehow.

For the kicks, I wrote a quick & dirty patch for interleaving the
fsyncs, see attached. It works by repeatedly scanning the buffer pool,
writing buffers belonging to a single relation segment at a time. I
would be interested to hear how this performs in your test case.
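
To make the interleaving concrete, here is a minimal standalone sketch of
the idea (not the attached patch; segment names, page counts and the
dirty-page selection are invented): write out one segment's pages, fsync
that segment, then move on, so no single fsync() finds a large backlog.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192
#define NSEGS  4
#define NPAGES 64              /* dirty pages per segment in this toy */

int main(void)
{
    char page[BLCKSZ];

    memset(page, 'x', sizeof(page));

    for (int seg = 0; seg < NSEGS; seg++)
    {
        char name[64];
        int  fd;

        snprintf(name, sizeof(name), "segment.%d.tmp", seg);
        fd = open(name, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Phase 1: write every dirty page belonging to this segment. */
        for (int blk = 0; blk < NPAGES; blk++)
            if (pwrite(fd, page, BLCKSZ, (off_t) blk * BLCKSZ) != BLCKSZ)
            { perror("pwrite"); return 1; }

        /* Phase 2: fsync this segment before touching the next one, so
         * the amount forced out at once stays bounded per file. */
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        close(fd);
    }
    return 0;
}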

- Heikki

Attachment Content-Type Size
interleave-fsyncs-1.patch text/x-diff 9.7 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 16:41:15
Message-ID: 20140827164115.GL21544@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> On 08/27/2014 04:20 PM, Andres Freund wrote:
> >On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> >>>I think a somewhat smarter version of the explicit flushes in the
> >>>hack^Wpatch I posted nearby is going to more likely to be successful.
> >>
> >>
> >>That path is "dangerous" (as in, may not work as intended) if the
> >>filesystem doesn't properly understand range flushes (ehem, like
> >>ext3).
> >
> >The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> >guaranteeing durability. And - afaik - not implemented in a file system
> >specific manner. It just initiates writeback for individual pages. It
> >doesn't cause barrier, journal flushes or anything to be issued. That's
> >still done by the fsync() later.
> >
> >The big disadvantage is that it's a OS specific solution, but I don't
> >think we're going to find anything that isn't in this area.
>
> I've been thinking for a long time that we should interleave the writes and
> the fsyncs. That still forces up to 1GB of dirty buffers to disk at once,
> causing a spike, but at least not more than that.

I think there are considerable benefits to sorting checkpoint IO by file
and offset within that file. Obviously the likelihood of sequential IO is
higher, but there's also less chance that other processes write out
dirty buffers that then have to be flushed out by the fsync() in a
drive-by manner.

I don't think it's good enough to solve the problem Fabien is talking
about though. 1GB is a heck of a lot of IO to submit at once. That'll
cause latency issues unless you have a write-back controller with more
than 1GB of cache.

So I think we need both: control over the amount of dirty data in the
kernel *and* sorted writeouts. To the point that I've been tinkering
with converting buftable.c into a radix tree. That'd allow us to
efficiently scan all buffers of a filenode in order - useful for
checkpoints, but also for lots of other things. Unfortunately our buffer
tags are freakishly huge, making the worst-case memory requirements and
the depth of the tree quite bad.

> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
> them out in order (http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp).
> The performance impact of that was inconclusive, but one thing that it
> allows nicely is to interleave the fsyncs, so that you write all the buffers
> for one file, then fsync it, then next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly large
> amount of memory, and making a large allocation in the checkpointer might
> cause an out-of-memory, which would be bad.
>
> I don't think anyone's seriously worked on this area since. If the impact on
> responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.

It's a major problem imo.

What I'd been thinking of is to do the checkpoint writeout in batches:
collect 100k buffers, sort them, write them out, go to the next 100k.
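
As a rough standalone sketch of that (the descriptor layout, batch size
and write callback are invented here; this is not a patch):

#include <stdlib.h>

typedef struct DirtyBuf
{
    int      fileno;            /* which relation segment */
    unsigned blockno;           /* block within that segment */
    int      buf_id;            /* which shared buffer holds the data */
} DirtyBuf;

static int
dirtybuf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = a, *y = b;

    if (x->fileno != y->fileno)
        return (x->fileno < y->fileno) ? -1 : 1;
    if (x->blockno != y->blockno)
        return (x->blockno < y->blockno) ? -1 : 1;
    return 0;
}

/*
 * Write "count" dirty buffers in batches of "batch_size", sorting each
 * batch by (file, block) so the IO within a batch is mostly sequential.
 */
void
checkpoint_write_batched(DirtyBuf *bufs, size_t count, size_t batch_size,
                         void (*write_one)(const DirtyBuf *))
{
    for (size_t start = 0; start < count; start += batch_size)
    {
        size_t n = count - start < batch_size ? count - start : batch_size;

        qsort(bufs + start, n, sizeof(DirtyBuf), dirtybuf_cmp);
        for (size_t i = 0; i < n; i++)
            write_one(&bufs[start + i]);
    }
}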

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I bet it's not fundamentally changing the number of transactions that
don't make the deadline - there are more than enough dirty buffers in one
1GB segment to cause issues. But I think it might already be a
significant benefit for peak throughput *and* latency if you combine it
with my approach of initiating writeout to disk during the whole
sync. My patch can slow things down considerably in the worst case by
causing superfluous random IO, which your patch should alleviate.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 17:00:12
Message-ID: alpine.DEB.2.10.1408271343140.8876@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> off:
>
> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> number of skipped transactions: 1345 (6.246 %)
>
> on:
>
> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> number of skipped transactions: 1 (0.005 %)

> That machine is far from idle right now, so the noise is pretty high.

What is the OS and FS? Could it be XFS?

> But rather nice initial results.

Indeed, I can confirm:

I did 5000s 25tps tests:
- Off: 8002 transactions lost (6.3%)
- On: 158 transactions "lost" (0.12%).

Although it is still 13 times larger than the 12 (0.01%) lost with my
every-0.2s CHECKPOINT hack, it is nevertheless much, much better than
before!

The bad news: under an unthrottled pgbench load, the tps is divided by 2
(300 -> 150, could have been worse), *BUT* it is also much smoother; the
tps does not drop to 0, but stays in the 50-100 range before the next
spike.

I'm wondering about the order of operations. It seems to me that you sync
just after giving back a buffer. Maybe it would be better to pipeline it,
that is, something like:

round 0:
send buffers 0
sleep?

round N:
sync buffers N-1
send buffers N
sleep?

final N sync:
sync buffer N

I have not found how to control the checkpoint pacing interval, if there
is such a thing. With a 200ms lag limit on pgbench, it would be nice if it
were less than 200ms.

I found this old thread "Add basic checkpoint sync spreading" by Greg
Smith and Simon Riggs, dating from 2010:
http://www.postgresql.org/message-id/4CE07548.4030709@2ndquadrant.com
https://commitfest.postgresql.org/action/patch_view?id=431 which ends up
"returned with feedback".

--
Fabien.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-27 17:07:13
Message-ID: 20140827170713.GN21544@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 19:00:12 +0200, Fabien COELHO wrote:
>
> >off:
> >
> >$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> >number of skipped transactions: 1345 (6.246 %)
> >
> >on:
> >
> >$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> >number of skipped transactions: 1 (0.005 %)
>
> >That machine is far from idle right now, so the noise is pretty high.
>
> What is the OS and FS? Could it be XFS?

That's linux v3.17-rc2 + ext4.

> >But rather nice initial results.
>
> Indeed, I can confirm:
>
> I did 5000s 25tps tests:
> - Off: 8002 transactions lost (6.3%)
> - On: 158 transactions "lost" (0.12%).
>
> Although it is still 13 times larger than the 12 (0.01%) lost with my every
> 0.2s CHECKPOINT hack, it is nevertheless much much better than before!
>
> The bad news, under pgbench unthrottled load, the tps is divided by 2 (300
> -> 150, could have been worse), *BUT* is also much smoother, the tps is not
> going to 0, but stay in 50-100 range before the next spike.

Yea, I'm not surprised. With a sensible (aka larger) checkpoint_timeout
the performance penalty isn't that big, but it's there. That's why I
think (as mentioned to Heikki nearby) it needs to be combined with
sorting during the checkpoint phase.

> I'm wondering about he order of operations. It seems to me that you sync
> just after giving back a buffer.

Yep. Was just a rather quick patch...

> Maybe it would be better to pipeline it,
> that is something like:
>
> round 0:
> send buffers 0
> sleep?
>
> round N:
> sync buffers N-1
> send buffers N
> sleep?
>
> final N sync:
> sync buffer N

Yes, I think we're going to need to leave a bit more room for write
combining and such here. But I think it's going to be better to issue
flushes for several buffers together - just not after each write(). To
be really beneficial it needs sorted output though.

> I have not found how to control the checkpoint pacing interval, if there is
> such a thing. With a 200ms lag limit on pgbench, it would be nice if it is
> less than 200ms.

Not sure what you mean.

> I found this old thread "Add basic checkpoint sync spreading" by Greg Smith
> and Simons Riggs, dating from 2010:
> http://www.postgresql.org/message-id/4CE07548.4030709@2ndquadrant.com
> https://commitfest.postgresql.org/action/patch_view?id=431 which ends up
> "returned with feedback".

I didn't really like the unapplied remainder of what was proposed in
there.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-28 06:27:26
Message-ID: alpine.DEB.2.10.1408280811590.28571@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello Aidan,

> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

I tried that by setting:
vm.dirty_expire_centisecs = 100
vm.dirty_writeback_centisecs = 100

So it should start writing returned buffers at most 2s after they are
returned, if I understood the doc correctly, instead of at most 35s.

The result is that with a 5000s 25tps pretty small load (the system can do
300tps with the default configuration), I lost 2091 (1.7%) of the
transactions, that is, they were beyond the 200ms schedule limit. Another
change is that overall the lost transactions are more spread out than
without this setting, although there still are stretches of
unresponsiveness.

So although the situation is significantly better, it is still far from
good with the much reduced value I tried.

--
Fabien.


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-28 16:00:02
Message-ID: CAGTBQpYuT7QrLNRZy4VF1HKW=nW+50c3POC8OxHMPFMt6FYUnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 28, 2014 at 3:27 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:
> Hello Aidan,
>
>
>> If all you want is to avoid the write storms when fsyncs start happening
>> on
>> slow storage, can you not just adjust the kernel vm.dirty* tunables to
>> start making the kernel write out dirty buffers much sooner instead of
>> letting them accumulate until fsyncs force them out all at once?
>
>
> I tried that by setting:
> vm.dirty_expire_centisecs = 100
> vm.dirty_writeback_centisecs = 100
>
> So it should start writing returned buffers at most 2s after they are
> returned, if I understood the doc correctly, instead of at most 35s.
>
> The result is that with a 5000s 25tps pretty small load (the system can do
> 300tps with the default configuration), I lost 2091 (1.7%) of transactions,
> that is they were beyond the 200ms schedule limit. Another change is that
> overall the lost transactions are more spread than without this setting,
> although there still is stretches of unresponsiveness.
>
> So although the situation is significantly better, it is still far from good
> with the much reduced value I tried.

What do the checkpoint logs look like now? (especially interested in sync times)


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-28 18:04:06
Message-ID: alpine.DEB.2.10.1408281947350.28571@sto
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> I tried that by setting:
>> vm.dirty_expire_centisecs = 100
>> vm.dirty_writeback_centisecs = 100
>>
>> So it should start writing returned buffers at most 2s after they are
>> returned, if I understood the doc correctly, instead of at most 35s.
>>
>> The result is that with a 5000s 25tps pretty small load (the system can do
>> 300tps with the default configuration), I lost 2091 (1.7%) of transactions,
>> that is they were beyond the 200ms schedule limit. Another change is that
>> overall the lost transactions are more spread than without this setting,
>> although there still is stretches of unresponsiveness.
>>
>> So although the situation is significantly better, it is still far from good
>> with the much reduced value I tried.
>
> What do the checkpoint logs look like now? (especially interested in sync times)

Here is an extract:

LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 4893 buffers (4.8%);
0 transaction log file(s) added, 0 removed, 0 recycled;
write=128.430 s, sync=0.378 s, total=128.921 s;
sync files=11, longest=0.375 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6200 buffers (6.1%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=128.934 s, sync=0.240 s, total=129.280 s;
sync files=7, longest=0.132 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6177 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=130.146 s, sync=0.322 s, total=130.592 s;
sync files=7, longest=0.185 s, average=0.046 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6185 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=132.673 s, sync=0.143 s, total=132.909 s;
sync files=5, longest=0.078 s, average=0.028 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6188 buffers (6.0%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=130.415 s, sync=0.170 s, total=130.676 s;
sync files=5, longest=0.132 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6203 buffers (6.1%);
0 transaction log file(s) added, 3 removed, 0 recycled;
write=131.726 s, sync=0.444 s, total=132.256 s;
sync files=5, longest=0.441 s, average=0.088 s
...

Basically sync is between 0.1-0.5 seconds.

I had one particularly bad stretch of unresponsiveness, which is not
clearly related to a slow checkpoint sync:

progress: 4065.0 s, 28.7 tps, lat 11.886 ms stddev 48.470, lag 18.851 ms, 0 skipped
progress: 4066.2 s, 1.6 tps, lat 952.830 ms stddev 280.673, lag 155.310 ms, 25 skipped
progress: 4067.0 s, 2.7 tps, lat 1067.730 ms stddev 316.321, lag 171.766 ms, 14 skipped
progress: 4068.4 s, 1.4 tps, lat 1147.392 ms stddev 240.337, lag 132.297 ms, 35 skipped
progress: 4069.7 s, 2.3 tps, lat 1034.543 ms stddev 213.786, lag 154.453 ms, 35 skipped
progress: 4070.2 s, 1.8 tps, lat 562.715 ms stddev 0.000, lag 182.833 ms, 12 skipped
progress: 4071.3 s, 3.6 tps, lat 600.929 ms stddev 108.232, lag 162.723 ms, 25 skipped
progress: 4072.5 s, 1.7 tps, lat 1077.187 ms stddev 77.654, lag 168.849 ms, 31 skipped
progress: 4073.3 s, 1.3 tps, lat 1298.093 ms stddev 0.000, lag 168.882 ms, 21 skipped
progress: 4074.0 s, 2.7 tps, lat 920.704 ms stddev 183.587, lag 165.333 ms, 24 skipped
progress: 4075.3 s, 2.4 tps, lat 756.051 ms stddev 118.241, lag 176.863 ms, 29 skipped
progress: 4076.1 s, 1.3 tps, lat 1424.548 ms stddev 0.000, lag 0.000 ms, 17 skipped
progress: 4077.3 s, 2.4 tps, lat 791.090 ms stddev 338.729, lag 155.403 ms, 26 skipped
progress: 4078.1 s, 27.4 tps, lat 46.834 ms stddev 198.812, lag 3.915 ms, 0 skipped

--
Fabien


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 17:19:59
Message-ID: 20140830171959.GA2327@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
> them out in order (http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp).
> The performance impact of that was inconclusive, but one thing that it
> allows nicely is to interleave the fsyncs, so that you write all the buffers
> for one file, then fsync it, then next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly large
> amount of memory, and making a large allocation in the checkpointer might
> cause an out-of-memory, which would be bad.
>
> I don't think anyone's seriously worked on this area since. If the impact on
> responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.

I've dug up that patch (after a bit of fighting with the archives) and
refreshed it. It's *clearly* beneficial:

master:
andres(at)awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219
...
progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720
progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228
progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546
progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740
progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934
..
progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685
progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342
progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916
progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448
progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339
progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340
progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648
progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297
progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224
progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901
progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576
progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520
progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346
progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670
progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411
progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096
progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690
progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366
progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900
progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774
progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501
progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921
progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426
progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247
progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082
progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577
progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611
progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498
progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970
progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848
progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592
progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851
progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976
progress: 122.0 s, 3398.1 tps, lat 51.322 ms stddev 48.057
progress: 123.0 s, 3721.9 tps, lat 50.355 ms stddev 46.994
progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553
progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508
progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053
...
progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210
progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411
progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650
progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596
progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174
progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444
progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611
progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121
progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377
progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927
progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472
progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491
progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754
progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857
progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488
progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162
progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952
progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539
progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094
progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200
progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev 552.133
progress: 165.5 s, 9950.2 tps, lat 1424.984 ms stddev 240.249
progress: 166.4 s, 137.3 tps, lat 1056.184 ms stddev 765.356
progress: 167.2 s, 249.7 tps, lat 793.423 ms stddev 792.666
progress: 168.6 s, 194.0 tps, lat 851.712 ms stddev 737.653
progress: 169.2 s, 136.0 tps, lat 1229.394 ms stddev 1227.723
progress: 170.6 s, 92.3 tps, lat 1569.107 ms stddev 1517.027
progress: 171.2 s, 276.5 tps, lat 1325.410 ms stddev 1252.237
progress: 172.0 s, 3881.2 tps, lat 91.122 ms stddev 366.052
progress: 174.4 s, 995.6 tps, lat 120.200 ms stddev 242.005
progress: 174.4 s, 16227.2 tps, lat 651.361 ms stddev 527.946
progress: 175.1 s, 200.0 tps, lat 808.114 ms stddev 542.989
progress: 176.1 s, 179.4 tps, lat 1024.893 ms stddev 724.887
progress: 177.4 s, 367.0 tps, lat 648.519 ms stddev 790.413
progress: 178.4 s, 2236.7 tps, lat 114.064 ms stddev 271.525
progress: 179.0 s, 3643.5 tps, lat 48.404 ms stddev 47.793
progress: 180.0 s, 3309.0 tps, lat 55.598 ms stddev 49.211
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 496443
latency average: 65.245 ms
latency stddev: 134.292 ms
tps = 2756.518300 (including connections establishing)
tps = 2756.718588 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 139127 buffers (21.2%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=53.660 s, sync=6.739 s, total=60.611 s; sync files=18, longest=4.382 s, average=0.374 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 70472 buffers (10.8%); 0 transaction log file(s) added, 0 removed, 70 recycled; write=53.107 s, sync=5.375 s, total=58.605 s; sync files=15, longest=3.243 s, average=0.358 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 80357 buffers (12.3%); 0 transaction log file(s) added, 0 removed, 49 recycled; write=53.526 s, sync=3.896 s, total=57.573 s; sync files=15, longest=1.206 s, average=0.259 s

sorted writes:
starting vacuum...end.
progress: 1.0 s, 3237.2 tps, lat 49.544 ms stddev 42.203
..
progress: 88.0 s, 3030.8 tps, lat 47.438 ms stddev 46.563
progress: 89.3 s, 919.0 tps, lat 204.152 ms stddev 172.776
progress: 90.1 s, 973.7 tps, lat 204.650 ms stddev 197.843
progress: 91.1 s, 875.5 tps, lat 200.472 ms stddev 157.295
progress: 92.0 s, 939.5 tps, lat 185.357 ms stddev 146.362
progress: 93.1 s, 975.8 tps, lat 184.445 ms stddev 167.520
progress: 94.9 s, 673.0 tps, lat 260.900 ms stddev 224.125
progress: 95.0 s, 1057.6 tps, lat 223.717 ms stddev 260.276
progress: 96.0 s, 917.7 tps, lat 201.704 ms stddev 220.395
progress: 97.0 s, 2573.0 tps, lat 78.913 ms stddev 119.458
progress: 98.0 s, 3921.5 tps, lat 44.160 ms stddev 39.147
progress: 99.0 s, 4033.3 tps, lat 43.972 ms stddev 45.921
..
progress: 110.0 s, 4057.8 tps, lat 44.283 ms stddev 47.003
progress: 111.0 s, 897.6 tps, lat 158.250 ms stddev 132.126
progress: 112.0 s, 751.6 tps, lat 249.282 ms stddev 232.964
progress: 113.0 s, 550.6 tps, lat 296.155 ms stddev 226.163
progress: 114.0 s, 1875.3 tps, lat 123.059 ms stddev 210.661
progress: 115.1 s, 4555.4 tps, lat 39.404 ms stddev 32.331
..
progress: 151.0 s, 4240.2 tps, lat 42.158 ms stddev 38.577
progress: 152.1 s, 2551.3 tps, lat 63.584 ms stddev 73.780
progress: 153.1 s, 912.5 tps, lat 189.030 ms stddev 146.658
progress: 154.2 s, 976.2 tps, lat 192.499 ms stddev 158.108
progress: 155.0 s, 1005.1 tps, lat 178.215 ms stddev 129.062
progress: 156.0 s, 1010.1 tps, lat 175.583 ms stddev 138.823
progress: 157.0 s, 1042.0 tps, lat 161.675 ms stddev 135.350
progress: 158.1 s, 766.8 tps, lat 216.367 ms stddev 234.374
progress: 159.5 s, 1025.7 tps, lat 199.366 ms stddev 228.530
progress: 160.0 s, 1178.6 tps, lat 159.156 ms stddev 130.660
progress: 161.0 s, 2904.1 tps, lat 65.751 ms stddev 79.928
progress: 162.0 s, 4189.6 tps, lat 43.492 ms stddev 38.490
..
progress: 170.0 s, 4150.9 tps, lat 41.998 ms stddev 40.931
progress: 171.1 s, 595.5 tps, lat 233.138 ms stddev 188.073
progress: 172.0 s, 622.6 tps, lat 245.023 ms stddev 279.715
progress: 173.0 s, 588.1 tps, lat 359.104 ms stddev 315.367
progress: 174.0 s, 3011.0 tps, lat 74.811 ms stddev 172.972
progress: 175.0 s, 4829.0 tps, lat 37.520 ms stddev 30.578
progress: 176.0 s, 4324.1 tps, lat 41.388 ms stddev 37.727
progress: 177.0 s, 3632.8 tps, lat 49.219 ms stddev 47.092
progress: 178.0 s, 3621.0 tps, lat 50.332 ms stddev 43.086
progress: 179.0 s, 3468.3 tps, lat 50.969 ms stddev 43.025
progress: 180.0 s, 3710.7 tps, lat 48.782 ms stddev 49.070
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 634014
latency average: 51.089 ms
latency stddev: 63.864 ms
number of transactions above the 100.0 ms latency limit: 66921
tps = 3520.574003 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 135453 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=53.419 s, sync=3.665 s, total=57.390 s; sync files=17, longest=1.937 s, average=0.215 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 96907 buffers (14.8%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.147 s, sync=3.369 s, total=56.633 s; sync files=15, longest=1.920 s, average=0.224 s

sorted writes + flush in SyncOneBuffer():

andres(at)awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 3044.7 tps, lat 49.030 ms stddev 45.608
progress: 2.0 s, 3701.3 tps, lat 48.231 ms stddev 44.762
...
progress: 55.0 s, 4157.1 tps, lat 43.216 ms stddev 40.821
progress: 56.0 s, 4736.3 tps, lat 38.326 ms stddev 35.699
progress: 57.0 s, 2604.9 tps, lat 66.072 ms stddev 65.534
progress: 58.0 s, 1659.8 tps, lat 113.848 ms stddev 121.625
progress: 59.0 s, 2625.2 tps, lat 66.112 ms stddev 54.349
..
progress: 106.0 s, 2571.8 tps, lat 72.806 ms stddev 83.308
progress: 107.0 s, 1818.7 tps, lat 93.315 ms stddev 103.284
progress: 108.1 s, 1876.6 tps, lat 97.447 ms stddev 103.151
progress: 109.1 s, 1768.8 tps, lat 110.406 ms stddev 121.347
progress: 110.0 s, 3129.4 tps, lat 56.100 ms stddev 54.835
progress: 111.0 s, 4402.0 tps, lat 40.581 ms stddev 42.269
progress: 112.0 s, 4379.3 tps, lat 42.625 ms stddev 39.507
progress: 113.0 s, 4551.6 tps, lat 39.303 ms stddev 30.563
progress: 114.0 s, 4383.3 tps, lat 41.169 ms stddev 37.459
progress: 115.1 s, 4652.3 tps, lat 38.769 ms stddev 33.415
progress: 116.1 s, 4015.5 tps, lat 41.939 ms stddev 34.122
progress: 117.1 s, 2159.0 tps, lat 84.653 ms stddev 81.445
progress: 118.0 s, 2971.3 tps, lat 62.874 ms stddev 62.229
progress: 119.0 s, 2825.1 tps, lat 60.875 ms stddev 54.767
progress: 120.0 s, 2542.9 tps, lat 75.276 ms stddev 68.709
progress: 121.1 s, 2650.4 tps, lat 63.833 ms stddev 56.419
progress: 122.0 s, 2766.8 tps, lat 68.196 ms stddev 67.042
..
progress: 166.0 s, 2412.4 tps, lat 71.307 ms stddev 74.961
progress: 167.1 s, 1742.5 tps, lat 99.519 ms stddev 102.500
progress: 168.1 s, 1324.2 tps, lat 122.625 ms stddev 135.177
progress: 169.0 s, 2002.5 tps, lat 108.435 ms stddev 125.576
progress: 170.0 s, 2796.6 tps, lat 64.867 ms stddev 64.665
progress: 171.0 s, 3969.1 tps, lat 45.520 ms stddev 41.656
...
progress: 180.0 s, 3003.9 tps, lat 59.134 ms stddev 51.508
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 584725
latency average: 55.375 ms
latency stddev: 56.202 ms
number of transactions above the 100.0 ms latency limit: 76644
tps = 3246.299654 (including connections establishing)
tps = 3246.607040 (excluding connections establishing)

LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 135474 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=53.260 s, sync=0.079 s, total=53.577 s; sync files=19, longest=0.032 s, average=0.004 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 92770 buffers (14.2%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.677 s, sync=0.239 s, total=54.026 s; sync files=16, longest=0.093 s, average=0.014 s

So, *very* clearly sorting is a benefit. Both the dips in performance
are lower and the aggregate throughput is higher.

Additionally, flushing buffers during the buffer sync is a clear win from
the perspective of reliable performance, but it has a noticeable
performance cost - although much less than without sorting.

Now, I've used an absurdly small checkpoint_timeout here to have the time
to run these benchmarks. But I'm damn sure this is similar with real
settings.

So, what I think we need to do is:

1) Sort writes on checkpoints. It's a clear win. Personally I'm
perfectly happy waving away the concerns about OOM. It's not that much
memory:
# SELECT pg_size_pretty(24 * ((64::int8 * 1024 * 1024 * 1024) /
8192::int8));
┌────────────────┐
│ pg_size_pretty │
├────────────────┤
│ 192 MB │
└────────────────┘

If you can afford 64GB shared buffers, 192MB memory allocated in the
checkpointer hardly is a problem. (24 is sizeof(BufferTag) +
sizeof(bufid).) A rough sketch of the sorting follows after this list.

2) At least optionally flush dirty buffers during the buffer sync
scan. The improvements in steady performance are too big to ignore. I
think we might be able to make it an unconditional win by flushing in
batches instead of doing so for each buffer like I've done.

3) Possibly do the mdsync() after & for each filenode (or segment?) like
Heikki basically just prototyped. That'll reduce the number of dirty
buffers our fsync() will flush out unnecessarily (which will create
stalls because they'll sit in the kernel's buffer cache).
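
To make 1) and 3) concrete, here is a minimal sketch of the idea (not the
actual patch; the struct is only a stand-in for the real BufferTag plus
the buffer id, and the field names are invented):

#include <stdlib.h>             /* qsort */

typedef struct CkptSortEntry
{
    unsigned int    tblspc;     /* stand-ins for the RelFileNode fields */
    unsigned int    db;
    unsigned int    rel;
    int             forknum;
    unsigned int    blocknum;
    int             buf_id;     /* which shared buffer this is */
} CkptSortEntry;

static int
ckpt_buffer_cmp(const void *a, const void *b)
{
    const CkptSortEntry *x = a;
    const CkptSortEntry *y = b;

    /* order by file (tablespace, database, relation, fork), then block,
     * so all writes to one file come out contiguously and in block order */
    if (x->tblspc != y->tblspc)
        return x->tblspc < y->tblspc ? -1 : 1;
    if (x->db != y->db)
        return x->db < y->db ? -1 : 1;
    if (x->rel != y->rel)
        return x->rel < y->rel ? -1 : 1;
    if (x->forknum != y->forknum)
        return x->forknum < y->forknum ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    return 0;
}

/*
 * Collect every dirty buffer's tag into entries[0..n-1] at checkpoint
 * start, qsort(entries, n, sizeof(CkptSortEntry), ckpt_buffer_cmp),
 * write the buffers out in that order, and fsync() each file as soon as
 * its last entry has been written.
 */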

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I think that'll perform horribly on large shared buffers with many
relations. Not an uncommon scenario...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 17:50:40
Message-ID: 6599.1409421040@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
>> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
>> them out in order (http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp).
>> The performance impact of that was inconclusive, but one thing that it
>> allows nicely is to interleave the fsyncs, so that you write all the buffers
>> for one file, then fsync it, then next file and so on.

> ...
> So, *very* clearly sorting is a benefit.

pg_bench alone doesn't convince me on this. The original thread found
cases where it was a loss, IIRC; you will need to test many more than
one scenario to prove the point.

Also, it does not matter how good it looks in test cases if it causes
outright failures due to OOM; unlike you, I am not prepared to just "wave
away" that risk. A possible compromise is to sort a limited number of
buffers ---- say, collect a few thousand dirty buffers then sort, dump and
fsync them, repeat as needed.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 18:04:05
Message-ID: 20140830180405.GB25523@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> >> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
> >> them out in order (http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp).
> >> The performance impact of that was inconclusive, but one thing that it
> >> allows nicely is to interleave the fsyncs, so that you write all the buffers
> >> for one file, then fsync it, then next file and so on.
>
> > ...
> > So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this. The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

Sure. And I'm not claiming Itagaki/your old patch is immediately going
to be ready for commit. But our checkpoint performance has sucked for
years in the field. I don't think we can wave that away.

I think the primary reason it wasn't easily visible as being beneficial
back then was that only the throughput, not the latency and such were
looked at.

> Also, it does not matter how good it looks in test cases if it causes
> outright failures due to OOM; unlike you, I am not prepared to just "wave
> away" that risk.

I'm not "waving away" any risks.

If the sort buffer is allocated when the checkpointer is started, not
every time we sort, as you've done in your version of the patch, I think
that risk is pretty manageable. If we really want to be sure nothing is
happening at runtime, even if the checkpointer was restarted, we can put
the sort array in shared memory.

We're talking about (sizeof(BufferTag) + sizeof(int))/8192 ~= 0.3 %
overhead over shared_buffers here. If we decide to go that way, it's a
pretty darn small price to pay to not regularly have stalls that last
minutes.

> A possible compromise is to sort a limited number of
> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
> fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because
it robs us of the chance to fsync() a relfilenode immediately after
having synced all its buffers. Doing so is rather beneficial because
then fewer independently dirtied pages have to be flushed out - reducing
the impact of the fsync().

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 18:16:10
Message-ID: 7226.1409422570@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
>> A possible compromise is to sort a limited number of
>> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
>> fsync them, repeat as needed.

> Yea, that's what I suggested nearby. But I don't really like it, because
> it robs us of the the chance to fsync() a relfilenode immediately after
> having synced all its buffers.

Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
once per sort-group and not just once per checkpoint, but it's not clear
that that's a loss as long as the group size isn't tiny.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 18:45:59
Message-ID: 20140830184558.GA31166@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> >> A possible compromise is to sort a limited number of
> >> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
> >> fsync them, repeat as needed.
>
> > Yea, that's what I suggested nearby. But I don't really like it, because
> > it robs us of the the chance to fsync() a relfilenode immediately after
> > having synced all its buffers.
>
> Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
> once per sort-group and not just once per checkpoint, but it's not clear
> that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of
data anymore. If lots of other relfilenodes have been synced in between,
the amount of newly dirtied pages in the OS's buffer cache (written by
backends, bgwriter) for an individual relfilenode is much higher.

An fsync() on a file with dirty data often causes *serious* latency
spikes - we should try hard to avoid superfluous calls.

As an example: Calling fsync() on pgbench_accounts's underlying files,
from outside postgres, *before* postgres even started its first
checkpoint does this:
progress: 72.0 s, 4324.9 tps, lat 41.481 ms stddev 40.567
progress: 73.0 s, 4704.9 tps, lat 38.465 ms stddev 35.436
progress: 74.0 s, 4448.5 tps, lat 40.058 ms stddev 32.634
progress: 75.0 s, 4634.5 tps, lat 39.229 ms stddev 33.463
progress: 76.8 s, 2753.1 tps, lat 48.693 ms stddev 75.309
progress: 77.1 s, 126.6 tps, lat 773.433 ms stddev 222.667
progress: 78.0 s, 183.7 tps, lat 786.401 ms stddev 395.954
progress: 79.1 s, 170.3 tps, lat 975.949 ms stddev 596.751
progress: 80.0 s, 2116.6 tps, lat 168.608 ms stddev 398.933
progress: 81.0 s, 4436.1 tps, lat 40.313 ms stddev 34.198
progress: 82.0 s, 4383.9 tps, lat 41.811 ms stddev 37.241

Note the dip from 4k tps to 130 tps.

We can get a handle on that (on some platforms at least) for writes
issued during the buffer sync by forcing the kernel to write out the
pages in small increments; but I doubt we want to do that for writes by
backends themselves.
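
On Linux that could look roughly like this (a sketch only; the batch size
is made up, and real code would track how much has been written per file):

#define _GNU_SOURCE
#include <fcntl.h>              /* sync_file_range() */

#define FLUSH_BATCH_BYTES   (32 * 8192)     /* e.g. every 32 written pages */

/*
 * Ask the kernel to start writing back the given range now, without
 * waiting for completion.  Keeping the amount of dirty data in the OS
 * cache small means the eventual fsync() has very little left to do.
 */
static void
flush_written_range(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

The buffer sync would call that every FLUSH_BATCH_BYTES written to the
same file; the return value is ignored because this is only a hint to the
kernel.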

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 22:50:48
Message-ID: 54025548.3030800@vmware.com
Lists: pgsql-hackers

On 08/30/2014 09:45 PM, Andres Freund wrote:
> On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
>> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>>> On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
>>>> A possible compromise is to sort a limited number of
>>>> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
>>>> fsync them, repeat as needed.
>>
>>> Yea, that's what I suggested nearby. But I don't really like it, because
>>> it robs us of the the chance to fsync() a relfilenode immediately after
>>> having synced all its buffers.
>>
>> Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
>> once per sort-group and not just once per checkpoint, but it's not clear
>> that that's a loss as long as the group size isn't tiny.
>
> Because it wouldn't have the benefit of sycing the minimal amount of
> data anymore. If lots of other relfilenodes have been synced inbetween
> the amount of newly dirtied pages in the os' buffercache (written by
> backends, bgwriter) for a individual relfilenode is much higher.

I wonder how much of the benefit from sorting comes from sorting the
pages within each file, and how much just from grouping all the writes
of each file together. In other words, how much difference is there
between sorting, and fsyncing between each file, or the crude patch I
posted earlier.

If we're going to fsync between each file, there's no need to sort all
the buffers at once. It's enough to pick one file as the target - like
in my crude patch - and sort only the buffers for that file. Then fsync
that file and move on to the next file. That requires scanning the
buffers multiple times, but I think that's OK.
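
In rough pseudo-C that's (helper names invented, purely to illustrate the
control flow, not working code):

/* repeat until no dirty buffer from the checkpoint's set remains */
while (have_unwritten_checkpoint_buffers())
{
    /* pick one relation segment file as the target */
    FileTarget  target = pick_next_target_file();

    /* scan the pool, collect and sort only that file's dirty buffers */
    int         n = collect_and_sort_buffers_for(target, entries);

    /* write them in block order, then fsync just that one file */
    for (int i = 0; i < n; i++)
        write_buffer(entries[i]);
    fsync_target(target);
}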

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-30 23:10:38
Message-ID: 20140830231038.GC25523@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
> On 08/30/2014 09:45 PM, Andres Freund wrote:
> >On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
> >>Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> >>>On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> >>>>A possible compromise is to sort a limited number of
> >>>>buffers ---- say, collect a few thousand dirty buffers then sort, dump and
> >>>>fsync them, repeat as needed.
> >>
> >>>Yea, that's what I suggested nearby. But I don't really like it, because
> >>>it robs us of the the chance to fsync() a relfilenode immediately after
> >>>having synced all its buffers.
> >>
> >>Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
> >>once per sort-group and not just once per checkpoint, but it's not clear
> >>that that's a loss as long as the group size isn't tiny.
> >
> >Because it wouldn't have the benefit of sycing the minimal amount of
> >data anymore. If lots of other relfilenodes have been synced inbetween
> >the amount of newly dirtied pages in the os' buffercache (written by
> >backends, bgwriter) for a individual relfilenode is much higher.
>
> I wonder how much of the benefit from sorting comes from sorting the pages
> within each file, and how much just from grouping all the writes of each
> file together. In other words, how much difference is there between sorting,
> and fsyncing between each file, or the crude patch I posted earlier.

I haven't implemented fsync()ing between files so far. From the io stats
I'm seeing, the performance improvements come from the OS being able to
write data back in bigger chunks. Which seems entirely reasonable. If
the database and the write load are big enough that writeback will be
triggered repeatedly during one checkpoint, the OS's buffercache will
have lots of nonsequential data to flush. Leading to much smaller IOs,
more seeks and deeper queues (=> latency increases).

> If we're going to fsync between each file, there's no need to sort all the
> buffers at once. It's enough to pick one file as the target - like in my
> crude patch - and sort only the buffers for that file. Then fsync that file
> and move on to the next file. That requires scanning the buffers multiple
> times, but I think that's OK.

I really can't see that working out. Production instances of postgres
with large shared_buffers settings (say 96GB in one case) have tens of
thousands of relations (~34500 in the same case). And that's a database
with a relatively simple schema. I've seen much worse.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Mitsumasa KONDO <kondo(dot)mitsumasa(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-31 02:02:48
Message-ID: CADupcHXqXATH6y-BZUu6qGqkfPgbrPi_5mAQ0+8XFoCuMvG=LA@mail.gmail.com
Lists: pgsql-hackers

Hi,

2014-08-31 8:10 GMT+09:00 Andres Freund <andres(at)2ndquadrant(dot)com>:

> On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
>
> If we're going to fsync between each file, there's no need to sort all the
> > buffers at once. It's enough to pick one file as the target - like in my
> > crude patch - and sort only the buffers for that file. Then fsync that
> file
> > and move on to the next file. That requires scanning the buffers multiple
> > times, but I think that's OK.
>
> I really can't see that working out. Production instances of postgres
> with large shared_buffers settings (say 96GB in one case) have tens of
> thousands of relations (~34500 in the same case). And that's a database
> with a relatively simple schema. I've seen much worse.

Yeah, it is impossible in one checkpointer process. The cost of searching
all buffers is higher than we expect. We need a clever algorithm for
efficient, distributed buffer search using multiple processes or threads.

Regards,
--
Mitsumasa KONDO


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-31 06:44:01
Message-ID: alpine.DEB.2.10.1408310826420.14824@sto
Lists: pgsql-hackers


Hello Heikki,

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs, see
> attached. It works by repeatedly scanning the buffer pool, writing buffers
> belonging to a single relation segment at a time.

I tried this patch on the same host I used, with the same "-R 25 -L 200 -T
5000", alas without significant positive effects: about 6% of transactions
were "lost", including stretches of several seconds of unresponsiveness.

Maybe this is because "pgbench -N" basically only touches 2 tables
(accounts & history), so there are few file syncs involved, and thus it
does not help much with spreading a lot of writes.

2014-08-30 23:52:24.167 CEST:
LOG: checkpoint starting: xlog
2014-08-30 23:54:09.239 CEST:
LOG: checkpoint sync: number=1 file=base/16384/11902 time=24.529 msec
2014-08-30 23:54:23.812 CEST:
LOG: checkpoint sync: number=1 file=base/16384/16397_vm time=32.547 msec
2014-08-30 23:54:24.771 CEST:
LOG: checkpoint sync: number=1 file=base/16384/11873 time=557.470 msec
2014-08-30 23:54:37.419 CEST:
LOG: checkpoint complete: wrote 4931 buffers (30.1%);
0 transaction log file(s) added, 0 removed, 3 recycled;
write=122.946 s, sync=10.129 s, total=133.252 s;
sync files=6, longest=9.854 s, average=1.790 s

Note that given the small load and table size, pgbench implies random
I/Os and basically nothing else.

--
Fabien.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-02 15:04:35
Message-ID: CAMkU=1xf3gjRoY9xyxuCqU70YzuWm7fPLivRAGd-TXFOErrF-A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 26, 2014 at 1:02 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:

>
> Hello again,
>
>
> I have not found any mean to force bgwriter to send writes when it can.
>>> (Well, I have: create a process which sends "CHECKPOINT" every 0.2
>>> seconds... it works more or less, but this is not my point:-)
>>>
>>
>> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
>> circle the buffer pool at least once every 2 minutes. It is currently
>> fixed, but it should be trivial to turn it into an experimental guc that
>> you could use to test your hypothesis.
>>
>
> I recompiled with the variable coldly set to 1000 instead of 120000. The
> situation is slightly degraded (15% of transactions were above 200 ms
> late). However it seems that bgwriter did not write much more pages:
>

You should probably try it set to 200 rather than 1000, to put it on an
equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

Not that I think this will improve the situation. After all, my theory is
that it does not matter who *writes* the pages, it only matters how they
get fsynced.

>
> buffers_checkpoint = 26065
> buffers_clean = 5263
> buffers_backend = 367
>
> Or I may have a problem interpreting pg_stat_bgwriter.
>

For this experiment, what was checkpoint_timeout set to?

Cheers,

Jeff


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-02 15:14:19
Message-ID: alpine.DEB.2.10.1409021709450.19169@sto
Lists: pgsql-hackers


>>> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
>>> circle the buffer pool at least once every 2 minutes. It is currently
>>> fixed, but it should be trivial to turn it into an experimental guc that
>>> you could use to test your hypothesis.
>>
>> I recompiled with the variable coldly set to 1000 instead of 120000. The
>> situation is slightly degraded (15% of transactions were above 200 ms
>> late). However it seems that bgwriter did not write much more pages:
>
>
> You should probably try it set to 200 rather than 1000, to put it on an
> equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

As I understand it, the setting makes the bgwriter process scan all of
shared_buffers within this amount of time... but ISTM that the key point is
that bgwriter has no incentive to start writing out buffers anyway with
its current decision rules, and that should not change with the frequency
at which they are scanned (?)

> For this experiment, what was checkpoint_timeout set to?

AFAICR, the default, 5min.

--
Fabien.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-02 15:36:19
Message-ID: CAMkU=1xWqKbaE_kX0tAgOt5j-hZsOkqqdqya9m18r1Ft8G4SUA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 2, 2014 at 8:14 AM, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> wrote:

>
> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
>>>> circle the buffer pool at least once every 2 minutes. It is currently
>>>> fixed, but it should be trivial to turn it into an experimental guc that
>>>> you could use to test your hypothesis.
>>>>
>>>
>>> I recompiled with the variable coldly set to 1000 instead of 120000. The
>>> situation is slightly degraded (15% of transactions were above 200 ms
>>> late). However it seems that bgwriter did not write much more pages:
>>>
>>
>>
>> You should probably try it set to 200 rather than 1000, to put it on an
>> equal footing with the checkpoint_timeout of 0.2 seconds you reported on.
>>
>
> As I understand it, the setting makes the bgwriter processe scan all
> shared_buffers every this amount of time... but ITSM that the key point is
> that bgwriter has no insentive to start writing out buffers anyway with its
> current decision rules, and that should not change with the frequency at
> which they are scanned (?)

Ah, I see now. The usage counts are not zero, so it visits the buffer and
then leaves it alone.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-03 21:08:12
Message-ID: CA+TgmoakiQSFHb8BhoKeiM8Akz6k-oiQG05tx+npP4f6e19b1Q@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> If the sort buffer is allocated when the checkpointer is started, not
> everytime we sort, as you've done in your version of the patch I think
> that risk is pretty manageable. If we really want to be sure nothing is
> happening at runtime, even if the checkpointer was restarted, we can put
> the sort array in shared memory.

I don't think that allocating the array at checkpointer start time
helps. If it works, then you're strictly worse off than if you
allocate it at every checkpoint, because you're holding onto the
memory all the time instead of only when it's being used. And if it
fails, what then? Sure, you can have that copy of the checkpointer
process exit, but that does nothing good. The postmaster will keep on
restarting it and it will keep on dying for lack of memory, and no
checkpoints will complete. Oops.

So it seems to me that the possibly-sensible approaches are:

1. Allocate an array when we need to sort, and if the allocation
fails, have some kind of fallback strategy, like logging a WARNING and
writing the buffers out without sorting them. If it succeeds, do the
checkpoint and then free the memory until we need it again. (See the
sketch after this list.)

2. Putting the array in shared_memory, so that once the server is
started, we can be sure the memory is allocated and the sort will
work.
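
For what it's worth, 1 could look roughly like this (names invented, plain
malloc() used only for illustration):

CkptSortItem *items = malloc(num_dirty * sizeof(CkptSortItem));

if (items != NULL)
{
    fill_sort_items(items, num_dirty);      /* hypothetical: copy buffer tags */
    sort_items(items, num_dirty);
    write_buffers_sorted(items, num_dirty);
    free(items);                /* give the memory back until next time */
}
else
{
    elog(WARNING, "could not allocate checkpoint sort array, writing buffers unsorted");
    write_buffers_unsorted();   /* today's behaviour as the fallback */
}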

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-03 21:36:10
Message-ID: 20140903213610.GA3625@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-09-03 17:08:12 -0400, Robert Haas wrote:
> On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > If the sort buffer is allocated when the checkpointer is started, not
> > everytime we sort, as you've done in your version of the patch I think
> > that risk is pretty manageable. If we really want to be sure nothing is
> > happening at runtime, even if the checkpointer was restarted, we can put
> > the sort array in shared memory.
>
> I don't think that allocating the array at checkpointer start time
> helps. If it works, then you're strictly worse off than if you
> allocate it at every checkpoint, because you're holding onto the
> memory all the time instead of only when it's being used. And if it
> fails, what then? Sure, you can have that copy of the checkpointer
> process exit, but that does nothing good. The postmaster will keep on
> restarting it and it will keep on dying for lack of memory, and no
> checkpoints will complete. Oops.

It's imo quite clearly better to keep it allocated. For one, after
postmaster has started the checkpointer successfully you don't need to be
worried about later failures to allocate memory if you allocate it once
(unless the checkpointer FATALs out, which should be exceedingly rare -
we're catching ERRORs). It's much much more likely to succeed
initially. Secondly, it's not like there's really that much time where no
checkpointer is running.

> So it seems to me that the possibly-sensible approaches are:
>
> 1. Allocate an array when we need to sort, and if the allocation
> fails, have some kind of fallback strategy, like logging a WARNING an
> writing the buffers out without sorting them. If it succeeds, do the
> checkpoint and then free the memory until we need it again.

I think if we want to go that way I vote for keeping the array allocated
and continuing to try to allocate it after allocation failures. And, as
you suggest, fall back to a simple sequential search through all
buffers.

> 2. Putting the array in shared_memory, so that once the server is
> started, we can be sure the memory is allocated and the sort will
> work.

But I prefer this approach. If we ever want to have more than one
process writing out data for checkpoints we're going to need it
anyway. And that's something not that far away for large setups
imo. Especially due to checksums.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-04 06:59:04
Message-ID: CA+CSw_ug+QifFh9dmgdMNj8sL+PsSf5pv5Vb6Ev9qMv4g9Px2A@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 30, 2014 at 8:50 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>> On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
>>> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
>>> them out in order (http://www.postgresql.org/message-id/flat/20070614153758(dot)6A62(dot)ITAGAKI(dot)TAKAHIRO(at)oss(dot)ntt(dot)co(dot)jp).
>>> The performance impact of that was inconclusive, but one thing that it
>>> allows nicely is to interleave the fsyncs, so that you write all the buffers
>>> for one file, then fsync it, then next file and so on.
>
>> ...
>> So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this. The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

The same objection came up last time I tried to push for sorted
checkpoints. I did not find any reference to where it caused a loss,
nor was I able to come up with a case where writing out in arbitrary
order would be better than writing out in file-sequential order. In
fact, if we ask for low latency, the OS must keep the backlog small,
eliminating any chance of write-combining writes that arrive out of
order.

I have a use case where the system continuously loads data into time
partitioned, indexed tables; at every checkpoint all of the indexes of
the latest partition need to be written out. The only way I could get
the write out to happen with sequential I/O was to set
checkpoint_completion_target to zero and ensure the OS cache allows for
enough dirty pages to absorb the whole checkpoint. The fsync that
followed did obviously nasty things to latency. Whereas sorted
checkpoints were able to do sequential I/O with checkpoint spreading
and low latency tuned OS virtual memory settings. I can create a
benchmark that shows this behavior if you need additional data points
beyond pgbench's OLTP workload to convince you that sorting checkpoint
writes is a good idea.

I did just come up with a case where plain sorting might cause an
issue. If the writes go to different I/O devices then naive sorting
will first use one device and then the other, whereas arbitrary writing
will load balance between the devices. Assuming that separate
tablespaces are used for separate I/O devices, it should be enough to
just interleave the writes of each tablespace, weighted by the amount of
writes per tablespace.
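
Something along these lines, perhaps (a toy sketch, names invented): keep
a per-tablespace progress counter and always issue the next write from the
tablespace that is furthest behind its proportional share:

typedef struct TablespaceProgress
{
    int     total;      /* dirty buffers belonging to this tablespace */
    int     issued;     /* writes issued so far */
} TablespaceProgress;

/* return the tablespace to take the next write from, or -1 when done */
static int
next_tablespace(const TablespaceProgress *ts, int nts)
{
    int     best = -1;
    double  best_frac = 2.0;    /* anything above 1.0 works as a sentinel */

    for (int i = 0; i < nts; i++)
    {
        double  frac;

        if (ts[i].issued >= ts[i].total)
            continue;           /* this tablespace is finished */
        frac = (double) ts[i].issued / ts[i].total;
        if (frac < best_frac)
        {
            best_frac = frac;
            best = i;
        }
    }
    return best;
}

The caller would pop the head of that tablespace's sorted run, issue the
write and increment issued, which keeps every device busy roughly in
proportion to how much it has to absorb.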

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-04 07:09:30
Message-ID: CA+CSw_uM-sUbs1cb1Pr--wbrgjuuCm+4qvbozhX2o_+zFLW8bA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> It's imo quite clearly better to keep it allocated. For one after
> postmaster started the checkpointer successfully you don't need to be
> worried about later failures to allocate memory if you allocate it once
> (unless the checkpointer FATALs out which should be exceedingly rare -
> we're catching ERRORs). It's much much more likely to succeed
> initially. Secondly it's not like there's really that much time where no
> checkpointer isn't running.

In principle you could do the sort with the full sized array and then
compress it to a list of buffer IDs that need to be written out. This
way most of the time you only need a small array and the large array
is only needed for a fraction of a second.
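
Roughly (sketch only; "entries" and "entry_cmp" stand in for whatever the
full-sized sort array and its comparator end up being):

/* sort the full-sized array, then squeeze it down to just buffer ids */
qsort(entries, n_dirty, sizeof(entries[0]), entry_cmp);

int *to_write = malloc(n_dirty * sizeof(int));
for (int i = 0; i < n_dirty; i++)
    to_write[i] = entries[i].buf_id;

/* the big array can be freed or reused immediately; to_write costs only
 * about 4 bytes per dirty buffer for the rest of the checkpoint */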

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-04 17:01:03
Message-ID: CA+TgmoY6uq=bphJeFDrNZUsqtYHr5t0GAR=RefWE3v9Cj_8G1g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma <ants(at)cybertec(dot)at> wrote:
> On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> It's imo quite clearly better to keep it allocated. For one after
>> postmaster started the checkpointer successfully you don't need to be
>> worried about later failures to allocate memory if you allocate it once
>> (unless the checkpointer FATALs out which should be exceedingly rare -
>> we're catching ERRORs). It's much much more likely to succeed
>> initially. Secondly it's not like there's really that much time where no
>> checkpointer isn't running.
>
> In principle you could do the sort with the full sized array and then
> compress it to a list of buffer IDs that need to be written out. This
> way most of the time you only need a small array and the large array
> is only needed for a fraction of a second.

It's not the size of the array that's the problem; it's the size of
the detonation when the allocation fails.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: didier <did447(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-05 08:17:29
Message-ID: CAJRYxuK1oX_u6fXXyaway+iQB8hXeOLN1qw_ohaLksV2Qtm4PQ@mail.gmail.com
Lists: pgsql-hackers

hi

On Thu, Sep 4, 2014 at 7:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma <ants(at)cybertec(dot)at> wrote:
>> On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> It's imo quite clearly better to keep it allocated. For one after
>>> postmaster started the checkpointer successfully you don't need to be
>>> worried about later failures to allocate memory if you allocate it once
>>> (unless the checkpointer FATALs out which should be exceedingly rare -
>>> we're catching ERRORs). It's much much more likely to succeed
>>> initially. Secondly it's not like there's really that much time where no
>>> checkpointer isn't running.
>>
>> In principle you could do the sort with the full sized array and then
>> compress it to a list of buffer IDs that need to be written out. This
>> way most of the time you only need a small array and the large array
>> is only needed for a fraction of a second.
>
> It's not the size of the array that's the problem; it's the size of
> the detonation when the allocation fails.
>
You can use a file-backed memory array.
Or, because it's only a hint, and
- the keys are in the buffers themselves (BufferTag), right?
- the only transition is from 'data I care about' to 'data I don't care
about' when a buffer is concurrently evicted while sorting,
you could use a pre-allocated buffer index array and read the keys from
the buffers when sorting, without memory barriers, spinlocks, whatever.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: didier <did447(at)gmail(dot)com>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-05 18:09:39
Message-ID: CA+Tgmoa-C9fhAM9VynczXqhYmH40t6kBzid7hkbr_ZBrSQXn5A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 5, 2014 at 4:17 AM, didier <did447(at)gmail(dot)com> wrote:
>> It's not the size of the array that's the problem; it's the size of
>> the detonation when the allocation fails.
>>
> You can use a file backed memory array
> Or because it's only a hint and
> - keys are in buffers (BufferTag), right?
> - transition is only from 'data I care to data I don't care' if a
> buffer is concurrently evicted when sorting.
>
> Use a pre allocate buffer index array an read keys from buffers when
> sorting, without memory barrier, spinlocks, whatever.

If by a file-backed memory array you mean mmap() something, that
doesn't do anything to address the possibility of failure. mmap(),
like malloc(), can fail.

But I think we don't really need to have this argument. Andres's
proposal to put this in the main shared memory segment sounds fine to
me. If allocating that fails, you'll know to reduce shared_buffers
and try again. If it succeeds, you should be safe after that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: didier <did447(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-09-05 19:18:47
Message-ID: CAGTBQpYfD6-vtj=Nni4NnDz_6qGoCR6KOWkkPJCovepivkMphg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 5, 2014 at 3:09 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 5, 2014 at 4:17 AM, didier <did447(at)gmail(dot)com> wrote:
>>> It's not the size of the array that's the problem; it's the size of
>>> the detonation when the allocation fails.
>>>
>> You can use a file backed memory array
>> Or because it's only a hint and
>> - keys are in buffers (BufferTag), right?
>> - transition is only from 'data I care to data I don't care' if a
>> buffer is concurrently evicted when sorting.
>>
>> Use a pre allocate buffer index array an read keys from buffers when
>> sorting, without memory barrier, spinlocks, whatever.
>
> If by a file-backed memory array you mean mmap() something, that
> doesn't do anything to address the possibility of failure. mmap(),
> like malloc(), can fail.

It's worse. If your process needs to page from disk when touching a
byte in the mmap'd array, and the disk somehow fails to work, you get
a segmentation fault.