From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: postgresql latency & bgwriter not doing its job
Date: 2014-08-26 08:34:46
Message-ID: 20140826083446.GG21544@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
> >Did you check whether xfs yields a, err, more predictable performance?
>
> No. I cannot test that easily without reinstalling the box. I did some quick
> tests with ZFS/FreeBSD which seemed to freeze in the same way, though not
> under exactly the same conditions. Maybe I could try again.

After Robert and I went to LSF/MM this spring I sent out a test program
for precisely this problem; while it could *crash* machines when using
ext4, xfs yielded much more predictable performance. There's a problem
with the prioritization of write vs. read IO that's apparently FS
dependent.

> >[...] Note that it would *not* be a good idea to make the bgwriter write
> >out everything, as much as possible - that'd turn sequential write io into
> >random write io.
>
> Hmmm. I'm not sure it would be necessary the case, it depends on how
> bgwriter would choose the pages to write? If they are chosen randomly then
> indeed that could be bad.

They essentially have to be random to fulfil the bgwriter's role of
reducing the likelihood of a backend having to write out a buffer
itself. Consider how the clock sweep algorithm (not that I am happy
with it) works: when looking for a new victim buffer, all backends scan
the buffer cache in one continuous cycle. If they find a buffer with a
usagecount==0 they'll use that one and throw away its contents;
otherwise they reduce its usagecount by 1 and move on. What the
bgwriter *tries* to do is to write out dirty buffers with usagecount==0
that will soon be visited in the clock cycle, to avoid having the
backends do that themselves.
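
To make that concrete, here's a self-contained toy model of the sweep.
Illustrative only - the real code is in
src/backend/storage/buffer/freelist.c and adds locking, a freelist, a
usagecount cap and more:

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
    int  usage_count;   /* bumped on access, decremented by the sweep */
    bool dirty;
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int next_victim = 0;

/* What a backend does when it needs a victim buffer. */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        BufDesc *buf = &buffers[next_victim];
        int      id = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->usage_count == 0)
        {
            if (buf->dirty)
            {
                /* The write we'd like the bgwriter to have done
                 * already; if it hasn't, the backend stalls here
                 * writing the buffer out itself. */
                buf->dirty = false;
            }
            return id;          /* throw away contents, reuse buffer */
        }
        buf->usage_count--;     /* give it another lap */
    }
}

/* What the bgwriter tries to do: run ahead of the sweep and clean
 * dirty buffers that already have usage_count == 0, so backends
 * rarely hit the dirty case above. */
static void
bgwriter_pass(int lookahead)
{
    int pos = next_victim;

    for (int i = 0; i < lookahead; i++, pos = (pos + 1) % NBUFFERS)
    {
        if (buffers[pos].usage_count == 0 && buffers[pos].dirty)
            buffers[pos].dirty = false;     /* write it out */
    }
}

int
main(void)
{
    buffers[0].dirty = true;        /* dirty, usage_count 0 */
    buffers[1].usage_count = 2;     /* recently used */
    buffers[2].dirty = true;

    bgwriter_pass(NBUFFERS);        /* cleans buffers 0 and 2 */
    printf("victim: %d\n", clock_sweep_victim());   /* 0, already clean */
    return 0;
}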

> If there is a big sequential write, should not the
> backend do the write directly anyway? ISTM that currently checkpoint is
> mostly random writes anyway, at least with the OLTP write load of pgbench.
> I'm just trying to be able to start them earlier so that they can be
> completed quickly.

If the IO scheduling worked - which in many cases it really doesn't -
there'd be no need to make it finish fast. I think you should try to
tune spread checkpoints to have less impact, not make the bgwriter do
something it wasn't written for.
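
E.g. something along these lines in postgresql.conf - the values are
purely illustrative, the right ones depend on WAL volume and hardware:

    checkpoint_timeout = 15min           # checkpoint less often
    checkpoint_segments = 64             # don't get forced early by WAL volume
    checkpoint_completion_target = 0.9   # spread writes over 90% of the interval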

> So although bgwriter is not the solution, ISTM that pg has no reason to wait
> for minutes before starting to write dirty pages, if it has nothing else to
> do.

That precisely *IS* a spread checkpoint.

> If the OS does some retention later and cannot spread the load, as Josh
> suggests, this could also be a problem, but currently the OS seems not to
> have much to write (but WAL) till the checkpoint.

The actual problem is that the writes by the checkpointer - done in the
background - aren't flushed out of the OS's page cache eagerly enough.
Then, when the final phase of the checkpoint comes, where relation
files need to be fsynced, some filesystems essentially stall while
trying to write out lots and lots of dirty buffers.
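
On Linux you can work around that by initiating writeback explicitly.
A sketch of the kind of hinting I mean - postgres does *not* do this
today, and the helper below is hypothetical (lowering the
vm.dirty_background_bytes sysctl achieves something similar globally):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Ask the kernel to start writing back a range we just wrote, without
 * waiting for completion.  That keeps the amount of dirty data small,
 * so the fsync() at the end of the checkpoint has little left to do. */
static void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}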

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
