Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2016-01-07 20:08:10
Message-ID: alpine.DEB.2.10.1601071613040.5278@sto
Lists: pgsql-hackers


Hello Andres,

>> One of the points of aggregating flushes is that the range flush call
>> cost is significant, as shown by preliminary tests I did, probably
>> upthread, so it makes sense to limit this cost, hence the aggregation.
>> This removed some performance regressions I had in some cases.
>
> FWIW, my tests show that flushing for clean ranges is pretty cheap.

Yes, I agree that it is quite cheap, but I had a few % tps regressions
in some cases without aggregating, and aggregating was enough to avoid
these small regressions.
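
To make the aggregation concrete, here is a minimal sketch of what merging
contiguous writes into one flush request could look like, assuming Linux's
sync_file_range(); the type and function names below are mine, for
illustration only, not the patch's actual code:

/*
 * Illustration only: accumulate the contiguous range of blocks just
 * written to a file and issue one flush request for the whole range,
 * rather than one call per block.  Assumes Linux's sync_file_range();
 * a real implementation would hide this behind a configure test.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192

typedef struct PendingFlush
{
    int     fd;                 /* file the pending range belongs to */
    off_t   start;              /* first byte of the pending range */
    off_t   len;                /* length of the pending range in bytes */
} PendingFlush;

/* Issue one flush request for whatever has been accumulated so far. */
static void
flush_pending(PendingFlush *pf)
{
    if (pf->len > 0)
    {
        /* start kernel writeback for the whole range in a single call */
        (void) sync_file_range(pf->fd, pf->start, pf->len,
                               SYNC_FILE_RANGE_WRITE);
        pf->len = 0;
    }
}

/* Called after each buffer write: merge into the pending range if adjacent. */
static void
record_write(PendingFlush *pf, int fd, off_t offset)
{
    if (pf->len > 0 && fd == pf->fd && offset == pf->start + pf->len)
        pf->len += BLCKSZ;      /* contiguous with the pending range: extend */
    else
    {
        flush_pending(pf);      /* other file or a gap: flush and restart */
        pf->fd = fd;
        pf->start = offset;
        pf->len = BLCKSZ;
    }
}

The point of record_write() is precisely that when writes arrive in file and
offset order nearly all of them fall into the "extend" branch, so the number
of flush syscalls stays small.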

>> Also, the granularity of the buffer flush call is a file + offset + size,
>> so it necessarily has to be done this way (i.e. per file).
>
> What syscalls we issue, and at what level we track outstanding flushes,
> doesn't have to be the same.

Sure. But the current version is simple, efficient and proven by many
runs, so there should be a very strong argument, showing a significant
benefit, to justify changing the approach, and I see no such thing in your
arguments.

For me the current approach is optimal for the checkpointer, because it
takes advantage of all the available information to do a better job.

>> Once buffers are sorted by file and by offset within each file, the
>> written buffers are as close to one another as possible, the merging is
>> very easy to compute (it is done on the fly, with no need to keep a list
>> of buffers, for instance) and optimally effective, and once the
>> checkpointer moves past a file it will never go back to it before the
>> next checkpoint, so there is no reason not to flush right then.
>
> Well, that's true if there's only one tablespace, but e.g. not the case
> with two tablespaces of about the same number of dirty buffers.

ISTM that in the version of the patch I sent there was one flushing
structure per tablespace, each doing its own flushing on its own files, so
it should work the same; only the writing intensity is divided by the
number of tablespaces. Or am I missing something?
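
To illustrate what I mean, a rough sketch of per-tablespace balancing,
reusing the hypothetical PendingFlush from the previous fragment; again,
the names are mine, not the patch's:

typedef unsigned int Oid;           /* as in PostgreSQL */

typedef struct TablespaceCkptState
{
    Oid             tsoid;          /* tablespace this state belongs to */
    int             num_to_write;   /* dirty buffers of this tablespace */
    int             num_written;    /* progress within this tablespace */
    PendingFlush    pending;        /* flush aggregation local to this tablespace */
} TablespaceCkptState;

/*
 * One write per tablespace per round, so the write rate is spread evenly
 * across tablespaces while each of them still merges and flushes its own
 * contiguous ranges on its own files.
 */
static void
write_one_round(TablespaceCkptState *ts, int nts)
{
    int     i;

    for (i = 0; i < nts; i++)
    {
        if (ts[i].num_written >= ts[i].num_to_write)
            continue;               /* this tablespace is already done */

        /*
         * Write the next sorted buffer of tablespace i (not shown), then
         * record it in that tablespace's own pending flush range with
         * record_write(&ts[i].pending, ...).
         */
        ts[i].num_written++;
    }
}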

>> So basically I do not see a clear positive advantage to your suggestion,
>> especially when taking into consideration the scheduling process of the
>> scheduler:
>
> I don't think it makes a big difference for the checkpointer alone, but
> it makes the interface much more suitable for other processes, e.g. the
> bgwriter, and normal backends.

Hmmm.

ISTM that the requirements are not exactly the same for the bgwriter and
backends vs the checkpointer. The checkpointer has the advantage of being
able to plan its IOs over the long term (the volume and timing are
known...), and the implementation takes full benefit of this planning by
sorting, scheduling and flushing buffers so as to generate writes that are
as sequential as possible.
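
To make the planning point concrete, a hypothetical sketch of the sorting
step: the checkpointer knows the whole set of buffers to write at
checkpoint start, so it can sort them by tablespace, relation file and
block, and the writes then come out in file order. Names are made up for
the example:

#include <stdlib.h>

typedef unsigned int Oid;           /* as in PostgreSQL */
typedef unsigned int BlockNumber;

typedef struct DirtyBufEntry
{
    Oid         tsId;               /* tablespace */
    Oid         relNode;            /* relation file */
    BlockNumber blockNum;           /* block within the relation file */
    int         buf_id;             /* shared buffer holding the page */
} DirtyBufEntry;

static int
dirty_buf_cmp(const void *pa, const void *pb)
{
    const DirtyBufEntry *a = pa;
    const DirtyBufEntry *b = pb;

    if (a->tsId != b->tsId)
        return (a->tsId < b->tsId) ? -1 : 1;
    if (a->relNode != b->relNode)
        return (a->relNode < b->relNode) ? -1 : 1;
    if (a->blockNum != b->blockNum)
        return (a->blockNum < b->blockNum) ? -1 : 1;
    return 0;
}

/* at checkpoint start: qsort(entries, n, sizeof(DirtyBufEntry), dirty_buf_cmp); */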

The bgwriter and backends have a much shorter horizon (a few seconds, or
just the one query being processed), so the solution will be less efficient
and probably messier on the coding side. This is life. I do not see why we
should give up the benefit of full planning in the checkpointer just
because other processes cannot do the same, especially as under plenty of
loads the checkpointer does most of the writing and so is the limiting
factor.

So I do not buy your suggestion for the checkpointer. Maybe it will be the
way to go for the bgwriter and backends; if so, fine for them.

>>> Imo that means that we'd better track writes on a relfilenode + block
>>> number level.
>>
>> I do not think that it is a better option. Moreover, the current approach
>> has been proven to be very effective on hundreds of runs, so redoing it
>> differently for the sake of it does not look like good resource allocation.
>
> For a subset of workloads, yes.

Hmmm. What I understood is that the workloads that show some performance
regressions (regressions that I have *not* seen in the many tests I ran)
are not due to checkpointer IOs, but rather appear in settings where most
of the writes are done by the backends or the bgwriter.

I do not see the point of rewriting the checkpointer for them, although
obviously I agree that something has to be done also for the other
processes.

Maybe if all the writes (bgwriter and checkpointer) were performed by the
same process, then some dynamic mixing, sorting and aggregating would make
sense, but this is currently not the case, and it would probably have quite
a limited effect.
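
For what it is worth, my rough understanding of the kind of interface you
are suggesting, sketched with made-up names (illustration only, not
existing code): individual writes are tracked at the (relation file, block)
level in a small table, and when it fills up the entries are sorted and
adjacent blocks of the same file are merged into ranges before flush
requests are issued:

#include <stdlib.h>

typedef unsigned int Oid;           /* as in PostgreSQL */
typedef unsigned int BlockNumber;

typedef struct PendingWrite
{
    Oid         relNode;            /* relation file the write went to */
    BlockNumber blockNum;           /* block that was written */
} PendingWrite;                     /* a real one would carry the fork too */

#define MAX_PENDING 32

typedef struct WritebackTracker
{
    int          n;
    PendingWrite writes[MAX_PENDING];
} WritebackTracker;

static int
pending_cmp(const void *pa, const void *pb)
{
    const PendingWrite *a = pa;
    const PendingWrite *b = pb;

    if (a->relNode != b->relNode)
        return (a->relNode < b->relNode) ? -1 : 1;
    if (a->blockNum != b->blockNum)
        return (a->blockNum < b->blockNum) ? -1 : 1;
    return 0;
}

/* Sort, merge adjacent blocks of a file into ranges, flush each range. */
static void
issue_pending(WritebackTracker *wt)
{
    int     i = 0;

    qsort(wt->writes, wt->n, sizeof(PendingWrite), pending_cmp);

    while (i < wt->n)
    {
        int     j = i + 1;

        while (j < wt->n &&
               wt->writes[j].relNode == wt->writes[i].relNode &&
               wt->writes[j].blockNum == wt->writes[j - 1].blockNum + 1)
            j++;

        /* issue one flush request covering writes[i..j-1] of this file */
        i = j;
    }
    wt->n = 0;
}

/* Any writing process records its writes here, whatever their ordering. */
static void
track_write(WritebackTracker *wt, Oid relNode, BlockNumber blockNum)
{
    if (wt->n == MAX_PENDING)
        issue_pending(wt);
    wt->writes[wt->n].relNode = relNode;
    wt->writes[wt->n].blockNum = blockNum;
    wt->n++;
}

With writes coming from the checkpointer already in sorted order this
degenerates into the aggregation above, so as said I doubt it buys the
checkpointer anything.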

Basically, I do not understand how changing the flushing organisation as
you suggest would improve checkpointer performance significantly; as far as
the checkpointer is concerned, it should only degrade performance compared
to the current version.

--
Fabien.
