Re: Design proposal: fsync absorb linear slider

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Design proposal: fsync absorb linear slider
Date: 2013-07-23 03:48:48
Message-ID: 51EDFD20.5060904@2ndQuadrant.com
Lists: pgsql-hackers

Recently I've been dismissing a lot of suggested changes to checkpoint
fsync timing without suggesting an alternative. I have a simple one in
mind that captures the biggest problem I see: that the number of
backend and checkpoint writes to a file are not connected at all.

We know that a 1GB relation segment can take a really long time to write
out. That could include up to 128 changed 8K pages, and we allow all of
them to get dirty before any are forced to disk with fsync.

Rather than second guess the I/O scheduling, I'd like to take this on
directly by recognizing that the size of the problem is proportional to
the number of writes to a segment. If you turned off fsync absorption
altogether, you'd be at an extreme that allows only 1 write before
fsync. That's low latency for each write, but terrible throughput. The
maximum throughput case of 128 writes has the terrible latency we get
reports about. But what if that trade-off was just a straight, linear
slider going from 1 to 128? Just move it to the latency vs. throughput
position you want, and see how that works out.

The implementation I had in mind was this one:

-Add an absorption_count to the fsync queue.

-Add a new latency vs. throughput GUC I'll call max_segment_absorb. Its
default value is -1 (or 0), which corresponds to ignoring this new behavior.

-Whenever the background writer absorbs an fsync call for a relation
that's already in the queue, increment the absorption count.

-When max_segment_absorb > 0, have the background writer scan for relations
where absorption_count > max_segment_absorb. When it finds one, call
fsync on that segment.
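
To make the shape of that concrete, here is a toy sketch in C of the
counter-and-threshold idea. It is not the real checkpointer/md.c code; the
structure and every name in it are invented for illustration:

/* Toy sketch of the counter-and-threshold idea above.  Not the real
 * checkpointer/md.c code: the structures and names are invented. */
#include <unistd.h>             /* fsync() */

#define NUM_SEGMENTS 16

typedef struct SegmentFsyncRequest
{
    int     fd;                 /* open descriptor for the 1GB segment */
    int     absorption_count;   /* writes absorbed since its last fsync */
} SegmentFsyncRequest;

static SegmentFsyncRequest queue[NUM_SEGMENTS];
static int  max_segment_absorb = 64;    /* GUC stand-in; <= 0 disables */

/* When an fsync request arrives for a segment already in the queue,
 * absorbing it just bumps the counter. */
static void
absorb_fsync_request(int segno)
{
    queue[segno].absorption_count++;
}

/* Background writer pass: any segment that has absorbed more writes than
 * the limit gets an early fsync, resetting its counter. */
static void
scan_for_early_fsyncs(void)
{
    int     segno;

    if (max_segment_absorb <= 0)
        return;                 /* proposed default: keep current behavior */

    for (segno = 0; segno < NUM_SEGMENTS; segno++)
    {
        if (queue[segno].absorption_count > max_segment_absorb)
        {
            (void) fsync(queue[segno].fd);
            queue[segno].absorption_count = 0;
        }
    }
}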

Note that it's possible for this simple scheme to be fooled when writes
are actually touching a small number of pages. A process that
constantly overwrites the same page is the worst case here. Overwrite
it 128 times, and this method would assume you've dirtied every page,
while only 1 will actually go to disk when you call fsync. It's
possible to track this better. The count mechanism could be replaced
with a bitmap of the 128 blocks, so that each absorb sets a bit instead of
incrementing a count. My gut feel is that this is more complexity than
is really necessary here. If in fact the fsync is cheaper than
expected, issuing it more often than needed isn't the worst problem to have
here.
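
For what the bitmap alternative would look like, a sketch with invented
names, sized from the 1GB segment and 8K page figures:

/* Alternative bookkeeping: one bit per 8K block in the segment, so
 * overwriting the same page repeatedly is only counted once.  Invented
 * names; sizes follow from the 1GB segment / 8K page figures above. */
#include <string.h>

#define BLOCK_SIZE          8192
#define SEGMENT_SIZE        (1024 * 1024 * 1024)
#define BLOCKS_PER_SEGMENT  (SEGMENT_SIZE / BLOCK_SIZE)

typedef struct SegmentDirtyMap
{
    unsigned char bits[BLOCKS_PER_SEGMENT / 8];
    int           dirty_blocks;     /* number of bits currently set */
} SegmentDirtyMap;

/* Absorb a write to one block: set its bit, count it only the first time. */
static void
absorb_block_write(SegmentDirtyMap *map, int blockno)
{
    unsigned char mask = (unsigned char) (1 << (blockno % 8));

    if ((map->bits[blockno / 8] & mask) == 0)
    {
        map->bits[blockno / 8] |= mask;
        map->dirty_blocks++;
    }
}

/* After the segment is fsync'd, forget everything. */
static void
reset_dirty_map(SegmentDirtyMap *map)
{
    memset(map->bits, 0, sizeof(map->bits));
    map->dirty_blocks = 0;
}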

I'd like to build this myself, but if someone else wants to take a shot
at it I won't mind. Just be aware the review is the big part here. I
should be honest about one thing: I have zero incentive to actually work
on this. The moderate amount of sponsorship money I've raised for 9.4
so far isn't getting anywhere near this work. The checkpoint patch
review I have been doing recently is coming out of my weekend volunteer
time.

And I can't get too excited about making this as my volunteer effort
when I consider what the resulting credit will look like. Coding is by
far the smallest part of work like this, first behind coming up with the
design in the first place. And both of those are way, way behind how
long review benchmarking takes on something like this. The way credit
is distributed for this sort of feature puts coding first, design not
credited at all, and maybe you'll see some small review credit for
benchmarks. That's completely backwards from the actual work ratio. If
all I'm getting out of something is credit, I'd at least like it to be
an appropriate amount of it.
--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-23 07:07:15
Message-ID: CAM3SWZSOu4ao8wNfPyggcEO=66K7pUFaCXbbqZirs-n-gZqt2A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 22, 2013 at 8:48 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> And I can't get too excited about making this as my volunteer effort when I
> consider what the resulting credit will look like. Coding is by far the
> smallest part of work like this, first behind coming up with the design in
> the first place. And both of those are way, way behind how long review
> benchmarking takes on something like this. The way credit is distributed
> for this sort of feature puts coding first, design not credited at all, and
> maybe you'll see some small review credit for benchmarks. That's completely
> backwards from the actual work ratio. If all I'm getting out of something
> is credit, I'd at least like it to be an appropriate amount of it.

FWIW, I think that's a reasonable request.

--
Peter Geoghegan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-23 14:56:29
Message-ID: CA+TgmoZbbdOa5geGwbGjQK6OinD3Tx-YLsbCWryD2VXDFY9wgQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Recently I've been dismissing a lot of suggested changes to checkpoint fsync
> timing without suggesting an alternative. I have a simple one in mind that
> captures the biggest problem I see: that the number of backend and
> checkpoint writes to a file are not connected at all.
>
> We know that a 1GB relation segment can take a really long time to write
> out. That could include up to 128 changed 8K pages, and we allow all of
> them to get dirty before any are forced to disk with fsync.

By my count, it can include up to 131,072 changed 8K pages.
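
(Working that out explicitly: 1 GiB / 8 KiB = 1,073,741,824 bytes / 8,192
bytes = 131,072 pages.)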

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-23 16:13:57
Message-ID: 51EEABC5.8050901@2ndQuadrant.com
Lists: pgsql-hackers

On 7/23/13 10:56 AM, Robert Haas wrote:
> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>> We know that a 1GB relation segment can take a really long time to write
>> out. That could include up to 128 changed 8K pages, and we allow all of
>> them to get dirty before any are forced to disk with fsync.
>
> By my count, it can include up to 131,072 changed 8K pages.

Even better! I can pinpoint exactly what time last night I got tired
enough to start making trivial mistakes. Everywhere I said 128 it's
actually 131,072, which just changes the range of the GUC I proposed.

Getting the number right really highlights just how bad the current
situation is. Would you expect the database to dump up to 128K writes
into a file and then have low latency when it's flushed to disk with
fsync? Of course not. But that's the job the checkpointer process is
trying to do right now. And it's doing it blind--it has no idea how
many dirty pages might have accumulated before it started.

I'm not exactly sure how best to use the information collected. fsync
every N writes is one approach. Another is to use accumulated writes to
predict how long fsync on that relation should take. Whenever I tried
to spread fsync calls out before, the scale of the piled up writes from
backends was the input I really wanted available. The segment write
count gives an alternate way to sort the blocks too, you might start
with the heaviest hit ones.
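
To illustrate just that last idea: given some per-segment counter, ordering
the sync phase by it is a one-liner around qsort. Types and names here are
invented:

/* Illustration of ordering the sync phase by absorbed writes, heaviest
 * first.  Types and names invented. */
#include <stdlib.h>

typedef struct PendingSegment
{
    int fd;
    int absorption_count;
} PendingSegment;

static int
compare_by_writes_desc(const void *a, const void *b)
{
    const PendingSegment *sa = (const PendingSegment *) a;
    const PendingSegment *sb = (const PendingSegment *) b;

    return sb->absorption_count - sa->absorption_count;
}

/* Sync the heaviest-hit segments first; those are the fsyncs most likely
 * to stall, so starting them early spreads the pain across the checkpoint. */
static void
sort_pending_fsyncs(PendingSegment *pending, size_t n)
{
    qsort(pending, n, sizeof(PendingSegment), compare_by_writes_desc);
}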

In all these cases, the fundamental I keep coming back to is wanting to
cue off past write statistics. If you want to predict relative I/O
delay times with any hope of accuracy, you have to start the checkpoint
knowing something about the backend and background writer activity since
the last one.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-24 17:03:36
Message-ID: CA+TgmoavXUuiWAowqryoMcmNSPWeO=5eyvcYT9bzHqp4=2-RdA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>>>
>>> We know that a 1GB relation segment can take a really long time to write
>>> out. That could include up to 128 changed 8K pages, and we allow all of
>>> them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better! I can pinpoint exactly what time last night I got tired enough
> to start making trivial mistakes. Everywhere I said 128 it's actually
> 131,072, which just changes the range of the GUC I proposed.
>
> Getting the number right really highlights just how bad the current
> situation is. Would you expect the database to dump up to 128K writes into
> a file and then have low latency when it's flushed to disk with fsync? Of
> course not. But that's the job the checkpointer process is trying to do
> right now. And it's doing it blind--it has no idea how many dirty pages
> might have accumulated before it started.
>
> I'm not exactly sure how best to use the information collected. fsync every
> N writes is one approach. Another is to use accumulated writes to predict
> how long fsync on that relation should take. Whenever I tried to spread
> fsync calls out before, the scale of the piled up writes from backends was
> the input I really wanted available. The segment write count gives an
> alternate way to sort the blocks too, you might start with the heaviest hit
> ones.
>
> In all these cases, the fundamental I keep coming back to is wanting to cue
> off past write statistics. If you want to predict relative I/O delay times
> with any hope of accuracy, you have to start the checkpoint knowing
> something about the backend and background writer activity since the last
> one.

So, I don't think this is a bad idea; in fact, I think it'd be a good
thing to explore. The hard part is likely to be convincing ourselves
of anything about how well or poorly it works on arbitrary hardware
under arbitrary workloads, but we've got to keep trying things until
we find something that works well, so why not this?

One general observation is that there are two bad things that happen
when we checkpoint. One is that we force all of the data in RAM out
to disk, and the other is that we start doing lots of FPIs. Both of
these things harm throughput. Your proposal allows the user to make
the first of those behaviors more frequent without making the second
one more frequent. That idea seems promising, and it also seems to
admit of many variations. For example, instead of issuing an fsync
when after N OS writes to a particular file, we could fsync the file
with the most writes every K seconds. That way, if the system has
busy and idle periods, we'll effectively "catch up on our fsyncs" when
the system isn't that busy, and we won't bunch them up too much if
there's a sudden surge of activity.
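
In rough C, with invented names and K hard-coded, the variant I have in
mind is just:

/* Invented names; "K" hard-coded as fsync_interval_secs. */
#include <unistd.h>             /* fsync(), sleep() */

#define NUM_SEGMENTS 16

typedef struct SegmentState
{
    int fd;
    int absorption_count;       /* OS writes since this file's last fsync */
} SegmentState;

static SegmentState segments[NUM_SEGMENTS];
static unsigned int fsync_interval_secs = 10;

static void
fsync_hottest_file_loop(void)
{
    for (;;)
    {
        int     i;
        int     hottest = -1;

        for (i = 0; i < NUM_SEGMENTS; i++)
        {
            if (segments[i].absorption_count > 0 &&
                (hottest < 0 ||
                 segments[i].absorption_count >
                 segments[hottest].absorption_count))
                hottest = i;
        }

        if (hottest >= 0)
        {
            (void) fsync(segments[hottest].fd);
            segments[hottest].absorption_count = 0;
        }

        /* Idle periods "catch up on fsyncs" for free; a surge only gets
         * one extra fsync per interval. */
        sleep(fsync_interval_secs);
    }
}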

Now that's just a shot in the dark and there might be reasons why it's
terrible, but I just generally offer it as food for thought that the
triggering event for the extra fsyncs could be chosen via a multitude
of different algorithms, and as you hack through this it might be
worth trying a few different possibilities.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: didier <did447(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-25 22:02:48
Message-ID: CAJRYxu++4DGwwXdyvNKpZ6_TXQXcy1EL=46Ck3sz0HG86H_ELQ@mail.gmail.com
Lists: pgsql-hackers

Hi

On Tue, Jul 23, 2013 at 5:48 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

> Recently I've been dismissing a lot of suggested changes to checkpoint
> fsync timing without suggesting an alternative. I have a simple one in
> mind that captures the biggest problem I see: that the number of backend
> and checkpoint writes to a file are not connected at all.
>
> We know that a 1GB relation segment can take a really long time to write
> out. That could include up to 128 changed 8K pages, and we allow all of
> them to get dirty before any are forced to disk with fsync.

It was surely already discussed, but why isn't PostgreSQL writing
sequentially its cache in a temporary file? With storage random speed at
least five to ten times slower, it could help a lot.

Thanks

Didier


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: didier <did447(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-25 23:53:01
Message-ID: CA+Tgmoa-06jCUrAaOD2z3SXoPBgJWBBQq8dMPb9fK4Ch9MZtqw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 25, 2013 at 6:02 PM, didier <did447(at)gmail(dot)com> wrote:
> It was surely already discussed but why isn't postresql writing
> sequentially its cache in a temporary file? With storage random speed at
> least five to ten time slower it could help a lot.
> Thanks

Sure, that's what the WAL does. But you still have to checkpoint eventually.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: didier <did447(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 05:45:02
Message-ID: CAJRYxuJoBrqZ4DCx2j7NCn860zva5o5bGvFqP0aL07VN_=jCcw@mail.gmail.com
Lists: pgsql-hackers

Hi,

> Sure, that's what the WAL does. But you still have to checkpoint
> eventually.

Sure, when you run pg_ctl stop.
Unlike the WAL, it only needs two files of shared_buffers size.

I did bogus tests by replacing mask |= BM_PERMANENT; with mask = -1 in
BufferSync() and simulating checkpoint with a periodic dd if=/dev/zero
of=foo conv=fsync

On saturated storage with %usage locked solid at 100%, I got up to a 30%
speed improvement and fsync latency down by an order of magnitude. Some
fsyncs were still slow, of course, if buffers were already in the OS cache.

But that's the upper bound; it was done on slow storage with bad ratios:
(OS cache write)/(disk sequential write) around 50, and (sequential
write)/(effective random write) in the 10 range. A proper implementation
would have a 'little' more work to do... (only the checkpoint task can write
BM_CHECKPOINT_NEEDED buffers, keeping them dirty, and so on)

Didier


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: didier <did447(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 09:42:34
Message-ID: 51F2448A.6020006@2ndQuadrant.com
Lists: pgsql-hackers

On 7/25/13 6:02 PM, didier wrote:
> It was surely already discussed but why isn't postresql writing
> sequentially its cache in a temporary file?

If you do that, reads of the data will have to traverse that temporary
file to assemble their data. You'll make every later reader pay the
random I/O penalty that's being avoided right now. Checkpoints are
already postponing these random writes as long as possible. You have to
take care of them eventually though.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: didier <did447(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 09:59:53
Message-ID: 51F24899.3080700@2ndQuadrant.com
Lists: pgsql-hackers

On 07/26/2013 11:42 AM, Greg Smith wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed but why isn't postresql writing
>> sequentially its cache in a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data. You'll make every later reader pay the
> random I/O penalty that's being avoided right now. Checkpoints are
> already postponing these random writes as long as possible. You have
> to take care of them eventually though.
>
Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
"random" fs pages on one large disk page and having an extra index layer
for resolving random-to-sequential ordering.

I would not dismiss the idea without more tests and discussion.

We could have a system where "checkpoint" does sequential writes of dirty
wal buffers to alternating synced holding files (a "checkpoint log" :) )
and only the background writer does random writes, with no forced sync at all.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: didier <did447(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 10:02:20
Message-ID: 51F2492C.9090706@2ndQuadrant.com
Lists: pgsql-hackers

On 07/26/2013 11:42 AM, Greg Smith wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed but why isn't postresql writing
>> sequentially its cache in a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data.
In case of crash recovery, a sequential read of this file could be
performed as a first step.

This should work fairly well in most cases, at least when shared_buffers
at recovery is not smaller than the latest run of checkpoint-written
dirty buffers.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: didier <did447(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 11:01:59
Message-ID: 51F25727.7040404@2ndQuadrant.com
Lists: pgsql-hackers

On 7/26/13 5:59 AM, Hannu Krosing wrote:
> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
> "random"
> fs pages on one large disk page and having an extra index layer for
> resolving
> random-to-sequential ordering.

If your solution to avoiding random writes now is to do sequential ones
into a buffer, you'll pay for it by having more expensive random reads
later. In the SSD write buffer case, that works only because those
random reads are very cheap. Do the same thing on a regular drive, and
you'll be paying a painful penalty *every* time you read in return for
saving work *once* when you write. That only makes sense when your
workload is near write-only.

It's possible to improve on this situation by having some sort of
background process that goes back and cleans up the random data,
converting it back into sequentially ordered writes again. SSD
controllers also have firmware that does this sort of work, and Postgres
might do it as part of vacuum cleanup. But note that such work faces
exactly the same problems as writing the data out in the first place.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, didier <did447(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 12:32:01
Message-ID: 13245.1374841921@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <greg(at)2ndQuadrant(dot)com> writes:
> On 7/26/13 5:59 AM, Hannu Krosing wrote:
>> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
>> "random"
>> fs pages on one large disk page and having an extra index layer for
>> resolving
>> random-to-sequential ordering.

> If your solution to avoiding random writes now is to do sequential ones
> into a buffer, you'll pay for it by having more expensive random reads
> later.

What I'd point out is that that is exactly what WAL does for us, ie
convert a bunch of random writes into sequential writes. But sooner or
later you have to put the data where it belongs.

regards, tom lane


From: didier <did447(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 13:14:06
Message-ID: CAJRYxuL4nTF=uSsysE6cveVVHrJhJDfq4gnrTCYWZbO504o7yg@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:

> On 7/25/13 6:02 PM, didier wrote:
>
>> It was surely already discussed but why isn't postresql writing
>> sequentially its cache in a temporary file?
>>
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data. You'll make every later reader pay the random
> I/O penalty that's being avoided right now. Checkpoints are already
> postponing these random writes as long as possible. You have to take care
> of them eventually though.
>

No, the log file is only used at recovery time.

In the checkpoint code:
- Loop over the cache, marking dirty buffers with BM_CHECKPOINT_NEEDED, as
in the current code.
- Other workers can't write and evict these marked buffers to disk;
there's a race with fsync.
- Checkpoint fsyncs now, or after the next step.
- Checkpoint loops again and saves these buffers to the log, clearing
BM_CHECKPOINT_NEEDED but *not* clearing BM_DIRTY; of course many buffers
will be written again, as they are when a checkpoint isn't running.
- Checkpoint done.

During recovery you have to load the log in cache first before applying WAL.
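
Roughly, in untested C (only the flag names come from the real buffer
headers; the structures and helpers are invented):

/* Order-of-operations sketch of one reading of the steps above; not real
 * checkpoint code. */
#include <stdint.h>
#include <stddef.h>

#define BM_DIRTY                (1 << 0)
#define BM_CHECKPOINT_NEEDED    (1 << 1)

typedef struct ToyBuffer
{
    uint32_t flags;
} ToyBuffer;

/* Hypothetical helpers: append a page to the sequential snapshot file
 * ("checkpoint log") and force it to disk. */
static void write_buffer_to_snapshot_log(ToyBuffer *buf) { (void) buf; }
static void fsync_snapshot_log(void) { }

static void
checkpoint_with_snapshot_log(ToyBuffer *buffers, size_t nbuffers)
{
    size_t  i;

    /* Pass 1: mark dirty buffers, as the current code does. */
    for (i = 0; i < nbuffers; i++)
        if (buffers[i].flags & BM_DIRTY)
            buffers[i].flags |= BM_CHECKPOINT_NEEDED;

    /* Other workers must not write and evict marked buffers from here on,
     * or they race with the snapshot fsync. */

    /* Pass 2: dump marked buffers sequentially to the snapshot log.
     * BM_CHECKPOINT_NEEDED is cleared but BM_DIRTY is left set, so the
     * background writer still writes the pages to their home location
     * later, as it does when no checkpoint is running. */
    for (i = 0; i < nbuffers; i++)
    {
        if (buffers[i].flags & BM_CHECKPOINT_NEEDED)
        {
            write_buffer_to_snapshot_log(&buffers[i]);
            buffers[i].flags &= ~BM_CHECKPOINT_NEEDED;
        }
    }

    /* One sequential file, one fsync, and the checkpoint is done. */
    fsync_snapshot_log();
}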

Didier


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: didier <did447(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 13:41:16
Message-ID: 51F27C7C.3000209@2ndQuadrant.com
Lists: pgsql-hackers

On 7/26/13 9:14 AM, didier wrote:
> During recovery you have to load the log in cache first before applying WAL.

Checkpoints exist to bound recovery time after a crash. That is their
only purpose. What you're suggesting moves a lot of work into the
recovery path, which will slow down how long it takes to process.

More work at recovery time means someone who uses the default of
checkpoint_timeout='5 minutes', expecting that crash recovery won't take
very long, will discover it does take a longer time now. They'll be
forced to shrink the value to get the same recovery time as they do
currently. You might need to make checkpoint_timeout 3 minutes instead,
if crash recovery now has all this extra work to deal with. And when
the time between checkpoints drops, it will slow the fundamental
efficiency of checkpoint processing down. You will end up writing out
more data in the end.

The interval between checkpoints and recovery time are all related. If
you let any one side of the current requirements slip, it makes the rest
easier to deal with. Those are all trade-offs though, not improvements.
And this particular one is already an option.

If you want less checkpoint I/O per capita and don't care about recovery
time, you don't need a code change to get it. Just make
checkpoint_timeout huge. A lot of checkpoint I/O issues go away if you
only do a checkpoint per hour, because instead of random writes you're
getting sequential ones to the WAL. But when you crash, expect to be
down for a significant chunk of an hour, as you go back to sort out all
of the work postponed before.
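
Concretely, that is just configuration along these lines; the numbers are
only illustrative, and checkpoint_segments has to go up too or WAL volume
will still force checkpoints:

# postgresql.conf sketch: trade longer crash recovery for less frequent
# checkpoint I/O (numbers are only illustrative)
checkpoint_timeout = 60min            # default is 5min; 1h is the ceiling here
checkpoint_segments = 1024            # keep WAL volume from forcing checkpoints
checkpoint_completion_target = 0.9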

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, didier <did447(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 13:52:45
Message-ID: 51F27F2D.60809@2ndQuadrant.com
Lists: pgsql-hackers

On 7/26/13 8:32 AM, Tom Lane wrote:
> What I'd point out is that that is exactly what WAL does for us, ie
> convert a bunch of random writes into sequential writes. But sooner or
> later you have to put the data where it belongs.

Hannu was observing that SSDs often don't do that at all. They can
maintain logical -> physical translation tables that decode where each
block was written to forever. When read seeks are really inexpensive,
the only pressure to reorder blocks is wear leveling.

That doesn't really help with regular drives though, where the low seek
time assumption doesn't play out so well. The whole idea of writing
things sequentially and then sorting them out later was the rage in 2001
for ext3 on Linux, as part of the "data=journal" mount option. You can
go back and see that people are confused but excited about the
performance at
http://www.ibm.com/developerworks/linux/library/l-fs8/index.html

Spoiler: if you use a workload that has checkpoint issues, it doesn't
help PostgreSQL latency. Just like using a large write cache, you gain
some burst performance, but eventually you pay for it with extra latency
somewhere.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: didier <did447(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-26 15:11:06
Message-ID: CAJRYxuJFBJQcCAd2-0kfWecY=3T5qqDWnMaZ2vf3_g0HAMXpJg@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith <greg(at)2ndquadrant(dot)com>
wrote:

> On 7/26/13 9:14 AM, didier wrote:
>
>> During recovery you have to load the log in cache first before applying
>> WAL.
>>
>
> Checkpoints exist to bound recovery time after a crash. That is their
> only purpose. What you're suggesting moves a lot of work into the recovery
> path, which will slow down how long it takes to process.

Yes, it's slower, but you're sequentially reading only one file, at most
the size of your buffer cache; moreover it's constant time.

Let's say you make a checkpoint and crash just after, with a next-to-empty
WAL.

Now recovery is very fast but you have to repopulate your cache with
random reads from requests.

With the snapshot it's slower but you read, sequentially again, a lot of
hot cache you will need later when the db starts serving requests.

Of course the worst case is if it crashes just before a checkpoint; most of
the snapshot data is stale and will be overwritten by WAL ops.

But if the WAL recovery is CPU bound, loading from the snapshot may be
done concurrently while replaying the WAL.

> More work at recovery time means someone who uses the default of
> checkpoint_timeout='5 minutes', expecting that crash recovery won't take
> very long, will discover it does take a longer time now. They'll be forced
> to shrink the value to get the same recovery time as they do currently.
> You might need to make checkpoint_timeout 3 minutes instead, if crash
> recovery now has all this extra work to deal with. And when the time
> between checkpoints drops, it will slow the fundamental efficiency of
> checkpoint processing down. You will end up writing out more data in the
> end.
>
Yes, it's a trade-off: now you're paying the price at checkpoint time, every
time; with the log you're paying only once, at recovery.

>
> The interval between checkpoints and recovery time are all related. If
> you let any one side of the current requirements slip, it makes the rest
> easier to deal with. Those are all trade-offs though, not improvements.
> And this particular one is already an option.
>
> If you want less checkpoint I/O per capita and don't care about recovery
> time, you don't need a code change to get it. Just make checkpoint_timeout
> huge. A lot of checkpoint I/O issues go away if you only do a checkpoint
> per hour, because instead of random writes you're getting sequential ones
> to the WAL. But when you crash, expect to be down for a significant chunk
> of an hour, as you go back to sort out all of the work postponed before.

It's not the same; it's a snapshot saved and loaded in constant time, unlike
the WAL log.

Didier


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-07-29 06:04:56
Message-ID: 51F60608.3020902@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/07/24 1:13), Greg Smith wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>>> We know that a 1GB relation segment can take a really long time to write
>>> out. That could include up to 128 changed 8K pages, and we allow all of
>>> them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better! I can pinpoint exactly what time last night I got tired enough to
> start making trivial mistakes. Everywhere I said 128 it's actually 131,072,
> which just changes the range of the GUC I proposed.
I think that it is almost the same as a small dirty_background_ratio or
dirty_background_bytes.
This method will give very bad performance, and many fsync() calls may cause
the long-fsync situation you described in the past. My colleagues who are
kernel experts say that while fsync() is executing, if other processes write
to the same file a lot, the fsync call occasionally does not return. So too
many fsyncs on a large file is very dangerous. Moreover, fsync() also writes
metadata, which is worst for performance.

The essential improvement is not the dirty page size in fsync() but the
scheduling of the fsync phase.
I can't understand why Postgres does not consider scheduling of the fsync
phase. When dirty_background_ratio is big, the write phase does not write to
disk at all; therefore, fsync() is too heavy in the fsync phase.

> Getting the number right really highlights just how bad the current situation
> is. Would you expect the database to dump up to 128K writes into a file and then
> have low latency when it's flushed to disk with fsync? Of course not.
I think this problem can be improved by using sync_file_range() in the fsync
phase, and adding checkpoint scheduling to the fsync phase: execute
sync_file_range() over small ranges and sleep, and finally execute fsync().
I think it is better than your proposal.
If a system does not support the sync_file_range() system call, it only
executes fsync and sleeps, which is the same as our method (which you and I
posted in the past).
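
For reference, the fsync-phase loop I am describing would look roughly like
this untested sketch; the chunk size and sleep time are arbitrary:

/* Untested sketch of spreading out a segment's sync with sync_file_range(),
 * falling back to plain fsync() where the call does not exist. */
#define _GNU_SOURCE
#include <fcntl.h>              /* sync_file_range() on Linux */
#include <unistd.h>             /* fsync(), usleep() */
#include <sys/stat.h>

static void
spread_segment_sync(int fd)
{
#ifdef SYNC_FILE_RANGE_WRITE
    const off_t chunk = 8 * 1024 * 1024;    /* push 8MB at a time */
    struct stat st;

    if (fstat(fd, &st) == 0)
    {
        off_t   off;

        for (off = 0; off < st.st_size; off += chunk)
        {
            /* Start writeback of just this range, then pause so the
             * device is never handed the whole segment at once. */
            (void) sync_file_range(fd, off, chunk, SYNC_FILE_RANGE_WRITE);
            usleep(10000);
        }
    }
#endif
    /* The final fsync() is still needed: it covers metadata, and any
     * platform without sync_file_range() falls back to this alone. */
    (void) fsync(fd);
}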

Taken together my checkpoint proposal method,

* write phase
- Almost same, but considering fsync phase schedule.
- Considering case of background-write in OS, sort buffer before starting
checkpoint write.

* fsync phase
- Considering checkpoint schedule and write-phase schedule
- Executing separated sync_file_range() and sleep, in final fsync().

And if I can, I'd like a method that avoids writing a buffer to a file that
is the target of an fsync() call. I think it may be quite difficult.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Jim Nasby <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, didier <did447(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-08-21 22:31:34
Message-ID: 52153FC6.4030004@nasby.net
Lists: pgsql-hackers

On 7/26/13 7:32 AM, Tom Lane wrote:
> Greg Smith <greg(at)2ndQuadrant(dot)com> writes:
>> On 7/26/13 5:59 AM, Hannu Krosing wrote:
>>> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
>>> "random"
>>> fs pages on one large disk page and having an extra index layer for
>>> resolving
>>> random-to-sequential ordering.
>
>> If your solution to avoiding random writes now is to do sequential ones
>> into a buffer, you'll pay for it by having more expensive random reads
>> later.
>
> What I'd point out is that that is exactly what WAL does for us, ie
> convert a bunch of random writes into sequential writes. But sooner or
> later you have to put the data where it belongs.

FWIW, at RICon East there was someone from Seagate that gave a presentation. One of his points is that even spinning rust is moving to the point where the drive itself has to do some kind of write log. He notes that modern filesystems do the same thing, and the overlap is probably stupid (I pointed out that the most degenerate case is the logging database on the logging filesystem on the logging drive...)

It'd be interesting for Postgres to work with drive manufacturers to study ways to get rid of the extra layers of stupid...
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design proposal: fsync absorb linear slider
Date: 2013-08-27 10:26:30
Message-ID: 521C7ED6.3010108@2ndQuadrant.com
Lists: pgsql-hackers

On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost the same as a small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be
to a single 1GB relation chunk. The odds are better that multiple
writes will combine, and that the I/O will involve a lower than average
amount of random seeking. Whereas shrinking the size of the write cache
always results in more random seeking.

> The essential improvement is not the dirty page size in fsync() but the
> scheduling of the fsync phase.
> I can't understand why Postgres does not consider scheduling of the fsync
> phase.

Because it cannot get the sort of latency improvements I think people
want. I proved to myself it's impossible during the last 9.2 CF when I
submitted several fsync scheduling change submissions.

By the time you get to the fsync sync phase, on a system that's always
writing heavily there is way too much backlog to possibly cope with by
then. There just isn't enough time left before the checkpoint should
end to write everything out. You have to force writes to actual disk to
start happening earlier to keep a predictable schedule. Basically, the
longer you go without issuing a fsync, the more uncertainty there is
around how long it might take to fire. My proposal lets someone keep
all I/O from ever reaching the point where the uncertainty is that high.

In the simplest to explain case, imagine that a checkpoint includes a
1GB relation segment that is completely dirty in shared_buffers. When a
checkpoint hits this, it will have 1GB of I/O to push out.

If you have waited this long to fsync the segment, the problem is now
too big to fix by checkpoint time. Even if the 1GB of writes are
themselves nicely ordered and grouped on disk, the concurrent background
activity is going to chop the combination up into more random I/O than
the ideal.

Regular consumer disks have a worst case random I/O throughput of less
than 2MB/s. My observed progress rates for such systems show you're
lucky to get 10MB/s of writes out of them. So how long will the dirty
1GB in the segment take to write? 1GB @ 10MB/s = 102.4 *seconds*. And
that's exactly what I saw whenever I tried to play with checkpoint sync
scheduling. No matter what you do there, periodically you'll hit a
segment that has over a minute of dirty data accumulated, and >60 second
latency pauses result. By the point you've reached checkpoint, you're
dead when you call fsync on that relation. You *must* hit that segment
with fsync more often than once per checkpoint to achieve reasonable
latency.

With this "linear slider" idea, I might tune such that no segment will
ever get more than 256MB of writes before hitting a fsync instead. I
can't guarantee that will work usefully, but the shape of the idea seems
to match the problem.
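
(For comparison: 256MB at that same 10MB/s worst case is roughly 26 seconds
of writeback per fsync instead of 102, and 256MB / 8K pages = 32,768 absorbed
writes as the corresponding GUC value.)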

> Taken together my checkpoint proposal method,
> * write phase
> - Almost same, but considering fsync phase schedule.
> - Considering case of background-write in OS, sort buffer before
> starting checkpoint write.

This cannot work for the reasons I've outlined here. I guarantee you I
will easily find a test workload where it performs worse than what's
happening right now. If you want to play with this to learn more about
the trade-offs involved, that's fine, but expect me to vote against
accepting any change of this form. I would prefer you to not submit
them because it will waste a large amount of reviewer time to reach that
conclusion yet again. And I'm not going to be that reviewer.

> * fsync phase
> - Considering checkpoint schedule and write-phase schedule
> - Executing separated sync_file_range() and sleep, in final fsync().

If you can figure out how to use sync_file_range() to fine tune how much
fsync is happening at any time, that would be useful on all the
platforms that support it. I haven't tried it just because that looked
to me like a large job refactoring the entire fsync absorb mechanism,
and I've never had enough funding to take it on. That approach has a
lot of good properties, if it could be made to work without a lot of
code changes.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com