Group Commits Vs WAL Writes

From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Group Commits Vs WAL Writes
Date: 2013-06-27 07:56:59
Message-ID: CAOeZVidRgO9rgWA_xw1MfANX=vn6e-w0jNS6m_7kTLyc4gRbPA@mail.gmail.com
Lists: pgsql-hackers

Hi all,

I think this is a naive question.

When we do a commit, WAL buffers are written to the disk. This incurs
disk latency for the required I/O.

Now, with group commits, do we see a spike in that disk write latency,
especially in the cases where the user has set wal_buffers to a high
value?

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-27 15:30:56
Message-ID: CA+TgmobGoD5igUKpafiizjQyQZXeG2o_wcShzt8QiRjGOQS6Tg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 27, 2013 at 3:56 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
> When we do a commit, WAL buffers are written to the disk. This incurs
> disk latency for the required I/O.

Check.

> Now, with group commits, do we see a spike in that disk write latency,
> especially in the cases where the user has set wal_buffers to a high
> value?

Well, it does take longer to fsync a larger byte range to disk than a
smaller byte range, in some cases. But it's generally more efficient
to write one larger range than many smaller ranges, so you come out
ahead on the whole.
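
To make the effect concrete, here is a minimal standalone sketch (not
PostgreSQL code; the file name, record count, and sizes are arbitrary
choices) contrasting one flush per record with a single flush for the
whole batch:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define NRECS   100
    #define RECSIZE 64

    int
    main(void)
    {
        char    record[RECSIZE];
        char    batch[NRECS * RECSIZE];
        int     fd = open("walsim.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return 1;
        memset(record, 'x', sizeof(record));
        memset(batch, 'x', sizeof(batch));

        /* Unbatched: every record pays a full synchronous flush. */
        for (int i = 0; i < NRECS; i++)
        {
            write(fd, record, sizeof(record));
            fsync(fd);
        }

        /* Batched (the group-commit analogue): one write, one flush. */
        write(fd, batch, sizeof(batch));
        fsync(fd);

        close(fd);
        return 0;
    }

Timing the two halves (e.g. with clock_gettime()) should show the first
dominated by its 100 synchronous flushes.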

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-27 16:35:33
Message-ID: CAM3SWZS0AUQTeCWxuboGigXHur-adC_mGRTJGp_nFsjbEt3O4w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 27, 2013 at 12:56 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
> Now, with group commits, do we see a spike in that disk write latency,
> especially in the cases where the user has set wal_buffers to a high
> value?

commit_delay exists to artificially increase the window in which the
leader backend waits for more group commit followers. At higher client
counts, that isn't terribly useful because you'll naturally have
enough clients anyway, but at lower client counts particularly where
fsyncs have high latency, it can help quite a bit. I mention this
because clearly commit_delay is intended to trade off latency for
throughput. Having said that, when I worked on commit_delay, the
average and worst-case latencies actually *improved* for the workload
in question, which consisted of lots of small write transactions. I
wouldn't be surprised, though, if you could produce a reasonable case
where latency was hurt a bit even as throughput improved.
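
For reference, the gist of that logic lives in XLogFlush(); paraphrased
from memory rather than quoted verbatim, it is roughly:

    /*
     * Sleep before flushing, in the hope that other committers queue up
     * behind the leader and get their WAL flushed by this one fsync.
     */
    if (CommitDelay > 0 && enableFsync &&
        MinimumActiveBackends(CommitSiblings))
        pg_usleep(CommitDelay);
    /* ... then write and fsync WAL up to the requested LSN ... */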

--
Peter Geoghegan


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-27 16:51:43
Message-ID: CAOeZVifktS0L+UHJj0WoAa_LCTKr3AYe+7P5r2mazG+JPsuB8A@mail.gmail.com
Lists: pgsql-hackers

>
> commit_delay exists to artificially increase the window in which the
> leader backend waits for more group commit followers. At higher client
> counts, that isn't terribly useful because you'll naturally have
> enough clients anyway, but at lower client counts particularly where
> fsyncs have high latency, it can help quite a bit. I mention this
> because clearly commit_delay is intended to trade off latency for
> throughput. Having said that, when I worked on commit_delay, the
> average and worst-case latencies actually *improved* for the workload
> in question, which consisted of lots of small write transactions. I
> wouldn't be surprised, though, if you could produce a reasonable case
> where latency was hurt a bit even as throughput improved.

Thanks for your reply.

Intuitively, latency should take a hit when commit_delay is applied,
so I am really interested in why we see an improvement instead.

Can the small writes be the reason?

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-27 16:54:34
Message-ID: CAOeZVidycMg891PwyKDEo_+8SEQwct_guCPzFCBA7AKh_NgQ_A@mail.gmail.com
Lists: pgsql-hackers

> Well, it does take longer to fsync a larger byte range to disk than a
> smaller byte range, in some cases. But it's generally more efficient
> to write one larger range than many smaller ranges, so you come out
> ahead on the whole.

Right, that does make sense.

So, the overhead of writing a lot of WAL buffers is mitigated because
one large write is better than lots of smaller writes?

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-27 18:32:55
Message-ID: CA+TgmoZV=kmNH37zNx+CJpxh2MQfEEPfkeTyPCXmi=04qAMqaA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 27, 2013 at 12:54 PM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>> Well, it does take longer to fsync a larger byte range to disk than a
>> smaller byte range, in some cases. But it's generally more efficient
>> to write one larger range than many smaller ranges, so you come out
>> ahead on the whole.
>
> Right, that does make sense.
>
> So, the overhead of writing a lot of WAL buffers is mitigated because
> one large write is better than lots of smaller writes?

Yep. To take a degenerate case, suppose that you had many small WAL
records, say 64 bytes each, so more than 100 per 8K block. If you
flush those one by one, you're going to rewrite that block 100 times.
If you flush them all at once, you write that block once.
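In concrete terms, an 8K block holds 128 such 64-byte records, so
record-by-record flushing rewrites that one block 128 times: roughly
1MB of physical writes, and 128 synchronous flushes, to persist just
8K of data.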

But even when the range is more than the minimum write size (8K for
WAL), there are still wins. Writing 16K or 24K or 32K submitted as a
single request can likely be done in a single revolution of the disk
head. But if you write 8K and wait until it's done, and then write
another 8K and wait until that's done, the second request may not
arrive until after the disk head has passed the position where the
second block needs to go. Now you have to wait for the drive to spin
back around to the right position.
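(At 7200 rpm a revolution takes about 8.3 ms, so writing 24K as three
synchronous 8K writes can miss the spot twice and lose up to ~17 ms
that a single 24K write would not.)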

The details of course vary with the hardware in use, but there are
very few I/O operations where batching smaller requests into larger
chunks doesn't help to some degree. Of course, the optimal transfer
size does vary considerably based on the type of I/O and the specific
hardware in use.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-27 22:32:42
Message-ID: CAMkU=1zSyAuCP6cQ=9NjX-3nwW00qPcAz93JUnEDWBRacVWtmg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 27, 2013 at 9:51 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:

> >
> > commit_delay exists to artificially increase the window in which the
> > leader backend waits for more group commit followers. At higher client
> > counts, that isn't terribly useful because you'll naturally have
> > enough clients anyway, but at lower client counts particularly where
> > fsyncs have high latency, it can help quite a bit. I mention this
> > because clearly commit_delay is intended to trade off latency for
> > throughput. Having said that, when I worked on commit_delay, the
> > average and worst-case latencies actually *improved* for the workload
> > in question, which consisted of lots of small write transactions. I
> > wouldn't be surprised, though, if you could produce a reasonable case
> > where latency was hurt a bit even as throughput improved.
>

Throughput and average latency are strictly reciprocal, aren't they? I
think when people talk about improving latency, they must mean something
like "improving 95th-percentile latency", not average latency. Otherwise
it doesn't seem to make much sense to me; they are the same thing.
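(For a closed benchmark this is just Little's law: with N clients each
waiting on its own commit and no think time, N = throughput * average
latency, so at fixed N the two are exact reciprocals of each other.)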

>
> Thanks for your reply.
>
> The logic says that latency will be hit when commit_delay is applied,
> but I am really interested in why we get an improvement instead.
>

There is a spot on the disk to which the current WAL is destined to go.
That spot on the disk is not going to be under the write-head for (say)
another 6 milliseconds.

Without commit_delay, I try to commit my record, but find that someone else
is already on the lock (and on the fsync as well). I have to wait for 6
milliseconds before that person gets their commit done and releases the
lock, then I can start mine, and have to wait another 8 milliseconds (7500
rpm disk) for the spot to come around again, for a total of 14 milliseconds
of latency.

With commit_delay, I get my record in under the nose of the person who is
already doing the delay, and they wake up and flush it for me in time to
make the 6 millisecond cutoff. Total 6 milliseconds latency for me.

One thing I tried a while ago (before the recent group-commit changes were
made) was to record in shared memory when the last fsync finished, and then
the next time someone needed to fsync, they would sleep until just before
the write spot was predicted to be under the write head again
(previous_finish + rotation_time - safety_margin, where rotation_time -
safety_margin was represented by a single GUC). It worked pretty well on
the system on which I wrote it, but seemed too brittle to be a general
solution.
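
Roughly, the idea looked like this (a from-scratch sketch rather than
the actual patch; the names are invented, and last_fsync_done_us lived
in shared memory in the real version):

    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    static uint64_t last_fsync_done_us;

    static uint64_t
    now_us(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    /* The second argument corresponds to the single GUC above. */
    static void
    timed_fsync(int fd, uint64_t rotation_minus_margin_us)
    {
        uint64_t    target = last_fsync_done_us + rotation_minus_margin_us;
        uint64_t    now = now_us();

        if (now < target)
            usleep(target - now);   /* wait out the rest of the rotation */
        fsync(fd);
        last_fsync_done_us = now_us();
    }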

Another thing I tried was to drop the WALWriteLock after the WAL write
finished but before calling fsync. The theory was that process 1 could
write its WAL and then block on the fsync, and then process 2 could also
write its WAL and also block directly on the fsync, and the kernel/disk
controller would be smart enough to realize that it could merge the two
pending fsync requests into one. This did not work at all, possibly
because my disk controller was very cheap and not very smart.

Cheers,

Jeff


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-28 09:57:55
Message-ID: CAOeZVieocN+YxpETcxoCQf8ZgJwnu07k8FDTYKfQZLagOtFTKw@mail.gmail.com
Lists: pgsql-hackers

>
> There is a spot on the disk to which the current WAL is destined to go.
> That spot on the disk is not going to be under the write-head for (say)
> another 6 milliseconds.
>
> Without commit_delay, I try to commit my record, but find that someone else
> is already on the lock (and on the fsync as well). I have to wait for 6
> milliseconds before that person gets their commit done and releases the
> lock, then I can start mine, and have to wait another 8 milliseconds (7500
> rpm disk) for the spot to come around again, for a total of 14 milliseconds
> of latency.
>
> With commit_delay, I get my record in under the nose of the person who is
> already doing the delay, and they wake up and flush it for me in time to
> make the 6 millisecond cutoff. Total 6 milliseconds latency for me.

Right. The example makes it very clear. Thanks for such a detailed explanation.

> One thing I tried a while ago (before the recent group-commit changes were
> made) was to record in shared memory when the last fsync finished, and then
> the next time someone needed to fsync, they would sleep until just before
> the write spot was predicted to be under the write head again
> (previous_finish + rotation_time - safety_margin, where rotation_time -
> safety_margin was represented by a single GUC). It worked pretty well on
> the system on which I wrote it, but seemed too brittle to be a general
> solution.

Could we look into deriving a general formula for the above? It looks
a bit touchy, but it could be made to work. Another option is to add a
probabilistic factor to the formula. I don't know; it just seems to be
a hunch I have.

Regards,

Atri

--
Regards,

Atri
l'apprenant


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Group Commits Vs WAL Writes
Date: 2013-06-28 10:04:12
Message-ID: CAOeZVicRV3gDyud-AY98sUvUcRHSYiRMTHbey94tPGB07UvC4Q@mail.gmail.com
Lists: pgsql-hackers

>
> Yep. To take a degenerate case, suppose that you had many small WAL
> records, say 64 bytes each, so more than 100 per 8K block. If you
> flush those one by one, you're going to rewrite that block 100 times.
> If you flush them all at once, you write that block once.
>
> But even when the range is more than the minimum write size (8K for
> WAL), there are still wins. Writing 16K or 24K or 32K submitted as a
> single request can likely be done in a single revolution of the disk
> head. But if you write 8K and wait until it's done, and then write
> another 8K and wait until that's done, the second request may not
> arrive until after the disk head has passed the position where the
> second block needs to go. Now you have to wait for the drive to spin
> back around to the right position.
>
> The details of course vary with the hardware in use, but there are
> very few I/O operations where batching smaller requests into larger
> chunks doesn't help to some degree. Of course, the optimal transfer
> size does vary considerably based on the type of I/O and the specific
> hardware in use.

This makes a lot of sense. I was always under the impression that
batching small requests into larger ones adds I/O latency overhead,
but it seems to be the other way round, which I now understand.

Thanks a ton,

Regards,

Atri

--
Regards,

Atri
l'apprenant