From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Doc patch making firm recommendation for setting the value of commit_delay
Date: 2013-01-27 02:31:30
Message-ID: 20130127023130.GB22856@tornado.leadboat.com
Lists: pgsql-hackers

Hi Peter,

I took a look at this patch and the benchmarks you've furnished:

https://github.com/petergeoghegan/commit_delay_benchmarks
http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html

On Wed, Nov 14, 2012 at 08:44:26PM +0000, Peter Geoghegan wrote:

> Attached is a doc-patch that makes recommendations that are consistent
> with my observations about what works best here. I'd like to see us
> making *some* recommendation - for sympathetic cases, setting
> commit_delay appropriately can make a very large difference to
> transaction throughput. Such sympathetic cases - many small write
> transactions - are something that tends to be seen relatively
> frequently with web applications, that disproportionately use cloud
> hosting. It isn't at all uncommon for these cases to be highly bound
> by their commit rate, and so it is compelling to try to amortize the
> cost of a flush as effectively as possible there. It would be
> unfortunate if no one was even aware that commit_delay is now useful
> for these cases, since the setting allows cloud hosting providers to
> help these cases quite a bit, without having to do something like
> compromise durability, which in general isn't acceptable.

Your fast-fsync (SSD, BBWC) benchmarks show a small loss up to 8 clients and a
10-20% improvement at 32 clients. That's on a 4-core/8-thread CPU, assuming
HT was left enabled. Your slow-fsync (laptop) benchmarks show a 40-100%
improvement in the 16-64 client range.

I did a few more benchmarks along the spectrum. First, I used a Mac, also
4-core/8-thread, with fsync_writethrough; fsync there takes roughly 70 ms, so
half of the fsync time gave commit_delay = 35000. I used pgbench, scale factor
100, 4-minute runs, three trials each:

 clients   tps @ commit_delay=0   tps @ commit_delay=35000
       8   51, 55, 63             82, 84, 86
      16   98, 100, 107           130, 134, 143
      32   137, 148, 157          192, 200, 201
      64   199, 201, 214          249, 256, 258

So that's a nice 27-53% improvement, fairly similar to the pattern for your
laptop pgbench numbers. Next, based on your comment about the possible value
for cloud-hosted applications, I tried a cc2.8xlarge (16-core, 32-thread)
GNU/Linux EC2 instance with a data directory on a standard EBS volume, ext4
filesystem. Several 15s pg_test_fsync runs could not agree on an fsync time;
I saw results from 694us to 1904us. Ultimately I settled on trying
commit_delay=500, scale factor 300:

 clients   tps @ commit_delay=0   tps @ commit_delay=500
      32   1224, 1391, 1584       1175, 1229, 1394
      64   1553, 1647, 1673       1544, 1546, 1632
     128   1717, 1833, 1900       1621, 1720, 1951
     256   1664, 1717, 1918       1734, 1832, 1918

The numbers are all over the place, but there's more loss than gain. Amit
Kapila also measured small losses in tps at 16 clients:

http://www.postgresql.org/message-id/000701cd6ff0$013a6210$03af2630$@kapila@huawei.com
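
For concreteness, the runs above amount to something along these lines; the
pgbench flags and database name here are illustrative rather than literal
transcripts of what I ran:

    pgbench -i -s 100 bench            # scale factor 100 (300 for the EC2 runs)

    # set commit_delay in postgresql.conf (0, 35000, or 500) and reload:
    pg_ctl reload -D "$PGDATA"

    for c in 8 16 32 64; do            # 32..256 for the EC2 runs
        pgbench -c $c -j $c -T 240 bench   # 4-minute run, three trials each
    done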

I was curious about the cost of the MinimumActiveBackends() call when relying
on commit_siblings to skip the delay. I ran a similar test with an extra 500
idle backends, clients=8, commit_siblings=20 (so the delay would never be
used), and either a zero or nonzero commit_delay. There was no appreciable
performance advantage from setting commit_delay=0 as opposed to relying on
commit_siblings to suppress the delay. That's good news.
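
A rough sketch of that kind of setup, just to illustrate the idea (not the
exact script): park the idle connections by leaving psql blocked on stdin,
then run pgbench with fewer clients than commit_siblings, so the delay itself
never fires and only the MinimumActiveBackends() scan is left to measure.

    # hold ~500 idle connections open; each psql waits on the pipe for an
    # hour, so its backend sits idle the whole time
    for i in $(seq 500); do
        sleep 3600 | psql -q bench &
    done

    # with commit_siblings = 20 in postgresql.conf and only 8 clients here,
    # the sleep is always skipped
    pgbench -c 8 -j 8 -T 240 bench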

On the GNU/Linux VM, pg_sleep() achieves precision on the order of 10us.
However, the sleep was consistently around 70us longer than requested. A
300us request yielded a 370us sleep, and a 3000us request gave a 3080us sleep.
Mac OS X was similarly precise for short sleeps, but it could oversleep a full
1000us on a 35000us sleep.

> diff doc/src/sgml/wal.sgml
> index fc5c3b2..92619dd
> *** a/doc/src/sgml/wal.sgml
> --- b/doc/src/sgml/wal.sgml
> ***************
> *** 375,382 ****
> just before a synchronous commit attempts to flush
> <acronym>WAL</acronym> to disk, in the hope that a single flush
> executed by one such transaction can also serve other transactions
> ! committing at about the same time. Setting <varname>commit_delay</varname>
> ! can only help when there are many concurrently committing transactions.
> </para>
>
> </sect1>
> --- 375,397 ----

The beginning of this paragraph still says "commit_delay causes a delay just
before a synchronous commit attempts to flush WAL to disk". Since it now
applies to every WAL flush, that should be updated.

> ***************
> *** 560,570 ****
> is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
> other sessions are currently in active transactions; this avoids
> sleeping when it's unlikely that any other session will commit soon.
> ! Note that on most platforms, the resolution of a sleep request is
> ten milliseconds, so that any nonzero <varname>commit_delay</varname>
> setting between 1 and 10000 microseconds would have the same effect.
> ! Good values for these parameters are not yet clear; experimentation
> ! is encouraged.
> </para>
>
> <para>
> --- 575,607 ----

There's a similar problem at the beginning of this paragraph; it says
specifically, "The commit_delay parameter defines for how many microseconds
the server process will sleep after writing a commit record to the log with
LogInsert but before performing a LogFlush."

> is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
> other sessions are currently in active transactions; this avoids
> sleeping when it's unlikely that any other session will commit soon.
> ! Note that on some platforms, the resolution of a sleep request is
> ten milliseconds, so that any nonzero <varname>commit_delay</varname>
> setting between 1 and 10000 microseconds would have the same effect.
> ! </para>

As a side note, if we're ever going to recommend a fire-and-forget method for
setting commit_delay, it may be worth detecting whether the host sleep
granularity is limited like this. Setting commit_delay = 20 for your SSD and
silently getting commit_delay = 10000 would make for an unpleasant surprise.
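
A crude way to check that from the shell, just as a sketch (the server's own
sleep, pg_usleep, may not behave identically to a userland sleep):

    # assumes GNU date/sleep with sub-second support; on a platform with
    # 10 ms sleep granularity this prints something near 10000
    start=$(date +%s%N)
    sleep 0.0002
    end=$(date +%s%N)
    echo "requested 200us, got $(( (end - start) / 1000 ))us"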

> !
> ! <para>
> ! Since the purpose of <varname>commit_delay</varname> is to allow
> ! the cost of each flush operation to be more effectively amortized
> ! across concurrently committing transactions (potentially at the
> ! expense of transaction latency), it is necessary to quantify that
> ! cost when altering the setting. The higher that cost is, the more
> ! effective <varname>commit_delay</varname> is expected to be in
> ! increasing transaction throughput. The

That's true for spinning disks, but I suspect it does not hold for storage
with internal parallelism, notably virtualized storage. Consider an iSCSI
configuration with high bandwidth and high latency. When network latency is
the limiting factor, will sending larger requests less often still help?

> ! <xref linkend="pgtestfsync"> module can be used to measure the
> ! average time in microseconds that a single WAL flush operation
> ! takes. A value of half of the average time the module reports it
> ! takes to flush after a single 8kB write operation is often the most
> ! effective setting for <varname>commit_delay</varname>. The benefit
> ! of tuning <varname>commit_delay</varname> can even be pronounced on
> ! storage media with very fast sync times, such as solid-state drives
> ! or RAID arrays with a battery-backed write cache. However, higher
> ! values of <varname>commit_siblings</varname> should be used is such

Typo: "is such" -> "in such"

> ! cases, whereas smaller <varname>commit_siblings</varname> values
> ! can be helpful on higher latency media. Note that it is quite
> ! possible that a setting of <varname>commit_delay</varname> that is
> ! too high can increase transaction latency by so much that total
> ! transaction throughput suffers.
> </para>

One would be foolish to run a performance-sensitive workload like those in
question, including the choice to have synchronous_commit=on, on spinning
disks with no battery-backed write cache. A cloud environment is more
credible, but my benchmark showed no gain there. Overall, I still won't
personally recommend changing commit_delay without measuring the performance
change for one's particular workload and storage environment. commit_delay
can now bring some impressive gains in the right situations, but I doubt those
are common enough for a fire-and-forget setting to do more good than harm.

I suggest having the documentation recommend half of the fsync time as a
starting point for benchmarking different commit_delay settings against your
own workload. Indicate that it's more likely to help for direct use of
spinning disks than for BBWC/solid state/virtualized storage. Not sure what
else can be credibly given as general advice for PostgreSQL DBAs.
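
Concretely, that starting-point recipe would look something like this (a
sketch; the path and numbers are illustrative, and pg_test_fsync should be
pointed at the filesystem that holds pg_xlog):

    pg_test_fsync -s 15 -f /path/to/pgdata/pg_xlog/fsync.test

    # suppose the single 8kB write line reports ~1400 us per op; half of
    # that becomes the starting point, in postgresql.conf:
    #     commit_delay = 700           # microseconds
    #     commit_siblings = 5          # raise this on fast-flush storage
    #
    # then benchmark the workload at commit_delay = 0 and 700 before keeping
    # either setting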

Thanks,
nm
