synchronous commit vs. hint bits

Lists:	pgsql-hackers

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	synchronous commit vs. hint bits
Date:	2011-11-07 14:31:19
Message-ID:	CA+TgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h=w@mail.gmail.com

I've long considered synchronous_commit=off to be one of our best
performance features. Certainly, it's not applicable in every
situation, but there are many applications where losing a second or so
worth of transactions is an acceptable price to pay for not needing to
wait for the disk to spin around for every commit. However, recent
experimentation has convinced me that it's got a serious downside:
SetHintBits() can't set HEAP_XMIN_COMMITTED or HEAP_XMAX_COMMITTED
hints until the commit record has been durably flushed to disk. It
turns out that can cause a major performance regression on systems
with many CPU cores. I fixed this for temporary and unlogged tables
in commit 53f1ca59b5875f1d3e95ee709ecaddcbdfdbd175, but the same issue
exists (without any clear fix) for permanent tables.
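
The rule described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the PostgreSQL source: ModelTuple, ModelBuffer, model_set_hint_bits, and the flushed_upto parameter are hypothetical stand-ins, and the mask values are illustrative. It only shows the decision being discussed: on a permanent relation the hint is skipped while the commit record is not yet durable, whereas temporary and unlogged relations can set it right away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's real types. */
typedef uint64_t ModelLsn;

#define MODEL_XMIN_COMMITTED 0x0100u   /* illustrative mask values */
#define MODEL_XMAX_COMMITTED 0x0400u

typedef struct {
    uint16_t infomask;          /* hint bits are stored with the tuple */
} ModelTuple;

typedef struct {
    bool permanent;             /* false for temporary or unlogged relations */
    bool hint_dirty;            /* page carries a hint-bit-only change */
} ModelBuffer;

/*
 * Try to set a "committed" hint bit for a transaction already known to have
 * committed.  commit_lsn is the position of its commit record; flushed_upto
 * is how far WAL has been durably flushed.  Returns true if the bit was set.
 */
bool model_set_hint_bits(ModelTuple *tup, ModelBuffer *buf, uint16_t mask,
                         ModelLsn commit_lsn, ModelLsn flushed_upto)
{
    assert(tup != NULL && buf != NULL);

    /* Async commit not yet durable: a hinted permanent page must not reach
     * disk ahead of the commit record, so skip the hint for now. */
    if (buf->permanent && commit_lsn > flushed_upto)
        return false;

    tup->infomask = (uint16_t) (tup->infomask | mask);
    buf->hint_dirty = true;
    return true;
}
```

Every caller that gets false back has to consult the commit log again the next time it looks at the tuple, which is the repeated work the benchmarks below try to measure.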

Here are some benchmark results on Nate Boley's 32-core AMD system.
These are pgbench -T 300 -c 32 -j 32 runs with scale factor 100,
shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit =
off, checkpoint_segments = 300, checkpoint_timeout = 15min,
checkpoint_completion_target = 0.9:

tps = 8360.657049 (including connections establishing)
tps = 7818.766335 (including connections establishing)
tps = 8344.653290 (including connections establishing)

And here are the same results after lobotomizing SetHintBits() to
always set the hint bits immediately (#if 0 around the
TransactionIdIsValid(xid) test):

tps = 9548.943930 (including connections establishing)
tps = 9579.485767 (including connections establishing)
tps = 9590.350954 (including connections establishing)

That's pretty significant - about a 15% improvement. That's quite
remarkable when you think about the fact that we're talking about
refraining from setting hint bits for just a fraction of a second.
The failure to set those hint bits even for that very brief period of
time has to cause enough additional work (or lock contention) to
degrade performance quite noticeably.

So, what could we do about this? Ideas:

1. Set the hint bits right away, and avoid letting the page be flushed
to disk until the commit record is durably on disk (by bumping the
page LSN?).
2. Improve CLOG concurrency or performance in some way so that
consulting it repeatedly doesn't slow us down so much.
3. Do more backend-private XID status caching - in particular, for
commits, since this isn't a problem for aborts.
4. (Crazy idea) Have something that's like a hint bit, but stored in
the buffer header rather than the data block itself. We allocate an
array large enough to hold 2 bits per tuple (for the maximum number of
tuples that can exist on a page), with one bit indicating that xmin is
async-committed and the other indicating that xmax is async-committed.

There are probably other options as well.
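
Idea 4 could look roughly like the following. This is only a sketch of the data structure, assuming a per-buffer side array; SideHints, the side_hints_* functions, and MODEL_MAX_TUPLES_PER_PAGE (taken here as 291, the usual tuple limit for an 8 kB heap page) are hypothetical names, and nothing here says how the array would be tied into the real buffer header or when it would be cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: at most 291 tuples fit on an 8 kB heap page. */
#define MODEL_MAX_TUPLES_PER_PAGE 291
#define MODEL_SIDE_BYTES ((MODEL_MAX_TUPLES_PER_PAGE * 2 + 7) / 8)

enum { SIDE_XMIN_ASYNC = 0, SIDE_XMAX_ASYNC = 1 };

/* Two bits per tuple, kept next to the buffer rather than in the page. */
typedef struct {
    uint8_t bits[MODEL_SIDE_BYTES];
} SideHints;

void side_hints_clear_all(SideHints *h)
{
    memset(h->bits, 0, sizeof(h->bits));
}

/* Record that xmin (or xmax) of tuple_no is async-committed. */
void side_hints_set(SideHints *h, unsigned tuple_no, unsigned which)
{
    assert(tuple_no < MODEL_MAX_TUPLES_PER_PAGE && which <= SIDE_XMAX_ASYNC);
    unsigned bit = tuple_no * 2 + which;
    h->bits[bit / 8] = (uint8_t) (h->bits[bit / 8] | (1u << (bit % 8)));
}

bool side_hints_test(const SideHints *h, unsigned tuple_no, unsigned which)
{
    assert(tuple_no < MODEL_MAX_TUPLES_PER_PAGE && which <= SIDE_XMAX_ASYNC);
    unsigned bit = tuple_no * 2 + which;
    return (h->bits[bit / 8] >> (bit % 8)) & 1u;
}
```

Under these assumptions the array costs 73 bytes per buffer, and because it never reaches disk it would not need to wait for the commit record to be flushed.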

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 14:45:09
Message-ID:	CA+TgmobjN3k7GMFLvhF=a5oFheu2UmQ+gba19L5LLH2Z+K=Grw@mail.gmail.com

On Mon, Nov 7, 2011 at 9:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> So, what could we do about this? Ideas:
>
> 1. Set the hint bits right away, and avoid letting the page be flushed
> to disk until the commit record is durably on disk (by bumping the
> page LSN?).
> 2. Improve CLOG concurrency or performance in some way so that
> consulting it repeatedly doesn't slow us down so much.
> 3. Do more backend-private XID status caching - in particular, for
> commits, since this isn't a problem for aborts.
> 4. (Crazy idea) Have something that's like a hint bit, but stored in
> the buffer header rather than the data block itself. We allocate an
> array large enough to hold 2 bits per tuple (for the maximum number of
> tuples that can exist on a page), with one bit indicating that xmin is
> async-committed and the other indicating that xmax is async-committed.
>
> There are probably other options as well.

5. Make the WAL writer more responsive, maybe using latches, so that
it doesn't take as long for the commit record to make it out to disk.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 15:12:08
Message-ID:	24728.1320678728@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> SetHintBits() can't set HEAP_XMIN_COMMITTED or HEAP_XMAX_COMMITTED
> hints until the commit record has been durably flushed to disk. It
> turns out that can cause a major performance regression on systems
> with many CPU cores.

It seems to me that you've jumped to proposing solutions before you know
where the problem actually is --- or at least, if you do know where the
problem is, you didn't explain it. Is the cost in repeating clog
lookups, or in testing to determine whether it's safe to set the bit
yet, or is it contention associated with one or the other of those?

regards, tom lane

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 15:25:20
Message-ID:	CAHyXU0zoVxkSr0fuLx_e_et1Nz-+QNMryJBFHmeDjcMBknrUtg@mail.gmail.com

On Mon, Nov 7, 2011 at 8:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I've long considered synchronous_commit=off to be one of our best
> performance features. Certainly, it's not applicable in every
> situation, but there are many applications where losing a second or so
> worth of transactions is an acceptable price to pay for not needing to
> wait for the disk to spin around for every commit. However, recent
> experimentation has convinced me that it's got a serious downside:
> SetHintBits() can't set HEAP_XMIN_COMMITTED or HEAP_XMAX_COMMITTED
> hints until the commit record has been durably flushed to disk. It
> turns out that can cause a major performance regression on systems
> with many CPU cores. I fixed this for temporary and unlogged tables
> in commit 53f1ca59b5875f1d3e95ee709ecaddcbdfdbd175, but the same issue
> exists (without any clear fix) for permanent tables.

What's the source of the regression? Is it coming from losing the hint
bit and being forced out to clog? How likely is it to happen in
non-synthetic, real-world cases?

Thinking about the hint bit cache I was playing with a while back, I
guess you could have put the hint bit in the cache but refrained from
marking it in the page in the TransactionIdIsValid(xid)=false case --
in the first implementation I had only put the bit in the cache when
it was valid -- since TransactionIdIsValid(xid) is not necessarily
cheap though, maybe it's worth reserving an extra bit for the
transaction being valid in the cache if you went down that road.

Another way to attack this problem is to re-check and set the hint bit
if you can do it in the bgwriter -- there's a good chance you will
catch it in OLTP environments like pgbench, although it's not clear
whether the cost to the general case would be too high.

merlin

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 15:37:30
Message-ID:	CAHyXU0x-T=rq1Chzjk9P=-QZbD6DTrhWbdAKt_fku4EahSkPOQ@mail.gmail.com

On Mon, Nov 7, 2011 at 9:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> SetHintBits() can't set HEAP_XMIN_COMMITTED or HEAP_XMAX_COMMITTED
>> hints until the commit record has been durably flushed to disk. It
>> turns out that can cause a major performance regression on systems
>> with many CPU cores.
>
> It seems to me that you've jumped to proposing solutions before you know
> where the problem actually is --- or at least, if you do know where the
> problem is, you didn't explain it. Is the cost in repeating clog
> lookups, or in testing to determine whether it's safe to set the bit
> yet, or is it contention associated with one or the other of those?

In the current code, if you get to the IsValid check and fail to set
the bit, you've essentially done all the work for no reason. I
tested this pretty well a few months back, and (recalling from
memory), the IsValid check is maybe 25% of the entire cost when you
fail through the hint bit -- this is why I organized the cache to only
store the bit if the xid was known good -- then you get to skip the
check in the known good case and immediately set the bit (w/o marking
dirty) and move on. As noted above, the cache I was playing with was
a win from a performance point of view, but would require another bit to
address Robert's proposed case, and should really be compared against
alternative solutions (like marking the bit in the bgwriter) before
being seriously considered.

merlin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 16:23:38
Message-ID:	CA+TgmoYnEvOi=9hPVEtR_z9U2+Nguru4=cvY9QJarFcApAE5hw@mail.gmail.com

On Mon, Nov 7, 2011 at 10:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> SetHintBits() can't set HEAP_XMIN_COMMITTED or HEAP_XMAX_COMMITTED
>> hints until the commit record has been durably flushed to disk. It
>> turns out that can cause a major performance regression on systems
>> with many CPU cores.
>
> It seems to me that you've jumped to proposing solutions before you know
> where the problem actually is --- or at least, if you do know where the
> problem is, you didn't explain it. Is the cost in repeating clog
> lookups, or in testing to determine whether it's safe to set the bit
> yet, or is it contention associated with one or the other of those?

Good question. One possibly informative fact is that, on unlogged
tables, the same change doesn't seem to make any difference. Here are
the benchmark results with unlogged tables, configuration otherwise
identical to the OP:

[unpatched]
tps = 10624.851704 (including connections establishing)
tps = 10507.024822 (including connections establishing)
tps = 10714.411389 (including connections establishing)
[test whacked out]
tps = 10779.704540 (including connections establishing)
tps = 10523.863100 (including connections establishing)
tps = 10654.102699 (including connections establishing)

The difference might be noise, or it may be a very small real effect,
but it's clearly tiny compared to the change for permanent tables (but
note that this was not true prior to commit
53f1ca59b5875f1d3e95ee709ecaddcbdfdbd175). This seems to me to be
fairly compelling evidence that the problem is in the clog lookups
themselves, rather than the test that determines whether or not it's
safe to set the bit. However, I don't know whether the problem is the
cost of the test itself or some kind of associated contention. I
don't see much difference in CPU utilization between the patched and
unpatched code, but that's not really accurate enough to be certain.

I just reran both tests with LWLOCK_STATS defined. Again, five minute test run:

lwlock 11: shacq 87323748 exacq 3708555 blk 1932719 [unpatched]
lwlock 11: shacq 11682513 exacq 4769472 blk 677534 [patched]

11 = CLogControlLock, so you can see that the unpatched code is
acquiring CLogControlLock in shared mode more than 7x as often and
blocks on the lock about 3x as often, despite processing fewer
transactions.
The patched code has more exclusive-acquires, but that's at least
partly just because it's processing more transactions. Unfortunately,
I don't have oprofile access on this box and can't see exactly where
the time is being spent. However, I am not sure how much it matters.
With that much of an increase in CLOG traffic, something's gotta give.
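
The ratios quoted above can be checked directly from the two LWLOCK_STATS lines. The snippet below is just that arithmetic; LwlockCounts and count_ratio are hypothetical names introduced for the illustration, and the numbers are copied from the output above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t shacq;   /* shared acquisitions */
    uint64_t exacq;   /* exclusive acquisitions */
    uint64_t blk;     /* times the acquirer had to block */
} LwlockCounts;

/* Counters for lwlock 11 (CLogControlLock), copied from the runs above. */
const LwlockCounts clog_unpatched = { 87323748u, 3708555u, 1932719u };
const LwlockCounts clog_patched   = { 11682513u, 4769472u,  677534u };

/* How many times larger the first counter is than the second. */
double count_ratio(uint64_t numerator, uint64_t denominator)
{
    assert(denominator != 0);
    return (double) numerator / (double) denominator;
}
```

count_ratio(clog_unpatched.shacq, clog_patched.shacq) comes to about 7.5 and the same ratio for blk to about 2.85, while exclusive acquisitions go the other way: the patched run has about 1.29 times as many.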

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 17:26:39
Message-ID:	CA+U5nM+SvyB-DZWJsmmuEnDM1gg5pVLc0akfRyzkEFQrr_6x-A@mail.gmail.com

On Mon, Nov 7, 2011 at 2:45 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> 2. Improve CLOG concurrency or performance in some way so that
>> consulting it repeatedly doesn't slow us down so much.

We should also ask what makes the clog slow. I think it shows physical
contention as well as logical contention on the lwlock. Since we have
2 bits per transaction that means we will see at least 256
transactions fitting in each cacheline in the clog. Consecutive
transactions are currently stored next to each other in the clog, so
that the "current" cacheline needs to be passed around between 256
transactions, one at a time. That is a problem if they all finish near
enough the same time.

My proposal is to stripe the clog, so that consecutive xids are not
adjacent in the clog, such that xids are always at least 64 bytes
apart on a 8192 byte clog page. That allows 128 commits with
consecutive xids to complete concurrently, with respect to physical
access to memory.

That's just a "one line" change in the defines at top of clog.c, so
easy enough to play with.

#define CACHELINE_SZ 64
#define CACHELINES_PER_BLOCK (BLCKSZ / CACHELINE_SZ)
#define CLOG_XACTS_PER_CACHELINE (CLOG_XACTS_PER_BYTE * CACHELINE_SZ)
#define TransactionIdToByte(xid) \
(CACHELINES_PER_BLOCK * \
(TransactionIdToPgIndex(xid) /CLOG_XACTS_PER_CACHELINE)) \
+ (TransactionIdToPgIndex(xid) % CLOG_XACTS_PER_CACHELINE)

plus a few extra lines to fix the other defines.
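
For readers who want to experiment, here is one possible striped layout written out as plain functions. It is an independent sketch of the stated goal (consecutive xids at least 64 bytes apart on an 8192-byte page), not the patch or the macro above; the MODEL_* constants and the striped_byte and striped_shift names are assumptions made for the example, and a 64-byte cache line is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BLCKSZ              8192
#define MODEL_CACHELINE_SZ        64     /* assumed cache line size */
#define MODEL_BITS_PER_XACT       2
#define MODEL_XACTS_PER_BYTE      4      /* 8 bits / 2 status bits */
#define MODEL_XACTS_PER_PAGE      (MODEL_BLCKSZ * MODEL_XACTS_PER_BYTE)
#define MODEL_CACHELINES_PER_PAGE (MODEL_BLCKSZ / MODEL_CACHELINE_SZ)

static_assert(MODEL_BLCKSZ % MODEL_CACHELINE_SZ == 0,
              "page must hold a whole number of cache lines");

/* Byte offset within the page that holds the status bits of xid. */
uint32_t striped_byte(uint32_t xid)
{
    uint32_t idx  = xid % MODEL_XACTS_PER_PAGE;        /* position in page */
    uint32_t line = idx % MODEL_CACHELINES_PER_PAGE;   /* round-robin line */
    uint32_t slot = idx / MODEL_CACHELINES_PER_PAGE;   /* slot within line */

    return line * MODEL_CACHELINE_SZ + slot / MODEL_XACTS_PER_BYTE;
}

/* Bit shift of the 2-bit status field inside that byte. */
uint32_t striped_shift(uint32_t xid)
{
    uint32_t slot = (xid % MODEL_XACTS_PER_PAGE) / MODEL_CACHELINES_PER_PAGE;

    return (slot % MODEL_XACTS_PER_BYTE) * MODEL_BITS_PER_XACT;
}
```

With this layout xids 0, 1, 2, ... land at byte offsets 0, 64, 128, ..., so 128 consecutive commits each touch a different cache line before the pattern wraps around. The cost is that a sequential scan of the clog for a range of xids no longer reads adjacent bytes.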

> 5. Make the WAL writer more responsive, maybe using latches, so that
> it doesn't take as long for the commit record to make it out to disk.

I'm working on this already as part of the update for power
reduction/group commit/replication performance.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 18:05:11
Message-ID:	CA+TgmoYwNF6Rkcjv3CjSAF=wpvuv-V4C4MXueVExg1dZnXqrRA@mail.gmail.com

On Mon, Nov 7, 2011 at 12:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> 5. Make the WAL writer more responsive, maybe using latches, so that
>> it doesn't take as long for the commit record to make it out to disk.
>
> I'm working on this already as part of the update for power
> reduction/group commit/replication performance.

OK. Here's an interesting result on that front that I so far can't
explain. I lowered wal_writer_delay to 20 ms and repeated my test:

tps = 10175.265689 (including connections establishing) [unpatched]
tps = 10159.597727 (including connections establishing) [patched]

Now, that's odd. I expected that to improve performance on the
unpatched code, by reducing the amount of time we have to wait for the
commit record to hit the disk. I did *not* expect it to improve the
performance of the patched code, since one would think that setting
the hint bit the first time through would be about as good as we could
possibly do. And yet, those are the numbers. Apparently, there's
some other effect whereby a more responsive walwriter improves
performance on this setup (beats me what it is, though).

Here it is with wal_writer_delay=50 ms:

tps = 9964.225358 (including connections establishing) [unpatched]
tps = 10048.396729 (including connections establishing) [patched]

And back to wal_writer_delay=200ms:

tps = 8119.121633 (including connections establishing) [unpatched]
tps = 9602.645495 (including connections establishing) [patched]

So it seems like there is quite a bit of win available here, though at
the moment I don't know why.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 18:08:13
Message-ID:	CAHyXU0xX8ZMd9Zh0p5Xk52-Q3t7L_rJY_Zw8V52tc72OBfagYA@mail.gmail.com

On Mon, Nov 7, 2011 at 9:25 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Mon, Nov 7, 2011 at 8:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I've long considered synchronous_commit=off to be one of our best
>> performance features. Certainly, it's not applicable in every
>> situation, but there are many applications where losing a second or so
>> worth of transactions is an acceptable price to pay for not needing to
>> wait for the disk to spin around for every commit. However, recent
>> experimentation has convinced me that it's got a serious downside:
>> SetHintBits() can't set HEAP_XMIN_COMMITTED or HEAP_XMAX_COMMITTED
>> hints until the commit record has been durably flushed to disk. It
>> turns out that can cause a major performance regression on systems
>> with many CPU cores. I fixed this for temporary and unlogged tables
>> in commit 53f1ca59b5875f1d3e95ee709ecaddcbdfdbd175, but the same issue
>> exists (without any clear fix) for permanent tables.
>
> What's the source of the regression? Is it coming from losing the hint
> bit and being forced out to clog? How likely is it really going to
> happen in non synthetic real world cases?
>
> Thinking about the hint bit cache I was playing with a while back, I
> guess you could have put the hint bit in the cache but refrained from
> marking it in the page in the TransactionIdIsValid(xid)=false case --
> in the first implementation I had only put the bit in the cache when
> it was valid -- since TransactionIdIsValid(xid) is not necessarily
> cheap though, maybe it's worth reserving an extra bit for the
> transaction being valid in the cache if you went down that road.
>
> Another way to attack this problem is to re-check and set the hint bit
> if you can do it in the bgwriter -- there's a good chance you will
> catch it in oltp environments like pgbench although it not clear if
> the cost to the general case would be too high.

Thinking about this more, the backend local cache approach is probably
going to be useless in terms of addressing this problem -- mostly due
to the fact that the cache is, well, local. Even if backend A takes
the time to mark the bit in its own cache, backends B-Z haven't yet
and presumably by the time they do the page has been rolled out
anyways so you get no benefit. The cache helps when a backend sees
the same transaction spread out over a number of tuples/pages --
that's simply not the case in OLTP.

Doing the work in the bgwriter might do the trick assuming the
bgwriter consistently loses the race against both transaction
resolution and the wal, and the extra clog lookup (when you win the
race) penalty doesn't sting too much... possibly do this in conjunction
with the clog striping Simon is thinking about.

merlin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 18:19:29
Message-ID:	CA+TgmoYZVmXMSNUUQi_MM4-cHd39ZN6d7-7HX4i7je-vmXYyyA@mail.gmail.com

On Mon, Nov 7, 2011 at 1:08 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> Thinking about this more, the backend local cache approach is probably
> going to be useless in terms of addressing this problem -- mostly due
> to the fact that the cache is, well, local. Even if backend A takes
> the time to mark the bit in its own cache, backends B-Z haven't yet
> and presumably by the time they do the page has been rolled out
> anyways so you get no benefit. The cache helps when a backend sees
> the same transaction spread out over a number of tuples/pages --
> that's simply not the case in OLTP.

Ah, right. Good point.

> Doing the work in the bgwriter might do the trick assuming the
> bgwriter consistently loses the race against both transaction
> resolution and the wal, and the extra clog lookup (when you win the
> race) penalty doesn't sting too muh...

But I can't see how this can work. The background writer is only
designed to do one thing: ensuring a supply of clean buffers for
backends that need to allocate new ones. I'm not sure the background
writer is going to do anything at all on this test, since the data set
fits inside shared_buffers and therefore there's no buffer eviction
happening. But even if it does, it's certainly not going to scan all
1 million shared buffers anywhere near quick enough to matter; it's
going to be limited to at most 100 buffers every 200 ms, which means
that even if it ran at top speed for the entire test, it would only
get through about 15% of the buffer pool even *once* before the test
ended. That's not even slightly close to what would be needed to move
the needle here; you would need to visit any given buffer within a few
hundred milliseconds of the relevant transaction commit.
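
The "about 15%" figure follows from simple arithmetic, sketched below. The function name and parameters are hypothetical; the inputs are the ones mentioned in this thread (8GB of shared_buffers, a 300-second run, at most 100 buffers per 200 ms round), plus an assumed 8 kB block size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the fraction of the buffer pool a rate-limited scanner
 * could visit once during a test run.
 */
double scan_fraction(uint64_t shared_buffers_bytes, uint32_t block_size,
                     uint32_t max_pages_per_round, uint32_t round_ms,
                     uint32_t test_seconds)
{
    assert(block_size != 0 && round_ms != 0);

    uint64_t nbuffers = shared_buffers_bytes / block_size;
    uint64_t rounds   = (uint64_t) test_seconds * 1000 / round_ms;
    uint64_t visited  = rounds * max_pages_per_round;

    assert(nbuffers != 0);
    return (double) visited / (double) nbuffers;
}
```

With those inputs there are 1,048,576 buffers and 1,500 rounds, so at most 150,000 buffers are visited and the function returns roughly 0.143.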

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 18:37:43
Message-ID:	CAHyXU0wYOmvmwk1fY3BHZCb0QKXiqBFwmYmoOggQfLnAXNEf3A@mail.gmail.com

On Mon, Nov 7, 2011 at 12:19 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Nov 7, 2011 at 1:08 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> Thinking about this more, the backend local cache approach is probably
>> going to be useless in terms of addressing this problem -- mostly due
>> to the fact that the cache is, well, local. Even if backend A takes
>> the time to mark the bit in its own cache, backends B-Z haven't yet
>> and presumably by the time they do the page has been rolled out
>> anyways so you get no benefit. The cache helps when a backend sees
>> the same transaction spread out over a number of tuples/pages --
>> that's simply not the case in OLTP.
>
> Ah, right. Good point.
>
>> Doing the work in the bgwriter might do the trick assuming the
>> bgwriter consistently loses the race against both transaction
>> resolution and the wal, and the extra clog lookup (when you win the
>> race) penalty doesn't sting too muh...
>
> But I can't see how this can work. The background writer is only
> designed to do one thing: ensuring a supply of clean buffers for
> backends that need to allocate new ones. I'm not sure the background
> writer is going to do anything at all on this test, since the data set
> fits inside shared_buffers and therefore there's no buffer eviction
> happening. But even if it does, it's certainly not going to scan all
> 1 million shared buffers anywhere near quick enough to matter; it's
> going to be limited to at most 100 buffers every 200 ms, which means
> that even if it ran at top speed for the entire test, it would only
> get through about 15% of the buffer pool even *once* before the test
> ended. That's not even slightly close to what would be needed to move
> the needle here; you would need to visit any given buffer within a few
> hundred milliseconds of the relevant transaction commit.

Well, I'd argue that in most real world, high write intensity
databases there is constant pressure on pages to be flushed out to
make room for new ones being written to and the database size is much,
much larger than shared buffers -- pgbench is 100% update and pretty
novel in that respect. I guess I said 'bgwriter' when I really meant
'generally upon eviction, either by bgwriter or an evicting backend'.
But even given that, probably the lag is too long to be of useful
benefit to your problem.

merlin

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-07 23:33:25
Message-ID:	CA+U5nM+NqsZJ_L4ciDSH4p_y1NQZHk8=TNiW-aHUXCOR0Xs0=g@mail.gmail.com

On Mon, Nov 7, 2011 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>> 5. Make the WAL writer more responsive, maybe using latches, so that
>> it doesn't take as long for the commit record to make it out to disk.
>
> I'm working on this already as part of the update for power
> reduction/group commit/replication performance.

I extracted this from my current patch for you to test.

Rather useful actually 'cos it's allowed me a sensible phasing of the
development.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment	Content-Type	Size
walwriter_latch.v2.patch	application/octet-stream	4.9 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-08 02:35:23
Message-ID:	CA+TgmoYqU5PVhPWFr+vdrrLbrZm3KmmQ5DTNYk6jugdG-te8Tg@mail.gmail.com

On Mon, Nov 7, 2011 at 6:33 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Mon, Nov 7, 2011 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>>> 5. Make the WAL writer more responsive, maybe using latches, so that
>>> it doesn't take as long for the commit record to make it out to disk.
>>
>> I'm working on this already as part of the update for power
>> reduction/group commit/replication performance.
>
> I extracted this from my current patch for you to test.

Thank you!

> Rather useful actually 'cos its allowed me a sensible phasing of the
> development.

+1.

Hmm, this is different than what I was expecting, although that's not
necessarily bad. What this does is retain wal_writer_delay, but allow
the WAL writer to be woken up more frequently if there's enough WAL to
justify it. What I was expecting you to do is eliminate
wal_writer_delay altogether and drive the wakeups entirely off of the
latch. I think you could get away with that, because SetLatch is
ridiculously cheap if the latch is already set.
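
To illustrate why setting an already-set latch can be cheap, here is a toy model using C11 atomics. It is not PostgreSQL's latch implementation: ModelLatch and the model_latch_* functions are hypothetical, and the wakeups counter merely stands in for whatever expensive step (a signal, a pipe write) would really wake the sleeping process.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_bool is_set;
    atomic_uint wakeups;    /* counts trips through the expensive path */
} ModelLatch;

void model_latch_init(ModelLatch *latch)
{
    assert(latch != NULL);
    atomic_init(&latch->is_set, false);
    atomic_init(&latch->wakeups, 0u);
}

void model_latch_set(ModelLatch *latch)
{
    /* Fast path: a plain read, no write to the shared cache line. */
    if (atomic_load(&latch->is_set))
        return;

    /* Slow path: only the caller that flips false -> true wakes the sleeper. */
    if (!atomic_exchange(&latch->is_set, true))
        atomic_fetch_add(&latch->wakeups, 1u);
}

/* Called by the sleeper after it wakes up and before it rechecks for work. */
void model_latch_reset(ModelLatch *latch)
{
    atomic_store(&latch->is_set, false);
}

unsigned model_latch_wakeups(ModelLatch *latch)
{
    return atomic_load(&latch->wakeups);
}
```

In this model a thousand back-to-back calls to model_latch_set() trigger a single wakeup, and the next one only happens after the sleeper has called model_latch_reset().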

Anyway, I'll give this a spin as you have it and see what falls out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-08 05:11:52
Message-ID:	85FF3FFB-E533-4E85-8047-7FA4F05A99AA@gmail.com

On Nov 7, 2011, at 9:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Nov 7, 2011 at 6:33 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Mon, Nov 7, 2011 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>>>> 5. Make the WAL writer more responsive, maybe using latches, so that
>>>> it doesn't take as long for the commit record to make it out to disk.
>>>
>>> I'm working on this already as part of the update for power
>>> reduction/group commit/replication performance.
>>
>> I extracted this from my current patch for you to test.
>
> Thank you!
>
>> Rather useful actually 'cos its allowed me a sensible phasing of the
>> development.
>
> +1.
>
> <reads patch>
>
> Hmm, this is different than what I was expecting, although that's not
> necessarily bad. What this does is retain wal_writer_delay, but allow
> the WAL writer to be woken up more frequently if there's enough WAL to
> justify it. What I was expecting you to do is eliminate
> wal_writer_delay altogether and drive the wakeups entirely off of the
> latch.

Oh, I think I see why you didn't do that...

Anyway, I'll try to post test results in the morning.

...Robert

From:	Simon Riggs <simon(at)2ndQuadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-08 06:59:19
Message-ID:	CA+U5nML2RAVoWC0EdoLRC2muAgcNoBvowx0GNBpmf4Lwb_FpsA@mail.gmail.com

On Tue, Nov 8, 2011 at 2:35 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> What I was expecting you to do is eliminate
> wal_writer_delay altogether and drive the wakeups entirely off of the
> latch.

Please continue to expect that, I just haven't finished it yet...

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-08 14:09:09
Message-ID:	CA+TgmobJZn5xjdKMXEZsxNdnEWoTHddr0eKLq=dyh1h_0oM-tQ@mail.gmail.com

On Tue, Nov 8, 2011 at 1:59 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Please continue to expect that, I just haven't finished it yet...

OK.

So here's the deal: this is an effective, mostly automatic solution to
the performance problem noted in the original post. For example, at
32 clients, the original test case gives about 7800-8300 tps with
wal_writer_delay=200ms, and about 10100 tps with
wal_writer_delay=20ms. With wal_writer_delay=200ms but the patch
applied, median of three five-minute pgbench runs is 9952 tps; all
three runs are under 10000 tps. So it's not quite as good as
adjusting wal_writer_delay downward, but it gives you roughly 90% of
the benefit automatically, without needing to adjust any settings.
That seems very worthwhile.

At 1 client, 8 clients, and 80 clients, the results were even better.
The patched code with wal_writer_delay=200ms slightly outperformed the
unpatched code with wal_writer_delay=20ms (and outperformed the
unpatched code with wal_writer_delay=200ms even more). It's possible
that some of that is random variation, but maybe not all of it - e.g.
at 1 client:

unpatched, wal_writer_delay = 200ms: 602, 604, 607 tps
unpatched, wal_writer_delay = 20ms: 614, 616, 616 tps
patched, wal_writer_delay = 200ms: 633, 634, 636 tps

The fact that those numbers aren't bouncing around much suggests that
it might be a real effect.
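
One informal way to look at "not bouncing around much" is to compare the spread within each configuration against the gaps between configurations. The sketch below does that with the nine 1-client numbers above; the dictionary keys and the helper ranges_overlap are names invented for illustration, and three runs per configuration is far too few for a proper statistical test, so this is a sanity check at best.

```python
from statistics import median

# 1-client tps figures quoted above, three runs per configuration.
runs = {
    "unpatched_200ms": [602, 604, 607],
    "unpatched_20ms": [614, 616, 616],
    "patched_200ms": [633, 634, 636],
}


def ranges_overlap(a, b):
    """True if the min-max spans of two samples intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


medians = {name: median(tps) for name, tps in runs.items()}

# Relative gain of the patched build over each unpatched configuration.
gain_vs_200ms = medians["patched_200ms"] / medians["unpatched_200ms"] - 1
gain_vs_20ms = medians["patched_200ms"] / medians["unpatched_20ms"] - 1

for name, tps in runs.items():
    print(name, "median", medians[name], "spread", max(tps) - min(tps))
print(f"patched vs unpatched 200ms: {gain_vs_200ms:+.1%}")
print(f"patched vs unpatched 20ms:  {gain_vs_20ms:+.1%}")
```

The three ranges do not overlap at all, which is consistent with the differences being more than run-to-run noise.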

I have also reviewed the code and it seems OK.

So +1 from me for applying this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	YAMAMOTO Takashi <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>
Cc:	simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-29 14:17:52
Message-ID:	CA+TgmoYQgDnrhmbBnYRgW60wwd85D7kPZ5EK3=9vOtyfF3Ur-g@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Nov 29, 2011 at 1:42 AM, YAMAMOTO Takashi
<yamt(at)mwd(dot)biglobe(dot)ne(dot)jp> wrote:
>> On Mon, Nov 7, 2011 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>>>> 5. Make the WAL writer more responsive, maybe using latches, so that
>>>> it doesn't take as long for the commit record to make it out to disk.
>>>
>>> I'm working on this already as part of the update for power
>>> reduction/group commit/replication performance.
>>
>> I extracted this from my current patch for you to test.
>
> is it expected to produce more frequent fsync?

Yes, I would expect that. What kind of increase are you seeing? Is
it causing a problem for you, or are you just making an observation?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
To:	robertmhaas(at)gmail(dot)com
Cc:	simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-30 06:37:29
Message-ID:	20111130063729.C15EE14A223@mail.netbsd.org
Lists:	pgsql-hackers

hi,

> On Tue, Nov 29, 2011 at 1:42 AM, YAMAMOTO Takashi
> <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp> wrote:
>>> On Mon, Nov 7, 2011 at 5:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>
>>>>> 5. Make the WAL writer more responsive, maybe using latches, so that
>>>>> it doesn't take as long for the commit record to make it out to disk.
>>>>
>>>> I'm working on this already as part of the update for power
>>>> reduction/group commit/replication performance.
>>>
>>> I extracted this from my current patch for you to test.
>>
>> is it expected to produce more frequent fsync?
>
> Yes, I would expect that. What kind of increase are you seeing? Is
> it causing a problem for you, or are you just making an observation?

i was curious because my application uses async commits mainly to
avoid frequent fsync. i have no numbers right now.

YAMAMOTO Takashi

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	YAMAMOTO Takashi <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>
Cc:	simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-11-30 13:10:00
Message-ID:	CA+TgmobxA-b8sZey9wvcFND9P3W2Pub1haatpg5pfRKZ-6sA8w@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Nov 30, 2011 at 1:37 AM, YAMAMOTO Takashi
<yamt(at)mwd(dot)biglobe(dot)ne(dot)jp> wrote:
>> Yes, I would expect that. What kind of increase are you seeing? Is
>> it causing a problem for you, or are you just making an observation?
>
> i was curious because my application uses async commits mainly to
> avoid frequent fsync. i have no numbers right now.

Oh, that's interesting. Why do you want to avoid frequent fsyncs? I
thought the point of synchronous_commit=off was to move the fsyncs to
the background, but not necessarily to decrease the frequency.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	yamt(at)mwd(dot)biglobe(dot)ne(dot)jp (YAMAMOTO Takashi)
To:	robertmhaas(at)gmail(dot)com
Cc:	simon(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-12-01 06:29:32
Message-ID:	20111201062932.DF04014A257@mail.netbsd.org
Lists:	pgsql-hackers

hi,

> On Wed, Nov 30, 2011 at 1:37 AM, YAMAMOTO Takashi
> <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp> wrote:
>>> Yes, I would expect that. What kind of increase are you seeing? Is
>>> it causing a problem for you, or are you just making an observation?
>>
>> i was curious because my application uses async commits mainly to
>> avoid frequent fsync. i have no numbers right now.
>
> Oh, that's interesting. Why do you want to avoid frequent fsyncs?

simply because it was expensive on my environment.

> I
> thought the point of synchronous_commit=off was to move the fsyncs to
> the background, but not necessarily to decrease the frequency.

it makes sense.
but it's normal for users to abuse features. :)

YAMAMOTO Takashi

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	YAMAMOTO Takashi <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, simon(at)2ndquadrant(dot)com
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-12-01 09:09:06
Message-ID:	201112011009.06587.andres@anarazel.de
Lists:	pgsql-hackers

Hi Robert,
On Wednesday, November 30, 2011 02:10:00 PM Robert Haas wrote:
> On Wed, Nov 30, 2011 at 1:37 AM, YAMAMOTO Takashi
>
> <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp> wrote:
> >> Yes, I would expect that. What kind of increase are you seeing? Is
> >> it causing a problem for you, or are you just making an observation?
> >
> > i was curious because my application uses async commits mainly to
> > avoid frequent fsync. i have no numbers right now.
> Oh, that's interesting. Why do you want to avoid frequent fsyncs? I
> thought the point of synchronous_commit=off was to move the fsyncs to
> the background, but not necessarily to decrease the frequency.
Is that so? If it wouldn't avoid fsyncs how could you reach multiple thousand
TPS in a writing pgbench run on a pretty ordinary system with fsync=on?

Andres

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	pgsql-hackers(at)postgresql(dot)org, YAMAMOTO Takashi <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, simon(at)2ndquadrant(dot)com
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-12-01 14:11:43
Message-ID:	CA+TgmoapnMyXBYqqCKQsF6ab6Yyr7LszNTK3FanjH1ZEYjnPzA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Dec 1, 2011 at 4:09 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> Oh, that's interesting. Why do you want to avoid frequent fsyncs? I
>> thought the point of synchronous_commit=off was to move the fsyncs to
>> the background, but not necessarily to decrease the frequency.
> Is that so? If it wouldn't avoid fsyncs how could you reach multiple thousand
> TPS in a writing pgbench run on a pretty ordinary system with fsync=on?

Eh, well, what would stop you from achieving that? An fsync operation
that occurs in the background doesn't block further transactions from
completing. Meanwhile, getting the WAL records on disk faster allows
us to set hint bits sooner, which is a significant win, as shown by
the numbers I posted upthread.

One possible downside of trying to kick off the fsync more quickly is
that if there is a continuous stream of background fsyncs going on, a
process that needs to do an XLogFlush in the foreground (i.e. a
synchronous_commit=on transaction in the middle of many
synchronous_commit=off transactions) might be more likely to find an
fsync already in progress and therefore need to wait until it
completes before starting the next one, slowing things down. But I'm
a bit reluctant to believe that is a real effect without some data.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org, YAMAMOTO Takashi <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, simon(at)2ndquadrant(dot)com
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-12-01 14:18:29
Message-ID:	201112011518.29964.andres@anarazel.de
Lists:	pgsql-hackers

On Thursday, December 01, 2011 03:11:43 PM Robert Haas wrote:
> On Thu, Dec 1, 2011 at 4:09 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> Oh, that's interesting. Why do you want to avoid frequent fsyncs? I
> >> thought the point of synchronous_commit=off was to move the fsyncs to
> >> the background, but not necessarily to decrease the frequency.
> >
> > Is that so? If it wouldn't avoid fsyncs how could you reach multiple
> > thousand TPS in a writing pgbench run on a pretty ordinary system with
> > fsync=on?
> Eh, well, what would stop you from achieving that? An fsync operation
> that occurs in the background doesn't block further transactions from
> completing.
But it will slow down overall system I/O. For one, an fsync() on Linux will
drain the I/O submit queue. For another, it counts against the total random
I/O operations a device can do, which in turn slows down anything else doing
synchronous random I/O, i.e. read(2).

> Meanwhile, getting the WAL records on disk faster allows
> us to set hint bits sooner, which is a significant win, as shown by
> the numbers I posted upthread.
Oh, that part I don't doubt. Sorry for that.

Andres

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, YAMAMOTO Takashi <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, simon(at)2ndquadrant(dot)com
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-12-01 14:58:14
Message-ID:	CAMkU=1w4RTaSCG8wXM4DxawUrNCXGTKSkPY9JTzxte7jwWekBw@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Dec 1, 2011 at 6:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> One possible downside of trying to kick off the fsync more quickly is
> that if there is a continuous stream of background fsyncs going on, a
> process that needs to do an XLogFlush in the foreground (i.e. a
> synchronous_commit=on transaction in the middle of many
> synchronous_commit=off transactions) might be more likely to find an
> fsync already in progress and therefore need to wait until it
> completes before starting the next one, slowing things down.

Waiting until the other one completes is how it currently is
implemented, but is it necessary from a correctness view? It seems
like the WALWriteLock only needs to protect the write, and not the
sync (assuming the sync method allows those to be separate actions),
and that there could be multiple fsync requests from different
processes pending at the same time without a correctness problem.
After dropping the WALWriteLock and doing the fsync, it would then
have to take the lock again or maybe just a spinlock to update the
accounting for how far the log has been flushed. So rather than one
committing process blocking on fsync and a bunch of others blocking on
WALWriteLock, you could have all of them blocking on different fsyncs
and let the kernel deal with waking them up. I don't know at all
whether this would actually be an improvement, assuming it would even
be safe. Reading the xlog.c code, it is hard to tell which
designs/features are there for safety and which ones are there for
suspected performance reasons.
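
To make the idea concrete, here is a toy model of that locking pattern. It is a sketch of the proposal only, not of how xlog.c works: the class ToyWal, its fields, and the use of a Python threading.Lock in place of WALWriteLock are all assumptions made for illustration, and whether the scheme would be safe or faster in PostgreSQL is exactly the open question.

```python
import os
import tempfile
import threading


class ToyWal:
    """Toy log: the lock covers the write and the bookkeeping, not the fsync."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.lock = threading.Lock()   # stands in for WALWriteLock
        self.written_upto = 0          # bytes handed to the kernel
        self.flushed_upto = 0          # bytes known to be durable

    def commit(self, record: bytes) -> int:
        # Step 1: write under the lock and remember how far the log reaches.
        with self.lock:
            view = memoryview(record)
            while view:                # os.write may write fewer bytes than asked
                view = view[os.write(self.fd, view):]
            self.written_upto += len(record)
            my_position = self.written_upto

        # Step 2: fsync with no lock held; several threads may be here at once.
        os.fsync(self.fd)

        # Step 3: retake the lock only to advance the flush accounting.
        with self.lock:
            self.flushed_upto = max(self.flushed_upto, my_position)
        return my_position

    def close(self):
        os.close(self.fd)


with tempfile.TemporaryDirectory() as tmp:
    wal = ToyWal(os.path.join(tmp, "toy.wal"))
    threads = [threading.Thread(target=wal.commit, args=(b"x" * 100,))
               for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    final_flushed = wal.flushed_upto
    wal.close()
```

Each thread records its own write position before releasing the lock, so after its fsync returns it can safely claim at least that much as flushed; taking the max keeps a slow fsync from moving the pointer backwards.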

Sorry for hijacking your topic, it is just something I had been
thinking about for a while.

Cheers,

Jeff

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, YAMAMOTO Takashi <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>, simon(at)2ndquadrant(dot)com
Subject:	Re: synchronous commit vs. hint bits
Date:	2011-12-01 16:47:52
Message-ID:	CA+TgmoZZrLz9TaZs0kG74o3S3YU_XJNOW5dkq5gakQZ=4RmiMA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Dec 1, 2011 at 9:58 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> Waiting until the other one completes is how it currently is
> implemented, but is it necessary from a correctness view? It seems
> like the WALWriteLock only needs to protect the write, and not the
> sync (assuming the sync method allows those to be separate actions),
> and that there could be multiple fsync requests from different
> processes pending at the same time without a correctness problem.

I've wondered about that, too. At least on Linux, the overhead of a
system call seems to be pretty low - e.g. the ridiculous number of
lseek calls we do on a pgbench -S doesn't seem to create much overhead
until the inode mutex starts to become contended; and that problem
should be fixed in Linux 3.2. But I'm not sure if system calls are
similarly cheap on all platforms, or even if it's true on Linux for
fsync() in particular.

There's another possible approach here, too: instead of waiting to set
hint bits until the commit record hits the disk, we could allow the
hint bits to be set immediately on the condition that we don't write the
page out until the commit record hits the disk. Bumping the page LSN would
do that, but I think that might be problematic since setting hint bits
isn't WAL-logged. If so, we could possibly fix that by storing a
second LSN for the page out of line, e.g. in the buffer descriptor.
That might be even faster than speeding up the WAL flush.
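
A toy model of the second-LSN idea might look like the following. Everything here is hypothetical: ToyBuffer, hint_lsn, set_commit_hint and can_write_out are names invented for this sketch and do not correspond to PostgreSQL's actual buffer descriptor or functions. The point is only to show the rule "set the hint now, hold the page back until WAL has been flushed far enough".

```python
from dataclasses import dataclass


@dataclass
class ToyBuffer:
    page_lsn: int = 0             # LSN of the last WAL-logged change to the page
    hint_lsn: int = 0             # hypothetical out-of-line LSN guarding hint bits
    xmin_committed: bool = False  # stands in for a HEAP_XMIN_COMMITTED hint

    def set_commit_hint(self, commit_lsn: int) -> None:
        # Set the hint right away, but remember which commit record it relies on.
        self.xmin_committed = True
        self.hint_lsn = max(self.hint_lsn, commit_lsn)

    def can_write_out(self, wal_flushed_upto: int) -> bool:
        # The page may reach disk only once WAL is durable past both LSNs.
        return wal_flushed_upto >= max(self.page_lsn, self.hint_lsn)


buf = ToyBuffer(page_lsn=100)
buf.set_commit_hint(commit_lsn=250)   # commit record not yet flushed
print(buf.can_write_out(200))         # WAL flushed only to 200: hold the page
print(buf.can_write_out(250))         # commit record durable: safe to write
```

Because hint_lsn lives outside the page, the page's own LSN is left alone, which sidesteps the concern that hint-bit updates are not WAL-logged.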

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company