Re: bgwriter changes

Lists: pgsql-hackers
From: Neil Conway <neilc(at)samurai(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: bgwriter changes
Date: 2004-12-14 13:30:54
Message-ID: 41BEEB0E.4070003@samurai.com

In recent discussion[1] with Simon Riggs, there has been some talk of
making some changes to the bgwriter. To summarize the problem, the
bgwriter currently scans the entire T1+T2 buffer lists and returns a
list of all the currently dirty buffers. It then selects a subset of
that list (computed using bgwriter_percent and bgwriter_maxpages) to
flush to disk. Not only does this mean we can end up scanning a
significant portion of shared_buffers for every invocation of the
bgwriter, we also do the scan while holding the BufMgrLock, likely
hurting scalability.
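
As a rough illustration of the behaviour summarized above, here is a small
Python model of one bgwriter round. It is only a sketch and not the
PostgreSQL source: the function name, the (buffer_id, is_dirty) list and
the round-up rule are assumptions made for the example; only
bgwriter_percent and bgwriter_maxpages come from the discussion.

```python
def bgwriter_pass_current(buffers, bgwriter_percent, bgwriter_maxpages):
    """Model of one bgwriter round as described in the summary.

    buffers: list of (buffer_id, is_dirty) pairs, a hypothetical stand-in
    for the T1+T2 buffer lists.
    Returns (buffers_examined, ids_selected_for_flush).
    """
    # Step 1: walk the entire list and collect every dirty buffer.
    # In the server this whole scan happens while BufMgrLock is held.
    dirty = [buf_id for buf_id, is_dirty in buffers if is_dirty]
    examined = len(buffers)

    # Step 2: keep bgwriter_percent of the dirty list (rounded up here,
    # which is an assumption), capped at bgwriter_maxpages.
    quota = (len(dirty) * bgwriter_percent + 99) // 100
    quota = min(quota, bgwriter_maxpages)
    return examined, dirty[:quota]


# Hypothetical pool: 10000 buffers, every fourth one dirty.
pool = [(i, i % 4 == 0) for i in range(10000)]
examined, selected = bgwriter_pass_current(pool, bgwriter_percent=1,
                                           bgwriter_maxpages=100)
print(examined, len(selected))
```

In this model all 10000 buffers are examined to select only 25 for
writing, which is the cost the summary is concerned about.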

I think a fix for this in some fashion is warranted for 8.0. Possible
solutions:

(1) Special-case bgwriter_percent=100. The only reason we need to return
a list of all the dirty buffers is so that we can choose n% of them to
satisfy bgwriter_percent. That is obviously unnecessary if we have
bgwriter_percent=100. I think this change won't help most users,
*unless* we also make bgwriter_percent=100 the default setting.

(2) Remove bgwriter_percent. I have yet to hear anyone argue that
there's an actual need for bgwriter_percent in tuning bgwriter behavior,
and one less GUC var is a good thing, all else being equal. This is
effectively the same as #1 with the default changed, only less flexibility.

(3) Change the meaning of bgwriter_percent, per Simon's proposal. Make
it mean "the percentage of the buffer pool to scan, at most, to look for
dirty buffers". I don't think this is workable, at least not at this
point in the release cycle, because it means we might not smooth out
checkpoint load, one of the primary goals of the bgwriter (in this
proposal bgwriter would only ever consider writing out a small subset of
the total shared buffer cache: the least-recently-used n%, with 2% being
a suggested default). Some variant of this might be worth exploring for
8.1 though.
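
To make the difference between options (2) and (3) concrete, here is a
hedged Python sketch. It is not taken from either patch. It assumes that,
without a percentage, the scan can stop as soon as bgwriter_maxpages dirty
buffers have been found, and that the buffer list is ordered from least to
most recently used. The function names and scan_percent are hypothetical.

```python
def bgwriter_pass_no_percent(buffers, bgwriter_maxpages):
    """Option (2) sketch: no percentage, so stop once maxpages are found."""
    examined = 0
    to_flush = []
    for buf_id, is_dirty in buffers:
        if len(to_flush) >= bgwriter_maxpages:
            break  # enough work for this round; no need to scan further
        examined += 1
        if is_dirty:
            to_flush.append(buf_id)
    return examined, to_flush


def bgwriter_pass_scan_limit(buffers, scan_percent, bgwriter_maxpages):
    """Option (3) sketch: only the least-recently-used scan_percent of the
    pool is ever considered (buffers assumed ordered LRU first)."""
    limit = len(buffers) * scan_percent // 100
    examined = 0
    to_flush = []
    for buf_id, is_dirty in buffers[:limit]:
        if len(to_flush) >= bgwriter_maxpages:
            break
        examined += 1
        if is_dirty:
            to_flush.append(buf_id)
    return examined, to_flush


# Hypothetical pool: 10000 buffers, every fourth one dirty.
pool = [(i, i % 4 == 0) for i in range(10000)]
print(bgwriter_pass_no_percent(pool, 100)[0])     # buffers examined, option (2)
print(bgwriter_pass_scan_limit(pool, 2, 100)[0])  # buffers examined, option (3)
```

In the option (3) model, dirty buffers outside the first 2% of the list are
never written by the bgwriter, which is the checkpoint-smoothing concern
raised above.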

A patch (implementing #2) is attached -- any benchmark results would be
helpful. Increasing shared_buffers (to 10,000 or more) should make the
problem noticeable.

Opinions on which route is the best, or on some alternative solution? My
inclination is toward #2, but I'm not dead-set on it.

-Neil

[1] http://archives.postgresql.org/pgsql-hackers/2004-12/msg00386.php

Attachment: bgwriter_rem_percent-1.patch (text/x-patch, 16.8 KB)

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-14 13:54:40
Message-ID: 200412141354.iBEDse622726@candle.pha.pa.us

Neil Conway wrote:
> (2) Remove bgwriter_percent. I have yet to hear anyone argue that
> there's an actual need for bgwriter_percent in tuning bgwriter behavior,
> and one less GUC var is a good thing, all else being equal. This is
> effectively the same as #1 with the default changed, only less flexibility.

I prefer #2, and agree with you and Simon that something has to be done
for 8.0.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-14 14:23:27
Message-ID: 14167.1103034207@sss.pgh.pa.us

Neil Conway <neilc(at)samurai(dot)com> writes:
> ...
> (2) Remove bgwriter_percent. I have yet to hear anyone argue that
> there's an actual need for bgwriter_percent in tuning bgwriter behavior,
> ...

Of the three offered solutions, I agree that that makes the most sense
(unless Jan steps up with a strong argument why this knob is needed).

However, due consideration should also be given to

(4) Do nothing until 8.1.

At this point in the release cycle I'm not sure we should be making
any significant changes for anything less than a crashing bug.

> A patch (implementing #2) is attached -- any benchmark results would be
> helpful. Increasing shared_buffers (to 10,000 or more) should make the
> problem noticeable.

I'd want to see some pretty impressive benchmark results before we
consider making a change now.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-14 15:23:49
Message-ID: 41BF0585.5080907@dunslane.net

Tom Lane wrote:

>However, due consideration should also be given to
>
>(4) Do nothing until 8.1.
>
>At this point in the release cycle I'm not sure we should be making
>any significant changes for anything less than a crashing bug.

If that's not the policy, then I don't understand the dev cycle state
labels used.

In the commercial world, my approach would be that if this was
determined to be necessary (about which I am moderately agnostic) then
we would abort the current RC stage, effectively postponing the release.

cheers

andrew


From: Neil Conway <neilc(at)samurai(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-15 00:18:17
Message-ID: 1103069897.28882.43.camel@localhost.localdomain

On Tue, 2004-12-14 at 09:23 -0500, Tom Lane wrote:
> At this point in the release cycle I'm not sure we should be making
> any significant changes for anything less than a crashing bug.

Yes, that's true, and I am definitely hesitant to make changes during
RC. That said, "adjust bgwriter defaults" has been on the "open items"
list for quite some time -- in some sense #2 is just a variant on that
idea.

> I'd want to see some pretty impressive benchmark results before we
> consider making a change now.

The results posted at

http://archives.postgresql.org/pgsql-hackers/2004-12/msg00426.php

are from a run with a patch from Simon that implements #3. While that's not exactly
the same as #2, it does seem to suggest that the performance difference
is rather noticeable. If the problem does indeed exacerbate BufMgrLock
contention, it might be more noticeable still on an SMP machine.

I'm going to try and get some more benchmark data; if anyone else wants
to try the patch and contribute results they are welcome to.

-Neil


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Neil Conway <neilc(at)samurai(dot)com>, Mark Wong <markw(at)osdl(dot)org>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-15 09:38:26
Message-ID: 1103103505.4037.3860.camel@localhost.localdomain

On Tue, 2004-12-14 at 13:30, Neil Conway wrote:
> In recent discussion[1] with Simon Riggs, there has been some talk of
> making some changes to the bgwriter. To summarize the problem, the
> bgwriter currently scans the entire T1+T2 buffer lists and returns a
> list of all the currently dirty buffers. It then selects a subset of
> that list (computed using bgwriter_percent and bgwriter_maxpages) to
> flush to disk. Not only does this mean we can end up scanning a
> significant portion of shared_buffers for every invocation of the
> bgwriter, we also do the scan while holding the BufMgrLock, likely
> hurting scalability.

Neil's summary is very clear; many thanks.

There have been many suggestions, patches and test results, so I have
attempted to summarise everything here, using Neil's post to give
structure to the other information:

> I think a fix for this in some fashion is warranted for 8.0. Possible
> solutions:

I have added two things to this structure:
i) the name of the patch that implements each option (author's initials)
ii) the published benchmark results for those patches

> (1) Special-case bgwriter_percent=100. The only reason we need to return
> a list of all the dirty buffers is so that we can choose n% of them to
> satisfy bgwriter_percent. That is obviously unnecessary if we have
> bgwriter_percent=100. I think this change won't help most users,
> *unless* we also make bgwriter_percent=100 the default setting.

100pct.patch (SR)

Test results to date:
1. Mark Kirkwood ([HACKERS] [Testperf-general] BufferSync and bgwriter)
pgbench 1xCPU 1xDisk shared_buffers=10000
showed 8.0RC1 had regressed compared with 7.4.6, but patch improved
performance significantly against 8.0RC1

Discounted now by both Neil and myself, since the same idea has been
more generally implemented as ideas (2) and (3) below.

> (2) Remove bgwriter_percent. I have yet to hear anyone argue that
> there's an actual need for bgwriter_percent in tuning bgwriter behavior,
> and one less GUC var is a good thing, all else being equal. This is
> effectively the same as #1 with the default changed, only less flexibility.

Two published patches do the same thing:
- Partially implemented, following Neil's suggestion: bg3.patch (SR)
- Fully implemented: bgwriter_rem_percent-1.patch (NC)
The two patches have an identical effect on performance.

Test results to date:
1. Neil's testing was "inconclusive" for shared_buffers = 2500 on a
single cpu, single disk system (test used bgwriter_rem_percent-1.patch)
2. Mark Wong's OSDL tests published as test 211
analysis already posted on this thread;
dbt-2 4 CPU, many disk, shared_buffers=60000 (test used bg3.patch)
3% overall benefit, greatly reduced max transaction times
3. Mark Kirkwood's tests
pgbench 2xCPU 2xdisk, shared_buffers=10000 (test used
bgwriter_rem_percent-1.patch)
Showed slight regression against RC1 - must be test variability because
the patch does less work and is very unlikely to cause a regression

> (3) Change the meaning of bgwriter_percent, per Simon's proposal. Make
> it mean "the percentage of the buffer pool to scan, at most, to look for
> dirty buffers". I don't think this is workable, at least not at this
> point in the release cycle, because it means we might not smooth out
> checkpoint load, one of the primary goals of the bgwriter (in this
> proposal bgwriter would only ever consider writing out a small subset of
> the total shared buffer cache: the least-recently-used n%, with 2% being
> a suggested default). Some variant of this might be worth exploring for
> 8.1 though.

Implemented as bg2.patch (SR)
Contains a small bug, easily fixed, which would not affect performance

Test results to date:
1. Mark Kirkwood's tests
pgbench 2xCPU 2xdisk, shared_buffers=10000 (test used bg2.patch)
Showed improvement on RC1 and best option out of all three tests
(compared RC1, bg2.patch, bgwriter_rem_percent-1.patch), possibly
similar within bounds of test variability - but interesting enough to
investigate further.

Current situation seems to be:
- all test results indicate performance regressions in RC1 when
shared_buffers >= 10000 and using multi-cpu/multi-disk systems
- option (2) has the most thoroughly confirmable test results and is
thought by all parties to be the simplest and most robust approach.
- some more test results would be useful to compare, to ensure that
applying the patch would be useful in all circumstances.

Approach (3) looks interesting and should be investigated for 8.1, since
it introduces a subtly different algorithm that may have "interesting
flight characteristics" and is more of a risk to the 8.0 release.

Thanks very much to all performance testers. It's important work.

--
Best Regards, Simon Riggs


From: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Mark Wong <markw(at)osdl(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-15 21:32:07
Message-ID: 41C0AD57.4030003@coretech.co.nz

Simon Riggs wrote:

>100pct.patch (SR)
>
>Test results to date:
>1. Mark Kirkwood ([HACKERS] [Testperf-general] BufferSync and bgwriter)
>pgbench 1xCPU 1xDisk shared_buffers=10000
>showed 8.0RC1 had regressed compared with 7.4.6, but patch improved
>performance significantly against 8.0RC1
It occurs to me that cranking up the number of transactions (say
1000->100000) and seeing if said regression persists would be
interesting. This would give the smoothing effect of the bgwriter (plus
the ARC) a better chance to shine.

regards

Mark


From: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
To: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Neil Conway <neilc(at)samurai(dot)com>, Mark Wong <markw(at)osdl(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-20 01:17:58
Message-ID: 41C62846.6050206@coretech.co.nz

Mark Kirkwood wrote:

> It occurs to me that cranking up the number of transactions (say
> 1000->100000) and seeing if said regression persists would be
> interesting. This would give the smoothing effect of the bgwriter
> (plus the ARC) a better chance to shine.

I ran a few of these over the weekend - since it rained here :-) , and
the results are quite interesting:

[2xPIII, 2G, 2xATA RAID 0, FreeBSD 5.3 with the same non default Pg
parameters as before]

clients = 4 transactions = 100000 (/client), each test run twice

Version              tps
7.4.6                 49
8.0.0.0RC1            50
8.0.0.0RC1 + rem      49
8.0.0.0RC1 + bg2      50

Needless to say, all well within measurement error of each other (the
variability was about 1).

I suspect that my previous tests had too few transactions to trigger
many (or any) checkpoints. With them now occurring in the test, they
look to be the most significant factor (contrast with 70-80 tps for 4
clients with 1000 transactions).

Also with a small number of transactions, the fsync'ed blocks may have
all fitted in the ATA disk caches (2x2M). In hindsight I should have
disabled this! (might run the smaller no. transactions again with
hw.ata.wc=0 and see if this is enlightening)
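
A back-of-envelope check of that reasoning, as a Python sketch. Several
inputs are assumptions rather than facts from the tests: that "1000
transactions" is per client like the longer run, that checkpoint_timeout
was at its stock value of 300 seconds (the actual setting is not given
here), and that the block size is the standard 8 kB. Checkpoints forced by
WAL volume are ignored.

```python
def run_seconds(clients, transactions_per_client, tps):
    """Wall-clock length of a pgbench-style run at a steady tps."""
    return clients * transactions_per_client / tps


# Short run: 4 clients x 1000 transactions at about 75 tps
# (midpoint of the reported 70-80 tps).
short_run = run_seconds(4, 1000, 75)

# Long run: 4 clients x 100000 transactions at about 50 tps.
long_run = run_seconds(4, 100000, 50)

CHECKPOINT_TIMEOUT_S = 300  # assumed stock default, not the tester's value
timed_checkpoints_long = int(long_run // CHECKPOINT_TIMEOUT_S)

BLOCK_SIZE = 8192  # standard PostgreSQL block size in bytes
# 2 disks x 2 MB of on-disk write cache, expressed in database pages.
cache_pages = 2 * 2 * 1024 * 1024 // BLOCK_SIZE

print(round(short_run), round(long_run), timed_checkpoints_long, cache_pages)
```

Under these assumptions the short run finishes in under a minute, well
inside a single checkpoint interval, while the long run spans more than
two hours and dozens of timed checkpoints. The two disk caches together
hold only 512 pages.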

regards

Mark


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Mark Wong <markw(at)osdl(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bgwriter changes
Date: 2004-12-20 08:01:43
Message-ID: 1103529703.2893.113.camel@localhost.localdomain

On Mon, 2004-12-20 at 01:17, Mark Kirkwood wrote:
> Mark Kirkwood wrote:
>
> > It occurs to me that cranking up the number of transactions (say
> > 1000->100000) and seeing if said regression persists would be
> > interesting. This would give the smoothing effect of the bgwriter
> > (plus the ARC) a better chance to shine.
>
> I ran a few of these over the weekend - since it rained here :-) , and
> the results are quite interesting:
>
> [2xPIII, 2G, 2xATA RAID 0, FreeBSD 5.3 with the same non default Pg
> parameters as before]
>
> clients = 4 transactions = 100000 (/client), each test run twice
>
> Version              tps
> 7.4.6                 49
> 8.0.0.0RC1            50
> 8.0.0.0RC1 + rem      49
> 8.0.0.0RC1 + bg2      50
>
> Needless to say, all well within measurement error of each other (the
> variability was about 1).
>
> I suspect that my previous tests had too few transactions to trigger
> many (or any) checkpoints. With them now occurring in the test, they
> look to be the most significant factor (contrast with 70-80 tps for 4
> clients with 1000 transactions).
>
> Also with a small number of transactions, the fsync'ed blocks may have
> all fitted in the ATA disk caches (2x2M). In hindsight I should have
> disabled this! (might run the smaller no. transactions again with
> hw.ata.wc=0 and see if this is enlightening)

These test results do seem to have greatly reduced variability: thanks.

From what you say, this means the parameter settings were (?):
shared_buffers = 10000
bgwriter_delay = 200
bgwriter_maxpages = 100

My interpretation of this is that the bgwriter is not effective with
these (the default) parameter settings.

I think optimum performance comes from reducing both bgwriter_delay and
bgwriter_maxpages, though reducing the delay isn't sensibly possible
with 8.0RCn when shared_buffers is large.
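
For a sense of scale, the settings listed above put an upper bound on how
fast the bgwriter can write. The Python sketch below works that out. It
assumes bgwriter_delay is in milliseconds, a block size of 8 kB, and that
a round itself takes no time, so the figures are ceilings rather than
measurements. The function name is hypothetical.

```python
def bgwriter_ceiling(delay_ms, maxpages, shared_buffers, block_size=8192):
    """Upper bound on bgwriter output for the given settings.

    Returns (pages_per_second, mb_per_second, seconds_to_cover_pool).
    """
    rounds_per_s = 1000 / delay_ms
    pages_per_s = rounds_per_s * maxpages
    mb_per_s = pages_per_s * block_size / (1024 * 1024)
    secs_full_pool = shared_buffers / pages_per_s
    return pages_per_s, mb_per_s, secs_full_pool


# Settings from the list above.
pages_per_s, mb_per_s, secs_full_pool = bgwriter_ceiling(200, 100, 10000)
print(pages_per_s, mb_per_s, secs_full_pool)

# Halving both knobs keeps the same ceiling but issues smaller, more
# frequent batches.
print(bgwriter_ceiling(100, 50, 10000)[0])
```

With delay 200 and maxpages 100 the ceiling is 500 pages per second, a
little under 4 MB/s, so writing out a fully dirty pool of 10000 buffers
would take at least 20 seconds.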

--
Best Regards, Simon Riggs