From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Bgwriter strategies
Date: 2007-07-05 20:50:55
Message-ID: 468D59AF.1050308@enterprisedb.com

I ran some DBT-2 tests to compare different bgwriter strategies:

http://community.enterprisedb.com/bgwriter/

imola-336 was run with minimal bgwriter settings, so that most writes
are done by backends. imola-337 was patched with an implementation of
Tom's bgwriter idea, trying to aggressively keep all pages with
usage_count=0 clean. Imola-340 was with a patch along the lines of
Itagaki's original patch, ensuring that there's as many clean pages in
front of the clock head as were consumed by backends since last bgwriter
iteration.

All test runs were also patched to count the # of buffer allocations,
and # of buffer flushes performed by bgwriter and backends. Here are
those results (I hope the indentation gets through properly):

                        imola-336    imola-337    imola-340
writes by checkpoint        38302        30410        39529
writes by bgwriter         350113      2205782      1418672
writes by backends        1834333       265755       787633
writes total              2222748      2501947      2245834
allocations               2683170      2657896      2699974

It looks like Tom's idea is not a winner; it leads to more writes than
necessary. But the OS caches the writes, so let's look at the actual I/O
performed to be sure, from iostat:

http://community.enterprisedb.com/bgwriter/writes-336-337-340.jpg

The graph shows that on imola-337, there was indeed more write traffic
than on the other two test runs.

On imola-340, there's still a significant amount of backend writes. I'm
still not sure what we should be aiming at. Is 0 backend writes our goal?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-05 21:28:00
Message-ID: 21957.1183670880@sss.pgh.pa.us

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>                         imola-336    imola-337    imola-340
> writes by checkpoint        38302        30410        39529
> writes by bgwriter         350113      2205782      1418672
> writes by backends        1834333       265755       787633
> writes total              2222748      2501947      2245834
> allocations               2683170      2657896      2699974

> It looks like Tom's idea is not a winner; it leads to more writes than
> necessary.

The incremental number of writes is not that large; only about 10% more.
The interesting thing is that those "extra" writes must represent
buffers that were re-touched after their usage_count went to zero, but
before they could be recycled by the clock sweep. While you'd certainly
expect some of that, I'm surprised it is as much as 10%. Maybe we need
to play with the buffer allocation strategy some more.

The very small difference in NOTPM among the three runs says that either
this whole area is unimportant, or DBT2 isn't a good test case for it;
or maybe that there's something wrong with the patches?

> On imola-340, there's still a significant amount of backend writes. I'm
> still not sure what we should be aiming at. Is 0 backend writes our goal?

Well, the lower the better, but not at the cost of a very large increase
in total writes.

> Imola-340 was with a patch along the lines of
> Itagaki's original patch, ensuring that there's as many clean pages in
> front of the clock head as were consumed by backends since last bgwriter
> iteration.

This seems intuitively wrong, since in the presence of bursty request
behavior it'll constantly be getting caught short of buffers. I think
you need a safety margin and a moving-average decay factor. Possibly
something like

buffers_to_clean = Max(buffers_used * 1.1,
                       buffers_to_clean * 0.999);

where buffers_used is the current observation of demand. This would
give us a safety margin such that buffers_to_clean is not less than
the largest demand observed in the last 100 iterations (0.999 ^ 100
is about 0.90, cancelling out the initial 10% safety margin), and it
takes quite a while for the memory of a demand spike to be forgotten
completely.
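
As a rough standalone sketch of that update (names invented here, not
taken from any patch; fmax() standing in for the backend's Max() macro),
the way a demand spike is remembered and then slowly forgotten can be
traced like this:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    /* per-cycle demand seen by the bgwriter: a spike, then steady load */
    int     demand[] = {20, 20, 1000, 20, 20, 20, 20, 20};
    double  buffers_to_clean = 0.0;
    int     i;

    for (i = 0; i < (int) (sizeof(demand) / sizeof(demand[0])); i++)
    {
        buffers_to_clean = fmax(demand[i] * 1.1,           /* 10% margin */
                                buffers_to_clean * 0.999); /* slow decay */
        printf("cycle %d: used %4d -> buffers_to_clean %.1f\n",
               i, demand[i], buffers_to_clean);
    }
    return 0;
}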

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-05 21:59:46
Message-ID: Pine.GSO.4.64.0707051724110.1437@westnet.com

On Thu, 5 Jul 2007, Heikki Linnakangas wrote:

> It looks like Tom's idea is not a winner; it leads to more writes than
> necessary.

What I came away with as the core of Tom's idea is that the cleaning/LRU
writer shouldn't ever scan the same section of the buffer cache twice,
because anything that resulted in a new dirty buffer will be unwritable by
it until the clock sweep passes over it. I never took that to mean that
idea necessarily had to be implemented as "trying to aggressively keep all
pages with usage_count=0 clean".

I've been making slow progress on this myself, and the question I've been
trying to answer is whether this fundamental idea really matters or not.
One clear benefit that alternate implementation should allow is setting
a lower value for the interval without being as concerned that you're
wasting resources by doing so, which I've found to be a problem with the
current implementation--it will consume a lot of CPU scanning the same
section right now if you lower that too much.

As far as your results, first off I'm really glad to see someone else
comparing checkpoint/backend/bgwriter writes the same way I've been doing,
so I finally have someone else's results to compare against. I expect that the
optimal approach here is a hybrid one that structures scanning the buffer
cache the new way Tom suggests, but limits the number of writes to "just
enough". I happen to be fond of the "just enough" computation based on a
weighted moving average I wrote before, but there's certainly room for
multiple implementations of that part of the code to evolve.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 08:13:35
Message-ID: Pine.GSO.4.64.0707060352250.22042@westnet.com

On Thu, 5 Jul 2007, Tom Lane wrote:

> This would give us a safety margin such that buffers_to_clean is not
> less than the largest demand observed in the last 100 iterations...and
> it takes quite a while for the memory of a demand spike to be forgotten
> completely.

If you tested this strategy even on a steady load, I'd expect you'll find
there are large spikes in allocations during the occasional period where
everything is just right to pull a bunch of buffers in, and if you let
that max linger around for 100 iterations you'll write a large number of
buffers more than you need. That's what I saw when I tried to remember
too much information about allocation history in the version of the auto
LRU tuner I worked on. For example, with 32000 buffers, with pgbench
trying to UPDATE as fast as possible I sometimes hit more than 1500
allocations in an interval, but the steady-state allocation level was
closer to 500.

I ended up settling on max(moving average of the last 16, most recent
allocation), and that seemed to work pretty well without being too
wasteful from excessive writes. Playing with multiples of 2, 8 was
definitely not enough memory to smooth usefully, while 32 seemed a little
sluggish on the entry and wasteful on the exit ends.

At the default interval, 16 iterations is looking back at the previous 3.2
seconds. I have a feeling the proper tuning for this should be
time-based, where you would decide how long ago to consider looking back
for and compute the iterations based on that.
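
A rough sketch of that computation (invented names, nothing from the
actual tuner code): the ~16-interval moving average done in the
multiplication form, clamped so a fresh spike is honored immediately:

#include <stdio.h>

/* smoothing weight roughly equivalent to a 16-interval moving average */
#define SMOOTHING_SAMPLES 16.0

static double smoothed_alloc = 0.0;

static int
cleaning_target(int recent_alloc)
{
    /* multiplication form of the moving average: no history array needed */
    smoothed_alloc += ((double) recent_alloc - smoothed_alloc)
                      / SMOOTHING_SAMPLES;

    /* never plan for less than what was just observed */
    return (recent_alloc > (int) smoothed_alloc) ? recent_alloc
                                                 : (int) smoothed_alloc;
}

int
main(void)
{
    /* steady ~500 allocations per interval with an occasional 1500 spike */
    int demand[] = {500, 500, 1500, 500, 500, 500, 500, 500};
    int i;

    for (i = 0; i < (int) (sizeof(demand) / sizeof(demand[0])); i++)
        printf("alloc %4d -> target %4d\n",
               demand[i], cleaning_target(demand[i]));
    return 0;
}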

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 09:55:28
Message-ID: 468E1190.8050902@enterprisedb.com

Greg Smith wrote:
> On Thu, 5 Jul 2007, Heikki Linnakangas wrote:
>
>> It looks like Tom's idea is not a winner; it leads to more writes than
>> necessary.
>
> What I came away with as the core of Tom's idea is that the cleaning/LRU
> writer shouldn't ever scan the same section of the buffer cache twice,
> because anything that resulted in a new dirty buffer will be unwritable
> by it until the clock sweep passes over it. I never took that to mean
> that idea necessarily had to be implemented as "trying to aggressively
> keep all pages with usage_count=0 clean".
>
> I've been making slow progress on this myself, and the question I've
> been trying to answer is whether this fundamental idea really matters or
> not. One clear benefit that alternate implementation should allow is
> setting a lower value for the interval without being as concerned that
> you're wasting resources by doing so, which I've found to be a problem
> with the current implementation--it will consume a lot of CPU scanning
> the same section right now if you lower that too much.

Yes: ignoring the CPU overhead of scanning the same section over and
over again, Tom's proposal is the same as setting both bgwriter_lru_*
settings all the way up to the max. In fact I ran a DBT-2 test like
that as well, and the # of writes was indeed the same, just with much
higher CPU usage. It's clear that scanning the same section over and
over again has been a waste of time in previous releases.

As a further data point, I constructed a smaller test case that performs
random DELETEs on a table using an index. I varied the # of
shared_buffers, and ran the test with bgwriter disabled, or tuned all
the way up to the maximum. Here are the results from that:

shared_buffers | writes | writes | writes_ratio
----------------+--------+--------+-------------------
2560 | 86936 | 88023 | 1.01250345081439
5120 | 81207 | 84551 | 1.04117871612053
7680 | 75367 | 80603 | 1.06947337694216
10240 | 69772 | 74533 | 1.06823654187926
12800 | 64281 | 69237 | 1.07709898725907
15360 | 58515 | 64735 | 1.10629753054772
17920 | 53231 | 58635 | 1.10151979109917
20480 | 48128 | 54403 | 1.13038148271277
23040 | 43087 | 49949 | 1.15925917330053
25600 | 39062 | 46477 | 1.1898264297783
28160 | 35391 | 43739 | 1.23587917832217
30720 | 32713 | 37480 | 1.14572188426619
33280 | 31634 | 31677 | 1.00135929695897
35840 | 31668 | 31717 | 1.00154730327144
38400 | 31696 | 31693 | 0.999905350832913
40960 | 31685 | 31730 | 1.00142023039293
43520 | 31694 | 31650 | 0.998611724616647
46080 | 31661 | 31650 | 0.999652569407157

The first writes column is the # of writes with bgwriter disabled, the
2nd column is with the aggressive bgwriter. The table size is 33334
pages, so once shared_buffers exceeds that, the table fits in cache and
the bgwriter strategy makes no difference.

> As far as your results, first off I'm really glad to see someone else
> comparing checkpoint/backend/bgwriter writes the same way I've been doing,
> so I finally have someone else's results to compare against. I expect that
> the optimal approach here is a hybrid one that structures scanning the
> buffer cache the new way Tom suggests, but limits the number of writes
> to "just enough". I happen to be fond of the "just enough" computation
> based on a weighted moving average I wrote before, but there's certainly
> room for multiple implementations of that part of the code to evolve.

We need to get the requirements straight.

One goal of bgwriter is clearly to keep just enough buffers clean in
front of the clock hand so that backends don't need to do writes
themselves until the next bgwriter iteration. But not any more than
that, otherwise we might end up doing more writes than necessary if some
of the buffers are redirtied.

To deal with bursty workloads, for example a batch of 2 GB worth of
inserts coming in every 10 minutes, it seems we want to keep doing a
little bit of cleaning even when the system is idle, to prepare for the
next burst. The idea is to smoothen the physical I/O bursts; if we don't
clean the dirty buffers left over from the previous burst during the
idle period, the I/O system will be bottlenecked during the bursts, and
sit idle otherwise.

To strike a balance between cleaning buffers ahead of possible bursts in
the future and not doing unnecessary I/O when no such bursts come, I
think a reasonable strategy is to write buffers with usage_count=0 at a
slow pace when there's no buffer allocations happening.

To smoothen the small variations on a relatively steady workload, the
weighted average sounds good.
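
A bare-bones sketch of that combined policy (hypothetical names;
clean_ahead_of_clock_hand() is only a stand-in for whatever routine
writes out dirty usage_count=0 buffers in front of the clock hand, and
clean_target would come from the "just enough" estimate such as the
weighted average):

/* stand-in for the routine that advances in front of the clock hand and
 * writes out dirty usage_count==0 buffers, stopping after 'n' of them */
static void
clean_ahead_of_clock_hand(int n)
{
    (void) n;
}

#define IDLE_TRICKLE_PAGES 5    /* arbitrary slow pace for idle periods */

/* called once per bgwriter_delay */
static void
bgwriter_cycle(int recent_allocs, int clean_target)
{
    if (recent_allocs == 0)
        clean_ahead_of_clock_hand(IDLE_TRICKLE_PAGES);  /* prepare for bursts */
    else
        clean_ahead_of_clock_hand(clean_target);        /* just enough, no more */
}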

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 10:01:30
Message-ID: 468E12FA.70603@enterprisedb.com

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>>                         imola-336    imola-337    imola-340
>> writes by checkpoint        38302        30410        39529
>> writes by bgwriter         350113      2205782      1418672
>> writes by backends        1834333       265755       787633
>> writes total              2222748      2501947      2245834
>> allocations               2683170      2657896      2699974
>
>> It looks like Tom's idea is not a winner; it leads to more writes than
>> necessary.
>
> The incremental number of writes is not that large; only about 10% more.
> The interesting thing is that those "extra" writes must represent
> buffers that were re-touched after their usage_count went to zero, but
> before they could be recycled by the clock sweep. While you'd certainly
> expect some of that, I'm surprised it is as much as 10%. Maybe we need
> to play with the buffer allocation strategy some more.
>
> The very small difference in NOTPM among the three runs says that either
> this whole area is unimportant, or DBT2 isn't a good test case for it;
> or maybe that there's something wrong with the patches?
>
>> On imola-340, there's still a significant amount of backend writes. I'm
>> still not sure what we should be aiming at. Is 0 backend writes our goal?
>
> Well, the lower the better, but not at the cost of a very large increase
> in total writes.
>
>> Imola-340 was with a patch along the lines of
>> Itagaki's original patch, ensuring that there's as many clean pages in
>> front of the clock head as were consumed by backends since last bgwriter
>> iteration.
>
> This seems intuitively wrong, since in the presence of bursty request
> behavior it'll constantly be getting caught short of buffers. I think
> you need a safety margin and a moving-average decay factor. Possibly
> something like
>
> buffers_to_clean = Max(buffers_used * 1.1,
> buffers_to_clean * 0.999);
>
> where buffers_used is the current observation of demand. This would
> give us a safety margin such that buffers_to_clean is not less than
> the largest demand observed in the last 100 iterations (0.999 ^ 100
> is about 0.90, cancelling out the initial 10% safety margin), and it
> takes quite a while for the memory of a demand spike to be forgotten
> completely.

That would be overly aggressive on a workload that's steady on average,
but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0
100. You'd end up writing ~100 pages on every bgwriter round, but you
only need an average of 20 pages per round. That'd be effectively the
same as keeping all buffers with usage_count=0 clean.
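
(Tracing the formula through that pattern once it has settled: the round
with demand 100 sets the target to max(100 * 1.1, ...) = 110, and the
four idle rounds only decay it to about 110 * 0.999^4 = 109.6 before the
next burst lifts it back to 110; so the target hovers near 110 while the
average demand is 20 per round.)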

BTW, I believe that kind of workload is actually very common. That's
what you get if one transaction causes say 10-100 buffer allocations,
and you execute one such transaction every few seconds.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 10:08:57
Message-ID: Pine.GSO.4.64.0707060543330.3474@westnet.com

I just got my own first set of useful tests using the new "remember
where you last scanned to" BGW implementation suggested by Tom. What I
did was keep the existing % to scan, but cut back the number to scan when
so close to a complete lap ahead of the strategy point that I'd cross it
if I scanned that much. So when the system was idle, it would very
quickly catch up with the strategy point, but if the %/max numbers were
low it's possible for it to fall behind.
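
Roughly, the capping step described above might look like this (invented
names, not the actual patch):

/* How many buffers to scan this round: the usual percentage of the pool
 * capped by max_pages, but never enough to lap the strategy point. */
static int
buffers_to_scan(int nbuffers, int cleaning_hand, int strategy_point,
                double scan_percent, int max_pages)
{
    int want = (int) (nbuffers * scan_percent / 100.0);
    /* forward distance from the cleaning hand to the strategy point */
    int room = (strategy_point - cleaning_hand + nbuffers) % nbuffers;

    if (want > max_pages)
        want = max_pages;
    return (want < room) ? want : room;
}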

My workload was just the UPDATE statement out of pgbench with a database
of scale 25 (~400MB, picked so most operations were in memory), which
pushes lots of things in and out of the buffer cache as fast as possible.

Here's some data with no background writer at all:

clients     tps   buf_clean   buf_backend   buf_alloc
      1    1340           0         72554       96846
      2    1421           0         73969       88879
      3    1418           0         71452       86339
      4    1344           0         75184       90187
      8    1361           0         73063       88099
     15    1348           0         71861       86923

And here's what I got with the new approach, using 10% for the scan
percentage and a maximum of 200 buffers written out. I picked those
numbers after some experimentation because they were the first I found
where the background writer was almost always riding right behind the
strategy point; with lower numbers, when the background writer woke up it
often found it had already fallen behind the strategy point and had to
start cleaning forward the old way instead, which wasn't what I wanted to
test.

clients     tps   buf_clean   buf_backend   buf_alloc
      1    1261      122917           150      105655
      2    1186      126663            26       97586
      3    1154      127780            21       98077
      4    1181      127685            19       98068
      8    1076      128597             2       98229
     15    1065      128399             5       98143

As you can see, I achieved the goal of almost never having a backend write
its own buffer, so yay for that. That's the only good thing I can say
about it though. The TPS results take a moderate dive, and there's about
10% more buffer allocations. The big and obvious issue is that I'm
writing almost 75% more buffers this way--way worse even than the 10%
extra overhead Heikki was seeing. But since I was going out of my way to
find a worst-case for this code, I consider mission accomplished there.

Anyway, will have more detailed reports to post after I collect some more
data; for now I just wanted to join Heikki in confirming that the strategy
of trying to get the LRU cleaner to ride right behind the strategy point
can really waste a whole lot of writes.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 10:30:39
Message-ID: 468E19CF.8040004@enterprisedb.com

Greg Smith wrote:
> As you can see, I achieved the goal of almost never having a backend
> write its own buffer, so yay for that. That's the only good thing I
> can say about it though. The TPS results take a moderate dive, and
> there's about 10% more buffer allocations. The big and obvious issue
> is that I'm writing almost 75% more buffers this way--way worse even
> than the 10% extra overhead Heikki was seeing. But since I was going out
> of my way to find a worst-case for this code, I consider mission
> accomplished there.

There's something wrong with that. The number of buffer allocations
shouldn't depend on the bgwriter strategy at all.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 10:47:54
Message-ID: Pine.GSO.4.64.0707060638520.3474@westnet.com

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

> There's something wrong with that. The number of buffer allocations shouldn't
> depend on the bgwriter strategy at all.

I was seeing a smaller (closer to 5%) increase in buffer allocations
switching from no background writer to using the stock one before I did
any code tinkering, so it didn't strike me as odd. I believe it's related
to the TPS numbers. When there are more transactions being executed per
unit time, it's more likely the useful blocks will stay in memory because
their usage_count is getting tickled faster, and therefore there's less of
the most useful blocks being swapped out only to be re-allocated again
later.

Since the bad bgwriter tunings reduce TPS, I believe that's the mechanism
by which there are more allocations needed. I'll try to keep an eye on
this now that you've brought it up.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 10:50:51
Message-ID: 1183719051.4488.35.camel@ebony.site

On Fri, 2007-07-06 at 10:55 +0100, Heikki Linnakangas wrote:

> We need to get the requirements straight.
>
> One goal of bgwriter is clearly to keep just enough buffers clean in
> front of the clock hand so that backends don't need to do writes
> themselves until the next bgwriter iteration. But not any more than
> that, otherwise we might end up doing more writes than necessary if some
> of the buffers are redirtied.

The purpose of the WAL/shared buffer cache is to avoid having to write
all of the data blocks touched by a transaction to disk before end of
transaction, which would increase request response time. That purpose is
fulfilled only if using the shared buffer cache does not require us to
write out someone else's dirty buffers while avoiding our own. The
bgwriter exists specifically to clean the dirty buffers, so that users
do not have to clean theirs or anybody else's dirty buffers.

> To deal with bursty workloads, for example a batch of 2 GB worth of
> inserts coming in every 10 minutes, it seems we want to keep doing a
> little bit of cleaning even when the system is idle, to prepare for the
> next burst. The idea is to smoothen the physical I/O bursts; if we don't
> clean the dirty buffers left over from the previous burst during the
> idle period, the I/O system will be bottlenecked during the bursts, and
> sit idle otherwise.

In short, bursty workloads are the normal situation.

When capacity is not saturated the bgwriter can utilise the additional
capacity to reduce statement response times.

It is standard industry practice to avoid running a system at peak
throughput for long periods of time, so DBT-2 does not represent a
normal situation. This is because the response times are only
predictable on a non-saturated system and most apps have some implicit
or explicit service level objective.

However, the server needs to cope with periods of saturation, so must be
able to perform efficiently during those times.

So I see there are two modes of operation:

i) dirty block write offload when capacity is available
ii) efficient operation when server is saturated.

DBT-2 represents only the second mode of operation; the two modes are
equally important, yet mode i) is the ideal situation.

> To strike a balance between cleaning buffers ahead of possible bursts in
> the future and not doing unnecessary I/O when no such bursts come, I
> think a reasonable strategy is to write buffers with usage_count=0 at a
> slow pace when there's no buffer allocations happening.

Agreed.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 10:53:04
Message-ID: 1183719184.4488.38.camel@ebony.site

On Thu, 2007-07-05 at 21:50 +0100, Heikki Linnakangas wrote:

> All test runs were also patched to count the # of buffer allocations,
> and # of buffer flushes performed by bgwriter and backends. Here's those
> results (I hope the intendation gets through properly):
>
>                         imola-336    imola-337    imola-340
> writes by checkpoint        38302        30410        39529
> writes by bgwriter         350113      2205782      1418672
> writes by backends        1834333       265755       787633
> writes total              2222748      2501947      2245834
> allocations               2683170      2657896      2699974

These results may show that even the minimum bgwriter_delay of 10ms is
too large for these workloads: whatever the strategy used, the bgwriter
spends too much time sleeping when it should be working.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 10:55:23
Message-ID: 468E1F9B.2020904@enterprisedb.com

Greg Smith wrote:
> On Fri, 6 Jul 2007, Heikki Linnakangas wrote:
>
>> There's something wrong with that. The number of buffer allocations
>> shouldn't depend on the bgwriter strategy at all.
>
> I was seeing a smaller (closer to 5%) increase in buffer allocations
> switching from no background writer to using the stock one before I did
> any code tinkering, so it didn't strike me as odd. I believe it's
> related to the TPS numbers. When there are more transactions being
> executed per unit time, it's more likely the useful blocks will stay in
> memory because their usage_count is getting tickled faster, and
> therefore there's less of the most useful blocks being swapped out only
> to be re-allocated again later.

Did you run the test for a constant number of transactions? If you did,
the access pattern and the number of allocations should be *exactly* the
same with 1 client, assuming the initial state and the seed used for the
random number generator are the same.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 15:44:47
Message-ID: 5244.1183736687@sss.pgh.pa.us

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> buffers_to_clean = Max(buffers_used * 1.1,
>> buffers_to_clean * 0.999);

> That would be overly aggressive on a workload that's steady on average,
> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0
> 100. You'd end up writing ~100 pages on every bgwriter round, but you
> only need an average of 20 pages per round.

No, you wouldn't be *writing* that many, you'd only be keeping that many
*clean*; which only costs more work if any of them get re-dirtied
between writing and use. Which is a fairly small probability if we're
talking about a small difference in the number of buffers to keep clean.
So I think the average number of writes is hardly different, it's just
that the backends are far less likely to have to do any of them.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 15:47:19
Message-ID: 468E6407.5030809@enterprisedb.com

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> buffers_to_clean = Max(buffers_used * 1.1,
>>> buffers_to_clean * 0.999);
>
>> That would be overly aggressive on a workload that's steady on average,
>> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0
>> 100. You'd end up writing ~100 pages on every bgwriter round, but you
>> only need an average of 20 pages per round.
>
> No, you wouldn't be *writing* that many, you'd only be keeping that many
> *clean*; which only costs more work if any of them get re-dirtied
> between writing and use. Which is a fairly small probability if we're
> talking about a small difference in the number of buffers to keep clean.
> So I think the average number of writes is hardly different, it's just
> that the backends are far less likely to have to do any of them.

Ah, ok, I misunderstood what you were proposing. Yes, that seems like a
good algorithm then.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 16:09:25
Message-ID: 87bqepy0ne.fsf@oxford.xeocode.com

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> That would be overly aggressive on a workload that's steady on average,
>> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0
>> 100. You'd end up writing ~100 pages on every bgwriter round, but you
>> only need an average of 20 pages per round.
>
> No, you wouldn't be *writing* that many, you'd only be keeping that many
> *clean*; which only costs more work if any of them get re-dirtied
> between writing and use. Which is a fairly small probability if we're
> talking about a small difference in the number of buffers to keep clean.
> So I think the average number of writes is hardly different, it's just
> that the backends are far less likely to have to do any of them.

Well, Postgres's hint bits tend to redirty pages precisely once, at just about
the time when they're ready to be paged out. But I think there are things we
can do to tackle that head-on.

Bgwriter could try to set hint bits before cleaning these pages for example.
Or we could elect in selected circumstances not to write out a page that is
hint-bit-dirty-only. Or some combination of those options depending on the
circumstances. Figuring out the circumstances is the hard part.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 16:15:44
Message-ID: 5908.1183738544@sss.pgh.pa.us

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> On Thu, 5 Jul 2007, Tom Lane wrote:
>> This would give us a safety margin such that buffers_to_clean is not
>> less than the largest demand observed in the last 100 iterations...and
>> it takes quite a while for the memory of a demand spike to be forgotten
>> completely.

> If you tested this strategy even on a steady load, I'd expect you'll find
> there are large spikes in allocations during the occasional period where
> everything is just right to pull a bunch of buffers in, and if you let
> that max linger around for 100 iterations you'll write a large number of
> buffers more than you need.

You seem to have the same misunderstanding as Heikki. What I was
proposing was not a target for how many to *write* on each cycle, but
a target for how far ahead of the clock sweep hand to look. If say
the target is 100, we'll scan forward from the sweep until we have seen
100 clean zero-usage-count buffers; but we only have to write whichever
of them weren't already clean.
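
A toy model of that scan, with invented names and a fake buffer struct
just to make the bookkeeping concrete:

#include <stdbool.h>

typedef struct
{
    int  usage_count;
    bool is_dirty;
} ToyBuffer;

/* Starting at the clock-sweep hand, keep scanning until 'target' clean,
 * usage_count==0 buffers have been seen, writing out the dirty
 * zero-usage ones encountered along the way.  Gives up after one full
 * lap so it cannot loop forever. */
static void
clean_ahead(ToyBuffer *buf, int nbuffers, int sweep_hand, int target)
{
    int seen_clean = 0;
    int scanned = 0;
    int pos = sweep_hand;

    while (seen_clean < target && scanned < nbuffers)
    {
        ToyBuffer *b = &buf[pos];

        if (b->usage_count == 0)
        {
            if (b->is_dirty)
                b->is_dirty = false;    /* "write" it; now counts as clean */
            seen_clean++;
        }
        pos = (pos + 1) % nbuffers;
        scanned++;
    }
}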

This is actually not so different from my previous proposal, in that the
idea is to keep ahead of the sweep by a particular distance. The
previous idea was that that distance was "all the buffers", whereas this
idea is "a moving average of the actual demand rate". The excess writes
created by the previous proposal were because of the probability of
re-dirtying buffers between cleaning and recycling. We reduce that
probability by not trying to keep so many of 'em clean. But I think
that we can meet the goal of having backends do hardly any of the writes
with a relatively small increase in the target distance, and thus a
relatively small differential in the number of wasted writes. Heikki's
test showed that Itagaki-san's patch wasn't doing that well in
eliminating writes by backends, so we need a more aggressive target for
how many buffers to keep clean than it has; but I think not a huge
amount more, and thus my proposal.

BTW, somewhere upthread you suggested combining the target-distance
idea with the idea that the cleaning work uses a separate sweep hand and
thus doesn't re-examine the same buffers on every bgwriter iteration.
The problem is that it'd be very hard to track how far ahead of the
recycling sweep hand we are, because that number has to be measured
in usage-count-zero pages. I see no good way to know how many of the
pages we scanned before have been touched (and given nonzero usage
counts) unless we rescan them.

We could approximate it maybe: try to keep the cleaning hand N total
buffers ahead of the recycling hand, where N is the target number of
clean usage-count-zero buffers scaled by the average fraction of
count-zero buffers (which we can track a moving average of as we advance
the recycling hand). However I'm not sure the complexity and
uncertainty is worth it. What I took away from Heikki's experiment is
that trying to stay a large distance in front of the recycle sweep
isn't actually so useful because you get too many wasted writes due
to re-dirtying. So restructuring the algorithm to make it cheap
CPU-wise to stay well ahead is not so useful either.

> I ended up settling on max(moving average of the last 16,most recent
> allocation), and that seemed to work pretty well without being too
> wasteful from excessive writes.

I've been doing moving averages for years and years, and I find that the
multiplication approach works at least as well as explicitly storing the
last K observations. It takes a lot less storage and arithmetic too.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 18:08:18
Message-ID: 468E8512.2060102@enterprisedb.com

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>>                         imola-336    imola-337    imola-340
>> writes by checkpoint        38302        30410        39529
>> writes by bgwriter         350113      2205782      1418672
>> writes by backends        1834333       265755       787633
>> writes total              2222748      2501947      2245834
>> allocations               2683170      2657896      2699974
>
>> It looks like Tom's idea is not a winner; it leads to more writes than
>> necessary.
>
> The incremental number of writes is not that large; only about 10% more.
> The interesting thing is that those "extra" writes must represent
> buffers that were re-touched after their usage_count went to zero, but
> before they could be recycled by the clock sweep. While you'd certainly
> expect some of that, I'm surprised it is as much as 10%. Maybe we need
> to play with the buffer allocation strategy some more.
>
> The very small difference in NOTPM among the three runs says that either
> this whole area is unimportant, or DBT2 isn't a good test case for it;
> or maybe that there's something wrong with the patches?

The small difference in NOTPM is because the I/O still wasn't saturated
even with 10% extra writes.

I ran more tests with a higher number of warehouses, and the extra
writes start to show in the response times. See tests 341-344:
http://community.enterprisedb.com/bgwriter/.

I scheduled a test with the moving average method as well, we'll see how
that fares.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 18:57:14
Message-ID: Pine.GSO.4.64.0707061442170.22467@westnet.com

On Fri, 6 Jul 2007, Tom Lane wrote:

> The problem is that it'd be very hard to track how far ahead of the
> recycling sweep hand we are, because that number has to be measured
> in usage-count-zero pages. I see no good way to know how many of the
> pages we scanned before have been touched (and given nonzero usage
> counts) unless we rescan them.

I've actually been working on how to address that specific problem without
expressly tracking the contents of the buffer cache. When the background
writer is called, it finds out how many buffers were allocated and how far
the sweep point moved since the last call. From that, you can calculate
how many buffers on average need to be scanned per allocation, which tells
you something about the recently encountered density of 0-usage count
buffers. My thought was to use that as an input to the computation for
how far ahead to stay.
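
A sketch of that estimate (invented names; only an approximation of the
idea, not working code from the tuner):

/* Estimate how far ahead of the sweep point to stay, without inspecting
 * the buffer cache: from how far the sweep hand moved per allocation
 * since the last call, infer the density of reusable (usage_count==0)
 * buffers and scale the clean-ahead target by it. */
static int
clean_ahead_distance(int buffers_allocated, int sweep_hand_moved,
                     int clean_target)
{
    double scans_per_alloc;

    if (buffers_allocated <= 0)
        return clean_target;        /* idle interval: no new information */

    scans_per_alloc = (double) sweep_hand_moved / buffers_allocated;
    if (scans_per_alloc < 1.0)
        scans_per_alloc = 1.0;      /* never shrink below the raw target */

    /* expect to examine about this many buffers to find clean_target
     * zero-usage ones */
    return (int) (clean_target * scans_per_alloc);
}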

> I've been doing moving averages for years and years, and I find that the
> multiplication approach works at least as well as explicitly storing the
> last K observations. It takes a lot less storage and arithmetic too.

I was simplifying the description just to comment on the range for K; I
was using a multiplication approach for the computation.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 20:10:32
Message-ID: 468EA1B8.9040607@enterprisedb.com

Heikki Linnakangas wrote:
> I scheduled a test with the moving average method as well, we'll see how
> that fares.

Not too well :(.

Strange. The total # of writes is on par with having bgwriter disabled,
but the physical I/O graphs show more I/O (on par with the aggressive
bgwriter), and the response times are higher.

I just noticed that on the tests with the moving average, or the simple
"just enough" method, there's a small bump in the CPU usage during the
ramp up period. I believe that's because bgwriter scans through the
whole buffer cache without finding enough buffers to clean. I ran some
tests earlier with unpatched bgwriter tuned to the maximum, and it used
~10% of CPU, which is the same level that the bump rises to.
Unfortunately I haven't been taking pg_buffercache snapshots until after
the ramp up; it should've shown up there.

I've been running these tests with bgwriter_delay of 10 ms, which is
probably too aggressive. I used that to test the idea of starting the
scan from where it left off, instead of always starting from the clock hand.

If someone wants to have a look, the # of writes are collected to a
separate log file in <test number>/server/buf_alloc_stats.log. There's
no link to it from the html files. There's also summary snapshots of
pg_buffercache every 30 seconds in <test number>/server/bufcache.log.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-06 21:07:16
Message-ID: Pine.GSO.4.64.0707061701250.28808@westnet.com

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

> I've been running these test with bgwriter_delay of 10 ms, which is probably
> too aggressive.

Even on relatively high-end hardware, I've found it hard to get good
results out of the BGW with the delay under 50ms--particularly when trying
to do some more complicated smoothing.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-07 08:02:07
Message-ID: Pine.GSO.4.64.0707060616490.3474@westnet.com

On Fri, 6 Jul 2007, Heikki Linnakangas wrote:

> To strike a balance between cleaning buffers ahead of possible bursts in the
> future and not doing unnecessary I/O when no such bursts come, I think a
> reasonable strategy is to write buffers with usage_count=0 at a slow pace
> when there's no buffer allocations happening.

One idea I had there was to always scan max_pages buffers each time even
if there were fewer allocations than needed for that. That number is
usually relatively small compared to the size of the buffer cache, so it
would creep through the buffer cache at a bounded pace during idle
periods. It's actually nice to watch the LRU cleaner get so far ahead
during idle spots that it catches the strategy point, so that when the
next burst comes, it doesn't have to do anything until there's a full lap
by the clock sweep.

Anyway, completely with you on the rest of this post, everything you said
matches the direction I've been trudging toward.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-11 10:01:21
Message-ID: 4694AA71.7040701@enterprisedb.com

In the last couple of days, I've been running a lot of DBT-2 tests and
smaller microbenchmarks with different bgwriter settings and
experimental patches, but I have not been able to produce a repeatable
test case where any of the bgwriter configurations perform better than
not having bgwriter at all.

I encountered a strange phenomenon that I don't understand. I ran a
small test case with DELETEs in random order, using an index, on a
~300MB table, with shared_buffers smaller than that. I expected that to
be dominated by the speed postgres can swap pages in and out of the
shared buffer cache, but surprisingly the test starts to block on the
write I/O, even though the table fits completely in OS cache. I was able
to reproduce the phenomenon with a simple C program that writes 8k
blocks in random order to a fixed size file. I've attached it along with
output of running it on my test server. The output shows how the writes
start to periodically block after a while. I was able to reproduce the
problem on my laptop as well. Can anyone explain what's going on?
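
For readers without the attachment, a rough reconstruction of that kind
of test program (not the actual writetest.c) could look like the
following; it assumes the target file has already been created at full
size, e.g. with dd:

/* write 8 kB blocks at random offsets within an existing fixed-size
 * file, reporting how long each batch takes so that stalls show up */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCKSZ     8192
#define FILEBLOCKS  32768           /* 256 MB worth of blocks */
#define NWRITES     200000
#define REPORT      10000

int
main(void)
{
    char    block[BLOCKSZ];
    int     fd = open("writetest.data", O_WRONLY);
    time_t  start = time(NULL);
    int     i;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(block, 'x', sizeof(block));

    for (i = 1; i <= NWRITES; i++)
    {
        off_t blkno = rand() % FILEBLOCKS;

        if (pwrite(fd, block, BLOCKSZ, blkno * (off_t) BLOCKSZ) != BLOCKSZ)
        {
            perror("pwrite");
            return 1;
        }
        if (i % REPORT == 0)
        {
            printf("%d writes, %ld s since last report\n",
                   i, (long) (time(NULL) - start));
            fflush(stdout);
            start = time(NULL);
        }
    }
    close(fd);
    return 0;
}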

Anyone out there have a repeatable test case where bgwriter helps?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment      Content-Type   Size
writetest.c     text/x-csrc    1.0 KB
writetest-out   text/plain     1.3 KB

From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-11 11:51:37
Message-ID: 2e78013d0707110451v358cf338uaef9896c4acfd7d3@mail.gmail.com

On 7/11/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> I was able
> to reproduce the phenomenon with a simple C program that writes 8k
> blocks in random order to a fixed size file. I've attached it along with
> output of running it on my test server. The output shows how the writes
> start to periodically block after a while. I was able to reproduce the
> problem on my laptop as well. Can anyone explain what's going on?
>
>
>
I think you are assuming that the next write of the same block won't
use another OS cache block. I doubt that's the way writes are handled
by the kernel. Each write would typically end up being queued up in the
kernel, where each write will have its own copy of the block to be
written. Isn't it?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-11 13:08:11
Message-ID: 20070711130811.GC3294@alvh.no-ip.org

Pavan Deolasee wrote:
> On 7/11/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >
> >I was able
> >to reproduce the phenomenon with a simple C program that writes 8k
> >blocks in random order to a fixed size file. I've attached it along with
> >output of running it on my test server. The output shows how the writes
> >start to periodically block after a while. I was able to reproduce the
> >problem on my laptop as well. Can anyone explain what's going on?
> >
> I think you are assuming that the next write of the same block won't
> use another OS cache block. I doubt that's the way writes are handled
> by the kernel. Each write would typically end up being queued up in the
> kernel, where each write will have its own copy of the block to be
> written. Isn't it?

I don't think so -- at least not on Linux. See
https://ols2006.108.redhat.com/2007/Reprints/zijlstra-Reprint.pdf
where he talks about a patch to the page cache. He describes the
current page cache there; each page is kept on a tree, so a second write
to the same page would "overwrite" the page of the original write.

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Las mujeres son como hondas: mientras más resistencia tienen,
más lejos puedes llegar con ellas" (Jonas Nightingale, Leap of Faith)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter strategies
Date: 2007-07-11 15:16:32
Message-ID: 25511.1184166992@sss.pgh.pa.us

"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com> writes:
> I think you are assuming that the next write of the same block won't
> use another OS cache block. I doubt that's the way writes are handled
> by the kernel. Each write would typically end up being queued up in the
> kernel, where each write will have its own copy of the block to be
> written. Isn't it?

A kernel that worked like that would have a problem doing read(), ie,
it'd have to search to find the latest version of the block. So I'd
expect that most systems would prefer to keep only one in-memory copy
of any given block and overwrite it at write() time. No sane kernel
designer will optimize write() at the expense of read() performance,
especially when you consider that a design as above really pessimizes
write() too --- it does more I/O than is necessary when the same block
is modified repeatedly in a short time.

regards, tom lane