Re: Final background writer cleanup for 8.3

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Final background writer cleanup for 8.3
Date: 2007-08-24 02:13:12
Message-ID: Pine.GSO.4.64.0708232004540.15073@westnet.com
Lists: pgsql-hackers

In the interest of closing work on what's officially titled the "Automatic
adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
think this is at, what I'm working on right now, and see if feedback from
that changes how I submit my final attempt for a useful patch in this area
this week. Hopefully there are enough free eyes to stare at this now to
wrap up a plan for what to do that makes sense and still fits in the 8.3
schedule. I'd hate to see this pushed off to 8.4 without making some
forward progress here after the amount of work done already, particularly
when odds aren't good I'll still be working with this code by then.

Let me start with a summary of the conclusions I've reached based on my
own tests and the set that Heikki did last month (last results at
http://community.enterprisedb.com/bgwriter/ ); Heikki will hopefully chime
in if he disagrees with how I'm characterizing things:

1) In the current configuration, if you have a large setting for
bgwriter_lru_percent and/or a small setting for bgwriter_delay, that can
be extremely wasteful because the background writer will consume
CPU/locking resources scanning the buffer pool needlessly. This problem
should go away.

2) Having backends write their own buffers out does not significantly
degrade performance, as those turn into cached OS writes which generally
execute fast enough to not be a large drag on the backend.

3) Any attempt to scan significantly ahead of the current strategy point
will result in some amount of premature writes that decreases overall
efficiency in cases where the buffer is touched again before it gets
re-used. The more in advance you go, the worse this inefficiency is.
The most efficient way for many workloads is to just let the backends do
all the writes.

4) Tom observed that there's no reason to ever scan the same section of
the pool more than once, because anything that changes a buffer's status
will always make it un-reusable until the strategy point has passed over
it. But because of (3), this does not mean that one should drive forward
constantly trying to lap the buffer pool and catch up with the strategy
point.

5) There hasn't been any definitive proof that the background writer is
helpful at all in the context of 8.3. However, yanking it out altogether
may be premature, as there are some theorized ways that it may be helpful
in real-world situations with more intermittent workloads than are
generally encountered in a benchmarking situation. I personally feel there
is some potential for the BGW to become more useful in the context of the
8.4 release if it starts doing things like adding pages it expects to be
recycled soon onto the free list, which could improve backend efficiency
quite a bit compared to the current situation where each backend normally
runs its own scan. But that's a bit too big to fit into 8.3, I think.

What I'm aiming for here is to have the BGW do as little work as possible,
as efficiently as possible, but not remove it altogether. (2) suggests
that this approach won't decrease performance compared to the current 8.2
situation, where I've seen evidence some are over-tuning to have a very
aggressive BGW scan an enormous amount of the pool each time because they
have resources to burn. Keeping a generally self-tuning background writer
that errs on the lazy side in the codebase satisfies (5). Here is what the
patch I'm testing right now does to try to balance all this out:

A) Counters are added to pg_stat_bgwriter that show how many buffers were
written by the backends and by the background writer, how many times
bgwriter_lru_maxpages was hit, and the total number of buffers allocated.
This at least allows monitoring what's going on as people run their own
experiments. Heikki's results included data using the earlier version of
this patch I assembled (which now conflicts with HEAD; I have an updated
one).

B) bgwriter_lru_percent is removed as a tunable. This eliminates (1).
The idea of scanning a fixed percentage doesn't ever make sense given the
observations above; we scan until we accomplish the cleaning mission
instead.

C) bgwriter_lru_maxpages stays as an absolute maximum number of pages that
can be written in one sweep each bgwriter_delay. This allows easily
turning the writer off altogether by setting it to 0, or limiting how
active it tries to be in situations where (3) is a concern. Admins can
monitor the amount that the max is hit in pg_stat_bgwriter and consider
raising it (or lowering the delay) if it proves to be too limiting. I
think the default needs to be bumped to something more like 100 rather
than the current tiny one before the stock configuration can be considered
"self-tuning" at all.

D) The strategy code gets a "passes" count added to it that serves as a
sort of high-order int for how many times the buffer cache has been looked
over in its entirety.

E) When the background writer starts the LRU cleaner, it checks whether the
strategy point has passed where it last cleaned up to, using the
passes+buf_id "pointer". If so, it just starts cleaning from the strategy
point as it always has. But if it's still ahead it just continues from
there, thus implementing the core of (4)'s insight. It estimates how many
buffers are probably clean in the space between the strategy point and
where it's starting at, based on how far ahead it is combined with
historical data about how many buffers are scanned on average per reusable
buffer found (the exact computation of this number is the main thing I'm
still fiddling with).
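
To make that concrete, here's a rough self-contained sketch of the position
comparison; every name in it is made up for illustration, it is not the
patch code:

#include <stdio.h>
#include <stdbool.h>

typedef struct
{
    long passes;   /* completed laps over the whole buffer pool */
    int  buf_id;   /* position within the current lap */
} SweepPosition;

/* True if 'a' is strictly behind 'b' in clock-sweep order. */
static bool
sweep_pos_behind(SweepPosition a, SweepPosition b)
{
    return (a.passes < b.passes) ||
           (a.passes == b.passes && a.buf_id < b.buf_id);
}

/* Where should this round of LRU cleaning start? */
static SweepPosition
choose_scan_start(SweepPosition last_cleaned, SweepPosition strategy)
{
    if (sweep_pos_behind(last_cleaned, strategy))
        return strategy;      /* the strategy point overtook us: restart there */
    return last_cleaned;      /* still ahead of it: just continue */
}

int
main(void)
{
    SweepPosition cleaned  = {41, 900};    /* where we stopped last time */
    SweepPosition strategy = {42, 100};    /* current clock-sweep position */
    SweepPosition start    = choose_scan_start(cleaned, strategy);

    printf("start at pass %ld, buffer %d\n", start.passes, start.buf_id);
    return 0;
}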

F) A moving average of buffer allocations is used to predict how many
clean buffers are expected to be needed in the next delay cycle. The
original patch from Itagaki doubled the recent allocations to pad this
out; (3) suggests that's too much.

G) Scan the buffer pool until either
--Enough reusable buffers have been located or written out to fill the
upcoming allocation need, taking into account the estimate from (E); this
is the normal expected way the scan will terminate.
--We've written bgwriter_lru_maxpages
--We "lap" and catch the strategy point

In addition to removing a tunable and making the remaining two less
critical, one of my hopes here is that the more efficient way this scheme
operates will allow using much smaller values for bgwriter_delay than have
been practical in the current codebase, which may ultimately have its own
value.

That's what I've got working here now; it still needs some more tweaking and
testing before I'm done with the code, but there's not much left. The main
problem I foresee is that this approach is moderately complicated, adding a
lot of new code and regular+static variables, for something that's not
really proven to be valuable. I will not be surprised if my patch is
rejected on that basis. That's why I wanted to get the big picture
painted in this message while I finish up the work necessary to submit it,
'cause if the whole idea is doomed anyway I might as well stop now.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 03:09:35
Message-ID: 16801.1187924975@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> In the interest of closing work on what's officially titled the "Automatic
> adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
> think this is at ...

> 2) Having backends write their own buffers out does not significantly
> degrade performance, as those turn into cached OS writes which generally
> execute fast enough to not be a large drag on the backend.

[ itch... ] That assumption scares the heck out of me. It is doubtless
true in a lightly loaded system, but once the kernel is under any kind
of memory pressure I think it's completely wrong. I think designing the
system around this assumption will lead to something that performs great
as long as you're not pushing it hard.

However, your actual specific proposals do not seem to rely on this
assumption extensively, so I wonder why you are emphasizing it.

The only parts of your specific proposals that I find a bit dubious are

> ... It estimates how many
> buffers are probably clean in the space between the strategy point and
> where it's starting at, based on how far ahead it is combined with
> historical data about how many buffers are scanned on average per reusable
> buffer found (the exact computation of this number is the main thing I'm
> still fiddling with).

If you're still fiddling with it then you probably aren't going to get
it right in the next few days. Perhaps you should think about whether
this can be left out entirely for 8.3 and revisited later.

> F) A moving average of buffer allocations is used to predict how many
> clean buffers are expected to be needed in the next delay cycle. The
> original patch from Itagaki doubled the recent allocations to pad this
> out; (3) suggests that's too much.

Maybe you need to put back the eliminated tuning parameter in the form
of the scaling factor to be used here. I don't like 1.0, mainly because
I don't believe your assumption (2). I'm willing to concede that 2.0
might be too much, but I don't know where in between is the sweet spot.

Also, we might need a tuning parameter for the reaction speed of the
moving average --- what are you using for that?

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 05:17:46
Message-ID: Pine.GSO.4.64.0708240033520.20246@westnet.com
Lists: pgsql-hackers

On Thu, 23 Aug 2007, Tom Lane wrote:

> It is doubtless true in a lightly loaded system, but once the kernel is
> under any kind of memory pressure I think it's completely wrong.

The fact that so many tests I've done or seen get maximum throughput in
terms of straight TPS with the background writer turned completely off is
why I stated that so explicitly. I understand what you're saying in terms
of memory pressure, all I'm suggesting is that the empirical tests suggest
the current background writer even with moderate improvements doesn't
necessarily help when you get there. If writes are blocking, whether the
background writer does them slightly ahead of time or whether the backend
does them itself doesn't seem to matter very much. On a heavily loaded
system, your throughput is bottlenecked at the disk either way--and
therefore it's all the more important in those cases to never do a write
until you absolutely have to, lest it be wasted.

> If you're still fiddling with it then you probably aren't going to get
> it right in the next few days.

The implementation is fine most of the time, I've just found some corner
cases in testing I'd like to improve stability on (mainly how best to
handle when no buffers were allocated during the previous period, some
small concerns about the first pass over the pool). What I'm thinking of
doing is taking a couple of my assumptions/techniques and turning them
into things that can be switched on or off with a #define, so that the parts
of the code that people don't like are easy to identify and pull out.
I've already done that with one section.

> Maybe you need to put back the eliminated tuning parameter in the form
> of the scaling factor to be used here. I don't like 1.0, mainly because
> I don't believe your assumption (2). I'm willing to concede that 2.0
> might be too much, but I don't know where in between is the sweet spot.

That would be easy to implement and add some flexibility, so I'll do that.
bgwriter_lru_percent becomes bgwriter_lru_multiplier, possibly to be
renamed later if someone comes up with a snappier name.

> Also, we might need a tuning parameter for the reaction speed of the
> moving average --- what are you using for that?

It's hard-coded at 16 samples. It seemed stable around 10-20; I picked 16
so that it might optimize usefully to a bit shift. On the reaction side,
it actually reacts faster than that--if the most recent allocation is
greater than the average, it uses that instead. The number of samples has
more of an impact on the trailing side, and accordingly isn't that
critical.
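
For illustration only, a tiny standalone program along these lines shows the
kind of smoothing behavior I mean; the names and the exponential-style
16-sample average are stand-ins I made up, not the actual patch code:

#include <stdio.h>

#define SMOOTHING_SAMPLES 16   /* the hard-coded sample count mentioned above */

static double smoothed_alloc = 0.0;

/*
 * Feed in the allocations seen during the last bgwriter_delay cycle and get
 * back the estimate to use for the next one.  The average decays slowly, but
 * if the most recent cycle exceeds it, that value is used directly so the
 * estimate reacts quickly on the way up.
 */
static double
update_alloc_estimate(int recent_alloc)
{
    smoothed_alloc += ((double) recent_alloc - smoothed_alloc)
                      / SMOOTHING_SAMPLES;

    if (recent_alloc > smoothed_alloc)
        return (double) recent_alloc;
    return smoothed_alloc;
}

int
main(void)
{
    int samples[] = {20, 22, 25, 300, 40, 30, 25, 20};

    for (int i = 0; i < 8; i++)
        printf("recent=%3d -> estimate %.1f\n",
               samples[i], update_alloc_estimate(samples[i]));
    return 0;
}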

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 12:02:10
Message-ID: 87lkc12la5.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Greg Smith <gsmith(at)gregsmith(dot)com> writes:
>> In the interest of closing work on what's officially titled the "Automatic
>> adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
>> think this is at ...
>
>> 2) Having backends write their own buffers out does not significantly
>> degrade performance, as those turn into cached OS writes which generally
>> execute fast enough to not be a large drag on the backend.
>
> [ itch... ] That assumption scares the heck out of me. It is doubtless
> true in a lightly loaded system, but once the kernel is under any kind
> of memory pressure I think it's completely wrong. I think designing the
> system around this assumption will lead to something that performs great
> as long as you're not pushing it hard.

I think Heikki's experiments showed it wasn't true for at least some kinds of
heavy loads. However I would expect it to depend heavily on just what kind of
load the machine is under. At least if it's busy writing then I would expect
it to throttle writes. Perhaps in TPCC there are enough reads to throttle the
write rate to something the kernel can buffer.

> If you're still fiddling with it then you probably aren't going to get
> it right in the next few days. Perhaps you should think about whether
> this can be left out entirely for 8.3 and revisited later.

How does all of this relate to your epiphany that we should just have the
bgwriter be a clock hand running a full sweep ahead, without retracing its
steps?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 12:41:19
Message-ID: 46CED1EF.8010707@enterprisedb.com
Lists: pgsql-hackers

Gregory Stark wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> Greg Smith <gsmith(at)gregsmith(dot)com> writes:
>>> In the interest of closing work on what's officially titled the "Automatic
>>> adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
>>> think this is at ...
>>> 2) Having backends write their own buffers out does not significantly
>>> degrade performance, as those turn into cached OS writes which generally
>>> execute fast enough to not be a large drag on the backend.
>> [ itch... ] That assumption scares the heck out of me. It is doubtless
>> true in a lightly loaded system, but once the kernel is under any kind
>> of memory pressure I think it's completely wrong. I think designing the
>> system around this assumption will lead to something that performs great
>> as long as you're not pushing it hard.
>
> I think Heikki's experiments showed it wasn't true for at least some kinds of
> heavy loads. However I would expect it to depend heavily on just what kind of
> load the machine is under. At least if it's busy writing then I would expect
> it to throttle writes. Perhaps in TPCC there are enough reads to throttle the
> write rate to something the kernel can buffer.

I ran a bunch of DBT-2 tests in different configurations, as well as simple
single-threaded tests like random DELETEs on a table with an index, a steady
rate of INSERTs to a table with no indexes, and bursts of INSERTs with
different burst sizes and delays between them. I tried the tests with
different bgwriter settings, including turning it off and with the patch
applied, and with different shared_buffers settings.

I was not able to find a test where turning bgwriter on performed better
than turning it off.

If anyone out there has a repeatable test case where bgwriter does help,
I'm all ears. The theory of moving the writes out of the critical path
does sound reasonable, so I'm sure there is a test case to demonstrate the
effect, but it seems to be pretty darn hard to find.

The cold, rational side of me says we need a test case to show the
benefit, or if one can't be found, we should remove bgwriter altogether.
The emotional side of me tells me we can't go that far. A reasonable
compromise would be to apply the autotuning patch on the grounds that it
removes a GUC variable that's next to impossible to tune right, even
though we can't show a performance benefit compared to bgwriter=off. And
it definitely makes sense not to restart the scan from the clock sweep
hand on each bgwriter round; as Tom pointed out, it's a waste of time.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 15:24:53
Message-ID: 235.1187969093@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> How does all of this relate to your epiphany that we should just have
> the bgwriter be a clock hand running a full sweep ahead, without retracing
> its steps?

Well, it's still clearly silly for the bgwriter to rescan buffers it's
already cleaned. But I think we've established that the "keep a lap
ahead" idea goes too far, because it writes dirty buffers speculatively,
long before they actually are needed, and there's just too much chance
of the writes being wasted due to re-dirtying. When proposing that
idea I had supposed that wasted writes wouldn't hurt much, but that's
evidently wrong.

Heikki makes a good point nearby that if you are not disk write
bottlenecked then it's perfectly OK for backends to issue writes,
as it'll just result in a transfer to kernel cache space, and no actual
wait for I/O. And if you *are* write-bottlenecked, then the last thing
you want is any wasted writes. So a fairly conservative strategy that
does bgwrites only "just in time" seems like what we need to aim at.

I think the moving-average-of-requests idea, with a user-adjustable
scaling factor, is the best we have at the moment.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 15:53:37
Message-ID: Pine.GSO.4.64.0708241147001.27606@westnet.com
Lists: pgsql-hackers

On Fri, 24 Aug 2007, Tom Lane wrote:

> Heikki makes a good point nearby that if you are not disk write
> bottlenecked then it's perfectly OK for backends to issue writes, as
> it'll just result in a transfer to kernel cache space, and no actual
> wait for I/O. And if you *are* write-bottlenecked, then the last thing
> you want is any wasted writes.

Which is the same thing I was saying in my last message, so I'm content
we're all on the same page here now--and that the contents of that page
are now clear in the archives for when this comes up again.

> So a fairly conservative strategy that does bgwrites only "just in time"
> seems like what we need to aim at.

And that's exactly what I've been building. Feedback and a general feeling
that I'm doing the right thing are appreciated; I'm returning to the code with
the scaling factor as a new tunable but with the plan otherwise unchanged.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 15:59:35
Message-ID: 46CEBA17.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Fri, Aug 24, 2007 at 7:41 AM, in message
<46CED1EF(dot)8010707(at)enterprisedb(dot)com>, "Heikki Linnakangas"
<heikki(at)enterprisedb(dot)com> wrote:
> I was not able to find a test where turning bgwriter on performed better
> than turning it off.

Any tests which focus just on throughput don't address the problems which
caused us so much grief. What we need is some sort of test which generates
a moderate write load in the background, while paying attention to the
response time of a large number of read-only queries. The total load should
not be enough to saturate the I/O bandwidth overall if applied evenly.

The problem which the background writer has solved for us is that we have
three layers of caching (PostgreSQL, OS, and RAID controller), each with its
own delay before writing; when something like fsync triggers a cascade from
one cache to the next, the write burst bottlenecks the I/O, and reads exceed
acceptable response times. The two approaches which seem to prevent this
problem are to disable all OS delays in writing dirty pages, or to minimize
the delays in PostgreSQL writing dirty pages.

Throughput is not everything. Response time matters.

> If anyone out there has a repeatable test case where bgwriter does help,
> I'm all ears.

All we have is a production system where PostgreSQL failed to perform at a
level acceptable to the users without it.

> The cold, rational side of me says we need a test case to show the
> benefit, or if one can't be found, we should remove bgwriter altogether.

I would be fine with that if I could configure the back end to always write a
dirty page to the OS when it is written to shared memory. That would allow
Linux and XFS to do their job in a timely manner, and avoid this problem.

I know we're doing more in 8.3 to move this from the OS's realm into
PostgreSQL code, but until I have a chance to test that, I want to make sure
that what has been proven to work for us is not broken.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 17:37:53
Message-ID: 2925.1187977073@sss.pgh.pa.us
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Any tests which focus just on throughput don't address the problems which
> caused us so much grief.

This is a good point: a steady-state load is either going to be in the
regime where you're not write-bottlenecked, or the one where you are;
and either way the bgwriter isn't going to look like it helps much.

The real use of the bgwriter, perhaps, is to smooth out a varying load
so that you don't get pushed into the write-bottlenecked mode during
spikes. We've already had to rethink the details of how we made that
happen with respect to preventing checkpoints from causing I/O spikes.
Maybe LRU buffer flushes need a rethink too.

Right at the moment I'm still comfortable with what Greg is doing, but
there's an argument here for a more aggressive scaling factor on
number-of-buffers-to-write than he thinks. Still, as long as we have a
GUC variable in there, tuning should be possible.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-24 22:47:13
Message-ID: Pine.GSO.4.64.0708241807500.28499@westnet.com
Lists: pgsql-hackers

On Fri, 24 Aug 2007, Kevin Grittner wrote:

> I would be fine with that if I could configure the back end to always write a
> dirty page to the OS when it is written to shared memory. That would allow
> Linux and XFS to do their job in a timely manner, and avoid this problem.

You should take a look at the "io storm on checkpoints" thread on the
pgsql-performance(at)postgresql(dot)org started by Dmitry Potapov on 8/22 if you
aren't on that list. He was running into the same problem as you (and me
and lots of other people) and had an interesting resolution based on
tuning the Linux kernel so that it basically stopped caching writes.
What you suggest here would be particularly inefficient because of how
much extra I/O would happen on the index blocks involved in the active
tables.

> I know we're doing more in 8.3 to move this from the OS's realm into
> PostgreSQL code, but until I have a chance to test that, I want to make sure
> that what has been proven to work for us is not broken.

The background writer code that's in 8.2 can be configured as a big
sledgehammer that happens to help in this area while doing large amounts
of collateral damage via writing things prematurely. Some of the people
involved in the 8.3 code rewrite and testing were having the same problem
as you on a similar scale--I recall Greg Stark commenting that he had a
system that was freezing for a full 30 seconds the way yours was.

I would be extremely surprised to find that the code that's already in 8.3
isn't a big improvement over what you're doing now based on how much it
has helped others running into this issue. And much of the code that
you're relying on now to help with the problem (the all-scan portion of
the BGW) has already been removed as part of that.

Switching to my Agent Smith voice: "No Kevin, your old background writer
is already dead". You'd have to produce some really unexpected and
compelling results during the beta period for it to get put back again.
The work I'm still doing here is very much fine-tuning in comparison to
what's already been committed into 8.3.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-26 04:07:29
Message-ID: 46D0B630.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Fri, Aug 24, 2007 at 5:47 PM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0708241807500(dot)28499(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> On Fri, 24 Aug 2007, Kevin Grittner wrote:
>
>> I would be fine with that if I could configure the back end to always write
> a
>> dirty page to the OS when it is written to shared memory. That would allow
>> Linux and XFS to do their job in a timely manner, and avoid this problem.
>
> You should take a look at the "io storm on checkpoints" thread on the
> pgsql-performance(at)postgresql(dot)org started by Dmitry Potapov on 8/22 if you
> aren't on that list. He was running into the same problem as you (and me
> and lots of other people) and had an interesting resolution based on
> tuning the Linux kernel so that it basically stopped caching writes.

I saw it. I think that I'd rather have a write-through cache in PostgreSQL
than give up OS caching entirely. The problem seems to be caused by the
cascade from one cache to the next, so I can easily believe that disabling
the delay on either one solves the problem.

> What you suggest here would be particularly inefficient because of how
> much extra I/O would happen on the index blocks involved in the active
> tables.

I've certainly seen that assertion on these lists often. I don't think I've
yet seen any evidence that it's true. When I made the background writer
more aggressive, there was no discernible increase in disk writes at the OS
level (much less from controller cache to the drives). This may not be true
with some of the benchmark software, but in our environment there tends to
be a lot of activity on a single court case, and then they're done with it.
(I spent some time looking at this to tune our heuristics for generating
messages on our interfaces to business partners.)

>> I know we're doing more in 8.3 to move this from the OS's realm into
>> PostgreSQL code, but until I have a chance to test that, I want to make sure
>> that what has been proven to work for us is not broken.
>
> The background writer code that's in 8.2 can be configured as a big
> sledgehammer that happens to help in this area while doing large amounts
> of collateral damage via writing things prematurely.

Again -- to the OS cache, where it sits and accumulates other changes until
the page settles.

> I would be extremely surprised to find that the code that's already in 8.3
> isn't a big improvement over what you're doing now based on how much it
> has helped others running into this issue.

I'm certainly hoping that it will be. I'm not moving to it for production
until I've established that as a fact, however.

> And much of the code that
> you're relying on now to help with the problem (the all-scan portion of
> the BGW) has already been removed as part of that.
>
> Switching to my Agent Smith voice: "No Kevin, your old background writer
> is already dead". You'd have to produce some really unexpected and
> compelling results during the beta period for it to get put back again.

If I fail to get resources approved to test during beta, this could become
an issue later, when we do get around to testing it. (There's exactly zero
chance of us moving to something which so radically changes a problem area
for us without serious testing.)

For what it's worth, the background writer settings I'm using weren't
arrived at entirely randomly. I monitored I/O during episodes of the
database freezing up, and looked at how many writes per second were going
through. I then reasoned that there was no good reason NOT to push data out
from PostgreSQL to the OS at that speed. I split the writes between the LRU
and full cache aspects of the background writer, with heavier weight given
to getting all dirty pages pushed out to the OS cache so that they could
start to age through the OS timers. (While the raw numbers totaled to the
peak write load, I figured I was actually allowing some slack, since there
was the percentage limit and the two scans would often cover the same
ground, not to mention the assumption that the interval was a sleep time
from the end of one run to the start of the next.) Since it was a
production system, I made incremental changes each day, and each day the
problem became less severe. At the point where I finally set it to my
calculated numbers, we stopped seeing the problem.

I'm not entirely convinced that it's a sound assumption that we should
always try to keep some dirty buffers in the cache on the off chance that
we might be smarter than the OS/FS/RAID controller algorithms about when to
write them. That said, the 8.3 changes sound as though they are likely to
reduce the problems with I/O-related freezes.

Is it my imagination, or are we coming pretty close to the point where we
could accommodate the oft-requested feature of dealing directly with a raw
volume, rather than going through the file system at all?

-Kevin


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-26 05:51:52
Message-ID: Pine.GSO.4.64.0708260115400.14470@westnet.com
Lists: pgsql-hackers

On Sat, 25 Aug 2007, Kevin Grittner wrote:

> in our environment there tends to be a lot of activity on a single court
> case, and then they're done with it.

I submitted a patch to 8.3 that lets contrib/pg_buffercache show the
usage_count data for each of the buffers. It's actually pretty tiny; you
might consider applying just that patch to your 8.2 production system and
installing the module (as an add-in, it's easy enough to back out). See
http://archives.postgresql.org/pgsql-patches/2007-03/msg00555.php

With that patch in place, try a query like

select usagecount,count(*),isdirty from pg_buffercache group by
isdirty,usagecount order by isdirty,usagecount;

That lets you estimate how much waste would be involved for your
particular data if you wrote it out early--the more high usage_count
blocks in the cache, the worse the potential waste. With the tests I
was running, the hot index blocks were pegged at the maximum count allowed
(5) and they were taking up around 20% of the buffer cache. If those were
written out every time they were touched, it would be a bad scene.

It sounds like your system has a lot of data where the usage_count would
be much lower on average, which would explain why you've been so
successful with resolving it using the background writer. That's a
slightly easier problem to solve than the one I've been banging on.

> I'm not moving to it for production until I've established that as a
> fact, however.

And you'd be crazy to do otherwise.

> I'm not entirely convinced that it's a sound assumption that we should
> always try to keep some dirty buffers in the cache on the off chance that
> we might be smarter than the OS/FS/RAID controller algorithms about when to
> write them.

All I can say is that every time someone has tried to tune the code toward
writing that much more proactively, the results haven't seemed like an
improvement. I wouldn't characterize it as an assumption--it's a theory
that seems to hold every time it's tested. At least on the kind of Linux
systems people put into production right now (which often have relatively
old kernels), the OS is not as smart as everyone would like it to be in
this area.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-26 08:41:52
Message-ID: 87k5riy9f3.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:

> Is it my imagination, or are we coming pretty close to the point where we
> could accommodate the oft-requested feature of dealing directly with a raw
> volume, rather than going through the file system at all?

Or O_DIRECT.

I think the answer is that we've built enough intelligence that it's feasible
from the memory management side.

However there's another side to that problem: a) you would either need to have
multiple bgwriters or have the bgwriter use AIO, since having only one would
serialize your I/O, which would be a big hit to I/O bandwidth; and b) you need
some solution to handle preemptively reading ahead for sequential reads.

I don't think we're terribly far off from being able to do it. The traditional
response has always been that our time is better spent doing database stuff
rather than reimplementing what the OS people are doing better. And also that
the OS has more information about the hardware and so can schedule I/O more
efficiently.

However there's also a strong counter-argument that we have more information
about what we're intending to use the data for and how urgent any given I/O
is.

I'm not sure how that balancing act ends. I have a hunch but I guess it would
take experiments to get a real answer. And the answer might be very different
on different OSes and hardware configurations.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-26 19:15:24
Message-ID: 46D18AFB.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Sun, Aug 26, 2007 at 12:51 AM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0708260115400(dot)14470(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> On Sat, 25 Aug 2007, Kevin Grittner wrote:
>
>> in our environment there tends to be a lot of activity on a single court
>> case, and then they're done with it.
>
> I submitted a patch to 8.3 that lets contrib/pg_buffercache show the
> usage_count data for each of the buffers. It's actually pretty tiny; you
> might consider applying just that patch to your 8.2 production system and
> installing the module (as an add-in, it's easy enough to back out). See
> http://archives.postgresql.org/pgsql-patches/2007-03/msg00555.php
>
> With that patch in place, try a query like
>
> select usagecount,count(*),isdirty from pg_buffercache group by
> isdirty,usagecount order by isdirty,usagecount;
>
> That lets you estimate how much waste would be involved for your
> particular data if you wrote it out early--the more high usage_count
> blocks in the cache, the worse the potential waste. With the tests I
> was running, the hot index blocks were pegged at the maximum count allowed
> (5) and they were taking up around 20% of the buffer cache. If those were
> written out every time they were touched, it would be a bad scene.

Just to be sure that I understand, are you saying it would be a bad scene if
the physical writes happened, or that the overhead of pushing them out to
the OS would be crippling?

Anyway, I've installed this on the machine that I proposed using for the
tests. It is our older generation of central servers, soon to be put to
some less critical use as we bring the newest generation on line and the
current "new" machines fall back to secondary roles in our central server
pool. It is currently a replication target for the 72 county-based circuit
court systems, but is just there for ad hoc queries against statewide data;
there's no web load present.

Running the suggested query a few times, with the samples separated by a few
seconds each, I got the following. (The Sunday afternoon replication load
is unusual in that there will be very few users entering any data, just a
trickle of input from our law enforcement interfaces, but a lot of the
county middle tiers will have noticed that there is idle time and that it
has been more than 23 hours since the start of the last synchronization of
county data against the central copies, and so will be doing massive selects
to look for and report any "drift".) I'll check again during normal weekday
load.

 usagecount | count | isdirty
------------+-------+---------
          0 |  8711 | f
          1 |  9394 | f
          2 |  1188 | f
          3 |   869 | f
          4 |   160 | f
          5 |   157 | f
            |     1 |
(7 rows)

 usagecount | count | isdirty
------------+-------+---------
          0 |  9033 | f
          1 |  8849 | f
          2 |  1623 | f
          3 |   619 | f
          4 |   181 | f
          5 |   175 | f
(6 rows)

 usagecount | count | isdirty
------------+-------+---------
          0 |  9093 | f
          1 |  6702 | f
          2 |  2267 | f
          3 |   602 | f
          4 |   428 | f
          5 |  1388 | f
(6 rows)

 usagecount | count | isdirty
------------+-------+---------
          0 |  6556 | f
          1 |  7188 | f
          2 |  3648 | f
          3 |  2074 | f
          4 |   720 | f
          5 |   293 | f
            |     1 |
(7 rows)

 usagecount | count | isdirty
------------+-------+---------
          0 |  6569 | f
          1 |  7855 | f
          2 |  3942 | f
          3 |  1181 | f
          4 |   532 | f
          5 |   401 | f
(6 rows)

I also ran the query mentioned in the cited email about 100 times, with 52
instead of 32. (I guess I have a bigger screen.) It would gradually go
from entirely -1 values to mostly -2 with a few -1, then gradually back to
all -1. Repeatedly. I never saw anything other than -1 or -2. Of course
this is with our aggressive background writer settings.

This contrib module seems pretty safe, patch and all. Does anyone think
there is significant risk to slipping it into the 8.2.4 database where we
have massive public exposure on the web site handling 2 million hits per
day?

By the way, Greg, lest my concerns about this be misinterpreted -- I do
really appreciate the effort you've put into analyzing this and tuning the
background writer. I just want to be very cautious here, and I do get
downright alarmed at some of the posts which seem to deny the reality of the
problems which many have experienced with write spikes choking off reads to
the point of significant user impact. I also think we need to somehow
develop a set of tests which report maximum response time on (what should
be) fast queries while the database is under different loads, so that those
of us for whom reliable response time is more important than maximum overall
throughput are protected from performance regressions.

-Kevin


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-26 21:16:15
Message-ID: Pine.GSO.4.64.0708261637030.3811@westnet.com
Lists: pgsql-hackers

On Sun, 26 Aug 2007, Kevin Grittner wrote:

> usagecount | count | isdirty
> ------------+-------+---------
> 0 | 8711 | f
> 1 | 9394 | f
> 2 | 1188 | f
> 3 | 869 | f
> 4 | 160 | f
> 5 | 157 | f

Here's a typical sample from your set. Notice how you've got very few
buffers with a high usage count. This is a situation the background
writer is good at working with. Either the old or new work-in-progress
LRU writer can aggressively pound away at any of the buffers with a 0
usage count shortly after they get dirty, and that won't be inefficient
because there aren't large numbers of other clients using them.

Compare against this other sample:

> usagecount | count | isdirty
> ------------+-------+---------
> 0 | 9093 | f
> 1 | 6702 | f
> 2 | 2267 | f
> 3 | 602 | f
> 4 | 428 | f
> 5 | 1388 | f

Notice that you have a much larger number of buffers where the usage count
is 4 or 5. The all-scan part of the 8.2 background writer will waste a
lot of writes when you have a profile that's more like this. If there
have been 4+ client backends touching the buffer recently, you'd be crazy
to write it out right now if you could instead be focusing on banging out
the ones where the usage count is 0. The 8.2 background writer would
write them out anyway, which meant that when you hit a checkpoint both the
OS and the controller cache were filled with such buffers before you even
started writing the checkpoint data. The new setup in 8.3 only worries
about the high usage count buffers when you hit a checkpoint, at which
point it streams them out over a longer, adjustable period (so as not to
spike the I/O more than necessary and block your readers) than the 8.2
design, which just dumped them all immediately.

> Just to be sure that I understand, are you saying it would be a bad scene if
> the physical writes happened, or that the overhead of pushing them out to
> the OS would be crippling?

If you have a lot of buffers where the usage_count data was high, it would
be problematic to write them out every time they were touched; odds are
good somebody else is going to dirty them again soon enough so why bother.
On your workload, that doesn't seem to be the case. But that is the
situation on some other test workloads, and balancing for that situation
has been central to the parts of the redesign I've been injecting
suggestions into. One of the systems I was tormented by had the
usagecount of 5 for >20% of the buffers in the cache under heavy load, and
had a physical write been executed every time one of those was touched
that would have been crippling (even if the OS was smart enough to cache and
therefore make redundant some of the writes, which is behavior I would
prefer not to rely on).

> This contrib module seems pretty safe, patch and all. Does anyone think
> there is significant risk to slipping it into the 8.2.4 database where we
> have massive public exposure on the web site handling 2 million hits per
> day?

I think it's fairly safe, and my patch was pretty small; just exposing
some data that nobody had been looking at before. Think how much easier
your life would have been when doing your earlier tuning if you were
looking at the data in these terms. Just be aware that running the query
is itself intensive and causes its own tiny hiccup in throughput every
time it executes, so you may want to consider this more of a snapshot you
run periodically to learn more about your data rather than something you
do very regularly.

> I also think we need to somehow develop a set of tests which report
> maximum response time on (what should be) fast queries while the
> database is under different loads, so that those of us for whom reliable
> response time is more important than maximum overall throughput are
> protected from performance regressions.

My guess is that the DBT2 tests that Heikki has been running are more
complicated than you think they are; there are response time guarantee
requirements in there as well as the throughput numbers. The tests that I
run (which I haven't been publishing yet but will be with the final patch
soon) also report worst-case and 90-th percentile latency numbers as well
as TPS. A "regression" that improved TPS at the expense of those two
would not be considered an improvement by anyone involved here.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-27 00:35:29
Message-ID: 46D1D601.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Sun, Aug 26, 2007 at 4:16 PM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0708261637030(dot)3811(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> On Sun, 26 Aug 2007, Kevin Grittner wrote:
>
>> usagecount | count | isdirty
>> ------------+-------+---------
>> 0 | 9093 | f
>> 1 | 6702 | f
>> 2 | 2267 | f
>> 3 | 602 | f
>> 4 | 428 | f
>> 5 | 1388 | f
>
> Notice that you have a much larger number of buffers where the usage count
> is 4 or 5. The all-scan part of the 8.2 background writer will waste a
> lot of writes when you have a profile that's more like this. If there
> have been 4+ client backends touching the buffer recently, you'd be crazy
> to write it out right now if you could instead be focusing on banging out
> the ones where the usage count is 0.

Seems to me I'd be crazy to be writing out anything. Nothing's dirty.

In fact, I ran a simple query to count dirty pages once per second for a
minute, and only three samples showed any pages dirty. The highest count was 5.
Again, this was Sunday afternoon, which is not traditionally a busy time for
the courts. I'll try to get some more meaningful numbers tomorrow.

> One of the systems I was tormented by had the
> usagecount of 5 for >20% of the buffers in the cache under heavy load, and
> had a physical write been executed every time one of those was touched
> that would have been crippling (even if the OS was smart enough to cache and
> therefore make redundant some of the writes, which is behavior I would
> prefer not to rely on).

Why is that?

> The tests that I
> run (which I haven't been publishing yet but will be with the final patch
> soon) also report worst-case and 90-th percentile latency numbers as well
> as TPS. A "regression" that improved TPS at the expense of those two
> would not be considered an improvement by anyone involved here.

Have you been able to create a test case which exposes the write-spike
problem under 8.2.4?

By the way, the 90th percentile metric isn't one I'll care a lot about.
In our environment any single instance of a "fast" query running slow is
considered a problem, and my job is to keep those users happy.

-Kevin


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-27 00:52:40
Message-ID: 87fy25yf1j.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Greg Smith" <gsmith(at)gregsmith(dot)com> writes:

> On Sun, 26 Aug 2007, Kevin Grittner wrote:
>
>> I also think we need to somehow develop a set of tests which report maximum
>> response time on (what should be) fast queries while the database is under
>> different loads, so that those of us for whom reliable response time is more
>> important than maximum overall throughput are protected from performance
>> regressions.
>
> My guess is that the DBT2 tests that Heikki has been running are more
> complicated than you think they are; there are response time guarantee
> requirements in there as well as the throughput numbers. The tests that I run
> (which I haven't been publishing yet but will be with the final patch soon)
> also report worst-case and 90-th percentile latency numbers as well as TPS. A
> "regression" that improved TPS at the expense of those two would not be
> considered an improvement by anyone involved here.

TPCC requires that the 90th percentile response time be under 5s for most
transactions. It also requires that the average be less than the 90th
percentile which helps rule out circumstances where the longest 10% response
times are *much* longer than 5s.

However in practice neither of those requirements really rules out some pretty
bad behaviour as long as it's rare enough. Before the distributed checkpoint
patch went in we were finding 60s of zero activity at every checkpoint. But
there were so few transactions affected that in the big picture it didn't
impact the 90th percentile. It didn't even affect the 95th percentile. I think
you had to look at the 99th percentile before it even began to impact the
results.

I can't really imagine a web site operator being happy if he was told that
only 1% of users' clicks resulted in a browser timeout...

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "Kevin Grittner" <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-27 16:31:29
Message-ID: 46D2B610.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Sun, Aug 26, 2007 at 7:35 PM, in message
<46D1D601(dot)EE98(dot)0025(dot)0(at)wicourts(dot)gov>, "Kevin Grittner"
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>>> On Sun, Aug 26, 2007 at 4:16 PM, in message
> <Pine(dot)GSO(dot)4(dot)64(dot)0708261637030(dot)3811(at)westnet(dot)com>, Greg Smith
> <gsmith(at)gregsmith(dot)com> wrote:
> I'll try to get some more meaningful numbers tomorrow.

Well, I ran the query against the production web server 40 times, and the highest number I got for usagecount 5 dirty pages was in this sample:

 usagecount | count | isdirty
------------+-------+---------
          0 |  7358 | f
          1 |  7428 | f
          2 |  1938 | f
          3 |  1311 | f
          4 |  1066 | f
          5 |  1097 | f
          1 |    87 | t
          2 |    62 | t
          3 |    31 | t
          4 |    11 | t
          5 |    86 | t
            |     5 |
(12 rows)

Most samples looked something like this:

 usagecount | count | isdirty
------------+-------+---------
          0 |  7981 | f
          1 |  6584 | f
          2 |  1975 | f
          3 |  1063 | f
          4 |  1366 | f
          5 |  1294 | f
          0 |     5 | t
          1 |    83 | t
          2 |    60 | t
          3 |    19 | t
          4 |    21 | t
          5 |    28 | t
            |     1 |
(13 rows)

The system can comfortably write out about 4,000 pages per second as long as the write cache doesn't get swamped, so in the worst case I caught (277 dirty pages in that first sample, or about 277/4000 of a second) it had roughly 69 ms worth of work to do, if they were all physical writes (which, of course, is highly unlikely).

From shortly afterwards, possibly of interest:

postgres(at)ATHENA:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 3 20 402248 0 10538028 0 0 0 1 1 2 21 4 55 19
2 4 20 403116 0 10538028 0 0 5180 384 2233 9599 24 5 50 21
3 6 20 402868 0 10532888 0 0 4844 512 2841 14054 44 6 31 19
7 10 20 397908 0 10534944 0 0 6768 465 2674 11995 40 6 26 28
4 15 20 398016 0 10534944 0 0 3344 4703 2297 10578 34 7 13 46
0 22 20 405456 0 10534944 0 0 2464 4192 1785 6167 20 3 21 56
14 19 20 401852 0 10538028 0 0 3680 4704 2474 11779 29 5 12 54
17 13 20 401728 0 10532888 0 0 5504 1945 2554 21490 35 8 10 47
3 10 20 408176 0 10530832 0 0 11380 553 3907 15463 67 13 5 15
4 4 20 405572 0 10535972 0 0 8708 981 2904 12051 26 7 34 33
1 5 20 403588 0 10535972 0 0 5924 464 2589 12194 26 5 45 23
4 7 20 410780 0 10529804 0 0 6284 1163 2674 11830 33 8 35 24
3 13 20 402596 0 10526720 0 0 2424 6598 2441 10332 40 7 11 42
7 16 20 400736 0 10528776 0 0 3928 6784 2453 9852 26 6 26 42
19 14 20 405308 0 10524664 0 0 2272 4708 2208 8583 27 5 19 49
9 17 20 404580 0 10527748 0 0 7156 3560 3185 13203 55 11 3 32
1 11 20 406192 0 10531860 0 0 5112 3647 2758 11362 31 6 26 37
3 13 20 404464 0 10531860 0 0 4856 3426 2342 11077 24 5 35 36
2 13 20 403968 0 10530832 0 0 5308 4634 2762 15778 34 7 22 36
4 12 20 403472 0 10534944 0 0 2996 3766 2090 9331 20 4 34 42
0 5 20 412648 0 10522608 0 0 2364 5187 1816 5194 18 5 56 22
4 13 20 415376 0 10519524 0 0 2836 6172 1929 5075 25 6 26 43
27 16 20 413880 0 10522608 0 0 7892 2340 3325 19769 52 8 10 30
7 7 20 402340 0 10530832 0 0 7600 712 3511 16486 45 8 20 26
4 9 20 403704 0 10531860 0 0 7708 830 3133 16164 43 11 22 24
5 6 20 408416 0 10529804 0 0 6900 814 2703 10806 31 7 39 24
8 6 20 401844 0 10532888 0 0 6884 632 2993 13792 37 7 29 27
13 3 20 398868 0 10534944 0 0 7732 744 3443 14580 63 9 8 19
5 6 20 403580 0 10533916 0 0 6724 623 2905 11937 37 7 34 22
3 7 20 400728 0 10529804 0 0 6924 712 2746 12085 35 7 37 21
0 7 20 408664 0 10526720 0 0 6536 344 2562 10555 27 6 44 24
5 1 20 407796 0 10527748 0 0 4628 1000 2653 13092 41 7 37 15
7 9 20 400480 0 10529804 0 0 3364 744 2326 11198 35 7 40 18
3 4 20 406384 0 10531860 0 0 4044 904 2998 14055 60 9 16 14
18 5 20 397976 0 10525692 0 0 6000 671 3082 14058 55 10 15 20
11 6 20 410996 0 10528776 0 0 4828 3498 2768 13027 38 7 28 27
1 3 20 406416 0 10531860 0 0 4140 616 2496 11980 33 6 43 17

This box is a little beefier than the proposed test box, with eight 3 GHz Xeon MP CPUs and 12 GB of RAM. Other than telling PostgreSQL about the extra RAM in the effective_cache_size GUC, this box has the same postgresql.conf.

Other than cranking up the background writer settings this is the same box and configuration that stalled so badly that we were bombarded with user complaints.

-Kevin


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-31 12:35:47
Message-ID: 46D80B23.2060500@Yahoo.com
Lists: pgsql-hackers

On 8/24/2007 1:17 AM, Greg Smith wrote:
> On Thu, 23 Aug 2007, Tom Lane wrote:
>
>> It is doubtless true in a lightly loaded system, but once the kernel is
>> under any kind of memory pressure I think it's completely wrong.
>
> The fact that so many tests I've done or seen get maximum throughput in
> terms of straight TPS with the background writer turned completely off is
> why I stated that so explicitly. I understand what you're saying in terms
> of memory pressure, all I'm suggesting is that the empirical tests suggest
> the current background writer even with moderate improvements doesn't
> necessarily help when you get there. If writes are blocking, whether the
> background writer does them slightly ahead of time or whether the backend
> does them itself doesn't seem to matter very much. On a heavily loaded
> system, your throughput is bottlenecked at the disk either way--and
> therefore it's all the more important in those cases to never do a write
> until you absolutely have to, lest it be wasted.

Have you used something that, like a properly implemented TPC benchmark,
simulates users who go through cycles of think time instead of
hammering SUT interactions at the maximum possible rate allowed by the
network latency? And do your tests consider any completed transaction a
good transaction, or are they like the TPC benchmarks, which require the
majority of transactions to complete within a certain maximum response time?

Those tests will show you that inflicting an IO storm at checkpoint time
will delay processing long enough to cause a significant increase in the
number of concurrent transactions, by giving the "users" enough time to
come out of their think time. That spike in active transactions increases
pressure on CPU, memory and IO ... and eventually leads to the situation
where users submit new transactions at a higher rate than you can
currently commit ... which is where you enter the spiral of death.

Observing that very symptom during my TPC-W tests several years ago was
what led to developing the background writer in the first place. Can
your tests demonstrate improvements for this kind of (typical web
application) load profile?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-31 12:46:28
Message-ID: 46D80DA4.8010107@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 8/24/2007 8:41 AM, Heikki Linnakangas wrote:
> If anyone out there has a repeatable test case where bgwriter does help,
> I'm all ears. The theory of moving the writes out of the critical path
> does sound reasonable, so I'm sure there is a test case to demonstrate the
> effect, but it seems to be pretty darn hard to find.

One could try to dust off this TPC-W benchmark.

http://pgfoundry.org/projects/tpc-w-php/

Again, the original theory for the bgwriter wasn't moving writes out of
the critical path, but smoothing response times that tended to go
completely down the toilet during checkpointing, causing all the users
to wake up and overload the system entirely.

It is well known that any kind of bgwriter configuration other than OFF
does increase the total IO cost. But you will find that everyone who has
SLAs that define maximum response times will happily increase the IO
bandwidth to give an aggressively configured bgwriter room to work.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-08-31 16:16:49
Message-ID: Pine.GSO.4.64.0708311127530.1643@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 31 Aug 2007, Jan Wieck wrote:

> Again, the original theory for the bgwriter wasn't moving writes out of the
> critical path, but smoothing response times that tended to go completely down
> the toilet during checkpointing, causing all the users to wake up and
> overload the system entirely.

As far as I'm concerned, that function of the background writer has been
replaced by the load distributed checkpoint features now controlled by
checkpoint_completion_target, which is believed to be a better solution in
several respects. I've been trying to motivate people happily using the
current background writer to confirm or deny that during beta, while
there's still time to put the all-scan portion that was removed back
again.

The open issue I'm working on is whether the LRU cleaner running in
advance of the Strategy point is still a worthwhile addition on top of
that.

My own tests with pgbench that I'm busy wrapping up today haven't provided
many strong conclusions here; the raw data is now on-line at
http://www.westnet.com/~gsmith/content/bgwriter/ , and I'm working on
summarizing it usefully and bundling the toolchain I used to run all
those. I'll take a look at whether TPC-W provides a helpfully different
view here, because as far as I'm aware that's a test neither Heikki nor I
have tried yet to investigate this area.

> It is well known that any kind of bgwriter configuration other than OFF does
> increase the total IO cost. But you will find that everyone who has SLAs
> that define maximum response times will happily increase the IO bandwidth to
> give an aggressively configured bgwriter room to work.

The old background writer couldn't be configured to be aggressive enough
to satisfy some SLAs because of interactions with the underlying operating
system write caches. It actually made things worse in some situations
because at the point when you hit a checkpoint, the OS/disk controller
caches were already filled to capacity with writes of active pages, many
of which were now being written again. Had you just left the background
writer off, those caches would have had less data in them and been better
able to absorb the storm of writes that comes with the checkpoint. This is
particularly true in the situation where you have a large caching disk
controller that might chew through gigabytes worth of shared_buffers almost
instantly were it mostly clean when the checkpoint storm begins, but if the
background writer has been busy pounding at it then it's already full of
data at checkpoint time.

We just talked about this for a bit at Bruce's back in July; the hardware
you did your development against and what people are deploying nowadays
are so different that the entire character of the problem has changed.
The ability of the processors and memory to create dirty pages has gone up
by at least one order of magnitude, and the sophistication of the disk
controller on a high-end PostgreSQL server is pretty high now; the speed
of the underlying disks hasn't kept pace, and that gap has been making
this particular problem worse every year.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-09-04 21:11:04
Message-ID: 200709041411.05479.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg,

> As far as I'm concerned, that function of the background writer has been
> replaced by the load distributed checkpoint features now controlled by
> checkpoint_completion_target, which is believed to be a better solution
> in several respects. I've been trying to motivate people happily using
> the current background writer to confirm or deny that during beta, while
> there's still time to put the all-scan portion that was removed back
> again.

In about 200 benchmark test runs, I don't feel like we ever came up with a
set of bgwriter settings we'd happily recommend to others. So it's hard
for me to tell whether this is true or not.

> The open issue I'm working on is whether the LRU cleaner running in
> advance of the Strategy point is still a worthwhile addition on top of
> that.
>
> My own tests with pgbench that I'm busy wrapping up today haven't
> provided many strong conclusions here; the raw data is now on-line at
> http://www.westnet.com/~gsmith/content/bgwriter/ , and I'm working on
> summarizing it usefully and bundling the toolchain I used to run all
> those. I'll take a look at whether TPC-W provides a helpfully different
> view here, because as far as I'm aware that's a test neither Heikki nor I
> have tried yet to investigate this area.

Can you send me the current version of the patch, plus some bgwriter
settings to try with it, so we can throw it on some of the Sun benchmarks?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-09-05 16:28:06
Message-ID: Pine.GSO.4.64.0709051222390.27984@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 4 Sep 2007, Josh Berkus wrote:

> In about 200 benchmark test runs, I don't feel like we ever came up with a
> set of bgwriter settings we'd happily recommend to others. So it's hard
> for me to tell whether this is true or not.

Are you talking about 200 runs with 8.2.4 or 8.3? If you've collected a
bunch of 8.3 data, that's something I haven't been able to do; if what
you're saying is that you never found settings with 8.2.4 that you'd
recommend, that's consistent with what I was saying.

> Can you send me the current version of the patch, plus some bgwriter
> settings to try with it, so we can throw it on some of the Sun benchmarks?

I'm in the middle of wrapping this up today and will send out a patch for
everyone to try shortly. Tests are done, the patch is done for now, and I'm
just writing the results up and making my tests reproducible. I had some
unexpected inspiration the other day that dragged things out, but with
useful improvements.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-09-05 16:52:40
Message-ID: 200709050952.40221.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg,

> Are you talking about 200 runs with 8.2.4 or 8.3?

8.2.4.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-09-05 18:54:26
Message-ID: Pine.GSO.4.64.0709051443300.17248@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 5 Sep 2007, Josh Berkus wrote:

>> Are you talking about 200 runs with 8.2.4 or 8.3?
> 8.2.4.

Right, then we're in agreement here. I did something like 4000 small test
runs with dozens of settings under various 8.2.X releases and my
conclusion was that in the general case, it just didn't work at reducing
checkpoint spikes the way it was supposed to. Your statement that you
never found a "set of bgwriter settings we'd happily recommend to others"
was also the case for me.

While there certainly are some cases where we've heard about people whose
workloads were such that the background writer worked successfully for
them, I consider those lucky rather than normal. I'd like those people to
test 8.3 because I'd hate to see the changes made to improve the general
case cause a regression for them.

You are certainly spot-on that this causes a bit of a problem for testing
8.3 in beta, because if you come from a world-view where the 8.2.4
background writer was never successful it's hard to figure out a starting
point for comparing it to the one in 8.3. Maybe I'll spark some ideas
when I get the rest of my data out here soon.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>,<pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Final background writer cleanup for 8.3
Date: 2007-09-05 20:22:08
Message-ID: 46DEC99E.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>> On Wed, Sep 5, 2007 at 1:54 PM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0709051443300(dot)17248(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> On Wed, 5 Sep 2007, Josh Berkus wrote:
>
> While there certainly are some cases where we've heard about people whose
> workloads were such that the background writer worked successfully for
> them, I consider those lucky rather than normal. I'd like those people to
> test 8.3 because I'd hate to see the changes made to improve the general
> case cause a regression for them.

Being one of the lucky ones, I'm still hopeful that I'll be able to do
these tests. I think I know how to tailor the load so that we see the
problem often enough to get useful benchmarks (we tended to see the
problem a few times per day in actual 24/7 production).

My plan would be to run 8.2.4 with the background writer turned off to
establish a baseline. I think that any test, to be meaningful, would need
to run for several hours, with the first half hour discarded as just being
enough to establish the testing state.

Then I would test our aggressive background writer settings under 8.2.4 to
confirm that those settings do handle the problem in this test
environment.

Then I would test the new background writer with synchronous commits under
the 8.3 beta, using various settings. The 0.5, 0.7 and 0.9 settings you
recommended for a test are how far from the LRU end of the cache to look
for dirty pages to write, correct? Is there any upper bound, as long as I
keep it below 1? Are the current shared memory and the 1 GB you suggested
enough of a spread for these tests? (At several hours per test in order
to get meaningful results, I don't want to get into too many permutations.)

Finally, I would try the new checkpoint techniques, with and without the
new background writer. Any suggestions on where to set the knobs for
those runs?

I'm inclined to think that it would be interesting to try the benchmarks
with the backend writing any dirty page through to the OS at the same time
it is written to the PostgreSQL cache, as a reference point at the
opposite extreme from having the cache hold onto dirty pages for as long
as possible before sharing them with the OS. Do you see any value in
getting actual numbers for that?

> this causes a bit of a problem for testing
> 8.3 in beta, because if you come from a world-view where the 8.2.4
> background writer was never successful it's hard to figure out a starting
> point for comparing it to the one in 8.3.

In terms of comparing the new technique to the old, one would approach the
new technique by turning off the "all" scan and setting the lru scan
percentage to 50% or more, right? (I mean, obviously there would be more
CPU time used as it scanned through clean pages repeatedly, but it would
be a rough analogy otherwise, yes?)

-Kevin


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Testing 8.3 LDC vs. 8.2.4 with aggressive BGW
Date: 2007-09-11 05:06:34
Message-ID: Pine.GSO.4.64.0709071630180.10175@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:

> Then I would test the new background writer with synchronous commits under
> the 8.3 beta, using various settings. The 0.5, 0.7 and 0.9 settings you
> recommended for a test are how far from the LRU end of the cache to look
> for dirty pages to write, correct?

This is alluding to the suggestions I gave at
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php

checkpoint_completion_target has nothing to do with the LRU, so let's step
back to fundamentals and talk about what it actually does. The official
documentation is at
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

As you generate transactions, Postgres puts data into the WAL. The WAL is
organized into segments that are typically 16MB each. Periodically, the
system hits a checkpoint where the WAL data up to a certain point is
guaranteed to have been applied to the database, at which point the old
WAL files aren't needed anymore and can be reused. These checkpoints are
generally caused by one of two things happening:

1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last
checkpoint
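
In postgresql.conf terms, those two triggers are controlled like this (the
values shown here are only illustrative, not a recommendation):

   checkpoint_segments = 10     # checkpoint after this many 16MB WAL segments fill up
   checkpoint_timeout = 5min    # ...or after this much time since the last one, whichever comes first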

The system doesn't stop working while the checkpoint is happening; it just
keeps creating new WAL files. As long as each checkpoint finishes before
the next one is required, performance should be fine.

In the 8.2 model, processing the checkpoint occurs as fast as data can be
written to disk. In 8.3, the writes can be spread out instead. What
checkpoint_completion_target does is suggest how far along the system
should aim to have finished the current checkpoint relative to when the
next one is expected.

For example, your current system has checkpoint_segments=10. Assume that
you have checkpoint_timeout set to a large number such that the
checkpoints are typically being driven by the number of segments being
filled (so you get a checkpoint every 10 WAL segments, period). If
checkpoint_completion_target was set to 0.5, the expectation is that the
writes for the currently executing checkpoint would be finished about the
time that 0.5*10=5 segments of new WAL data had been written. If you set
it to 0.9 instead, you'd expect the checkpoint to finish just about
when the 9th WAL segment is being written out, which is cutting things a
bit tight; somewhere around there is the safe upper limit for that
parameter.

Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that
on your current system that's forcing very regular checkpoints, which
makes each individual checkpoint have less work to do and therefore
reduces the impact of the spikes you're trying to avoid. With LDC and
checkpoint_completion_target, you can make that number much bigger (I
suggested 50), which means you'll only have 1/5 as many checkpoints
causing I/O spikes, and each of those checkpoints will have 5X as long to
potentially spread the writes over. The main cost is that it will take
longer to recover if your database crashes, which hopefully is a rare
event.
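
To put that in concrete terms, the sort of 8.3 configuration I'm describing
would look roughly like this (illustrative numbers rather than tuned ones;
the 0.5/0.7/0.9 figures are the completion target values I suggested trying):

   checkpoint_segments = 50               # far fewer, but larger, checkpoints
   checkpoint_completion_target = 0.9     # spread each checkpoint's writes out (try 0.5-0.9)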

Having far fewer checkpoints is obviously a win for your situation, but the
open question is whether this fashion of spreading them out will reduce
the I/O spike as effectively as the all-scan background writer in 8.2 has
been working for you. This is one aspect that makes your comparison a
bit tricky. It's possible that by increasing the segments enough, you'll
get into a situation where you don't see (m)any of them during your
testing run of 8.3. You should try and collect some data on how regularly
checkpoints are happening during early testing to get an idea if this is a
possibility. The usual approach is to set checkpoint_warning to a really
high number (like the maximum of 3600) and then you'll get a harmless note
in the logs every time one happens, and that will show you how frequently
they're happening. It's kind of important to have an idea how many
checkpoints you can expect during each test run to put together a fair
comparison; as you increase checkpoint_segments, you need to adopt a
mindset that is considering "how many sluggish transactions am I seeing
per checkpoint?", not how many total per test run.
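
That monitoring trick is just this one line in postgresql.conf:

   checkpoint_warning = 3600   # note in the log whenever segment-driven checkpoints come less than an hour apart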

I have a backport of some of the pg_stat_bgwriter features added to 8.3
that can be applied to 8.2, which might be helpful for monitoring your test
benchmarking server (it is most certainly *not* suitable to go onto the
real one); it's at
http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm if you
want to take a look. I put that together specifically to allow easier
comparisons of 8.2 and 8.3 in this area.
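
For example, sampling something like this before and after each test run
gives a quick picture of who is doing the writes (these are the column names
in the 8.3 view; I'm assuming the backport exposes the same set):

   SELECT checkpoints_timed, checkpoints_req,  -- how each checkpoint was triggered
          buffers_checkpoint,                  -- buffers written by checkpoints
          buffers_clean,                       -- buffers written by the LRU cleaner
          buffers_backend                      -- buffers the backends wrote themselves
     FROM pg_stat_bgwriter;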

> Are the current shared memory and the 1 GB you suggested enough of a
> spread for these tests? (At several hours per test in order to get
> meaningful results, I don't want to get into too many permutations.)

Having a much larger shared_buffers setting should allow you to keep more
data in memory usefully, which may lead to an overall performance gain due
to improved efficiency. With your current configuration, I would guess
that making the buffer cache bigger would increase the checkpoint spike
problems, whereas that shouldn't be as much of a problem with 8.3 because of
how the checkpoint can be spread out. The hope here is that by letting
PostgreSQL cache more and avoiding writes of popular buffers except at
checkpoint time, your total I/O will be significantly lower with 8.3
compared to how much an aggressive BGW will write in 8.2. Right now,
you've got a pretty low number of pages that accumulate a high usage
count; that may change if you give the buffer cache a lot more room to
work.
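
In postgresql.conf terms that amounts to nothing more than (1GB being the
figure we already discussed, not a general recommendation):

   shared_buffers = 1GB    # the larger value under discussion; memory units work in 8.2 and later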

> Finally, I would try the new checkpoint techniques, with and without the
> new background writer. Any suggestions on where to set the knobs for
> those runs?

This and your related question about simulating the new LRU behavior by
"turning off the 'all' scan and setting the lru scan percentage to 50% or
more" depend on what final form the LRU background writer ends up in.
Certainly you should consider using a higher value for the percentage and
maxpages parameters with the current form 8.3 is in, because you no longer
have the all scan doing the majority of the work. If some form
of my JIT BGW patch gets applied before beta, you'll still want to
increase maxpages but won't have to play with the percentage anymore; you
might try adjusting the multiplier setting instead.
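
As a starting point with the form the 8.3 code is in right now, that might
look something like this (values purely illustrative, and they'll need to
change if the JIT patch replaces the percentage knob with a multiplier):

   bgwriter_delay = 200ms         # how often the cleaner wakes up (the default)
   bgwriter_lru_percent = 10.0    # scan much further ahead than the 8.2 default of 1.0
   bgwriter_lru_maxpages = 100    # and let it write more than the 8.2 default of 5 per round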

> I'm inclined to think that it would be interesting to try the benchmarks
> with the backend writing any dirty page through to the OS at the same
> time they are written to the PostgreSQL cache, as a reference point at
> the opposite extreme from having the cache hold onto dirty pages for as
> long as possible before sharing them with the OS. Do you see any value
> in getting actual numbers for that?

It might be an interesting curiosity to see how this works for you, but
I'm not sure of its value to the community at large. The configuration
trend for larger systems seems to be pretty clear at this point: use
large values for shared_buffers and checkpoint_segments. Minimize total
I/O in the background writer by not writing more than you have to, and only
even consider writing buffers that are regularly going to be reused in the
near future; everything else only gets written out at checkpoint
time. I consider the fact that you've gotten good results in the past with
a radically different configuration than what's considered normal best
practice, a configuration that works around problems in 8.2, an
interesting data point. I don't see any reason that anyone would jump
from there to expecting that turning the PostgreSQL cache into what's
essentially a write-through one the way you describe here will be helpful
in most cases, and I'm not sure how you would do it anyway.

What I would encourage you to take a look at while you're doing these
experiments is radically lowering the Linux dirty_background_ratio tunable
(perhaps even to 0) to see what that does for you. From what I've seen in
the past, the caching there is more likely to be the root of your problem.
Hopefully LDC will address your issue such that you don't have to adjust
this, because it will lower efficiency considerably, but it may be the
most straightforward way to get the more timely I/O path you're obviously
looking for.
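
For reference, on Linux that adjustment is just one of (the 1 here is only an
example of "radically lower"; 0 is the extreme end I mentioned):

   # make the kernel start background writeback of dirty data almost immediately
   echo 1 > /proc/sys/vm/dirty_background_ratio
   # or, equivalently:
   sysctl -w vm.dirty_background_ratio=1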

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD