Bgwriter LRU cleaning: we've been going at this all wrong

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-26 20:24:55
Message-ID: 28084.1182889495@sss.pgh.pa.us
Lists: pgsql-hackers

I just had an epiphany, I think.

As I wrote in the LDC discussion,
http://archives.postgresql.org/pgsql-patches/2007-06/msg00294.php
if the bgwriter's LRU-cleaning scan has advanced ahead of freelist.c's
clock sweep pointer, then any buffers between them are either clean,
or are pinned and/or have usage_count > 0 (in which case the bgwriter
wouldn't bother to clean them, and freelist.c wouldn't consider them
candidates for re-use). And *this invariant is not destroyed by the
activities of other backends*. A backend cannot dirty a page without
raising its usage_count from zero, and there are no race cases because
the transition states will be pinned.
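
To make that concrete, the invariant can be written as an assertion over
the stretch between the two pointers. This is only an illustrative sketch;
the accessors are hypothetical stand-ins, not the real buffer-header tests:

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical stand-ins for the real buffer-header tests. */
    extern bool buffer_is_dirty(int buf);
    extern bool buffer_is_pinned(int buf);
    extern int  buffer_usage_count(int buf);

    /* Every buffer the cleaning scan has passed, but the sweep has not
     * yet reached, is clean, or pinned, or has usage_count > 0. */
    static void
    assert_cleaning_invariant(int sweep_pos, int clean_pos, int nbuffers)
    {
        for (int buf = sweep_pos; buf != clean_pos; buf = (buf + 1) % nbuffers)
            assert(!buffer_is_dirty(buf) ||
                   buffer_is_pinned(buf) ||
                   buffer_usage_count(buf) > 0);
    }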

This means that there is absolutely no point in having the bgwriter
re-start its LRU scan from the clock sweep position each time, as
it currently does. Any pages it revisits are not going to need
cleaning. We might as well have it progress forward from where it
stopped before.

In fact, the notion of the bgwriter's cleaning scan being "in front of"
the clock sweep is entirely backward. It should try to be behind the
sweep, ie, so far ahead that it's lapped the clock sweep and is trailing
along right behind it, cleaning buffers immediately after their
usage_count falls to zero. All the rest of the buffer arena is either
clean or has positive usage_count.

This means that we don't need the bgwriter_lru_percent parameter at all;
all we need is the lru_maxpages limit on how much I/O to initiate per
wakeup. On each wakeup, the bgwriter always cleans until either it's
dumped lru_maxpages buffers, or it's caught up with the clock sweep.

There is a risk that if the clock sweep manages to lap the bgwriter,
the bgwriter would stop upon "catching up", when in reality there are
dirty pages everywhere. This is easily prevented though, if we add
to the shared BufferStrategyControl struct a counter that is incremented
each time the clock sweep wraps around to buffer zero. (Essentially
this counter stores the high-order bits of the sweep counter.) The
bgwriter can then recognize having been lapped by comparing that counter
to its own similar counter. If it does get lapped, it should advance
its work pointer to the current sweep pointer and try to get ahead
again. (There's no point in continuing to clean pages behind the sweep
when those just ahead of it are dirty.)
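
Here is a rough sketch of the whole wakeup rule, lap counter included.
Every name is illustrative rather than the actual bufmgr/freelist API;
treat it as pseudocode that happens to compile:

    #include <stdbool.h>
    #include <stdint.h>

    #define NBUFFERS 4096                    /* stand-in for NBuffers */

    typedef struct
    {
        int      nextVictimBuffer;           /* clock sweep position */
        uint32_t completePasses;             /* ++ each time the sweep wraps to 0 */
    } BufferStrategyControl;

    extern bool buffer_needs_cleaning(int buf);  /* dirty, unpinned, usage_count == 0 */
    extern void sync_one_buffer(int buf);        /* write it out */

    static int      scanBuf;                 /* bgwriter's private scan position */
    static uint32_t scanPasses;              /* bgwriter's own wrap counter */

    void
    bgwriter_lru_clean(BufferStrategyControl *sc, int lru_maxpages)
    {
        int written = 0;

        /* Lapped?  The wrap counters hold the high-order bits of each
         * position, so this is just a counter comparison. */
        if (sc->completePasses > scanPasses)
        {
            scanBuf = sc->nextVictimBuffer;  /* jump ahead and try again */
            scanPasses = sc->completePasses;
        }

        /* Clean until the I/O budget is spent, or we are a full lap
         * ahead, i.e. trailing right behind the sweep. */
        while (written < lru_maxpages &&
               (int64_t) scanPasses * NBUFFERS + scanBuf <
               (int64_t) (sc->completePasses + 1) * NBUFFERS + sc->nextVictimBuffer)
        {
            if (buffer_needs_cleaning(scanBuf))
            {
                sync_one_buffer(scanBuf);
                written++;
            }
            if (++scanBuf >= NBUFFERS)
            {
                scanBuf = 0;
                scanPasses++;
            }
        }
    }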

This idea changes the terms of discussion for Itagaki-san's
automatic-adjustment-of-lru_maxpages patch. I'm not sure we'd still
need it at all, as lru_maxpages would now be just an upper bound on the
desired I/O rate, rather than the target itself. If we do still need
such a patch, it probably needs to look a lot different than it does
now.

Comments?

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-26 21:27:16
Message-ID: Pine.GSO.4.64.0706261640570.24678@westnet.com
Lists: pgsql-hackers

On Tue, 26 Jun 2007, Tom Lane wrote:

> It should try to be behind the sweep, ie, so far ahead that it's lapped
> the clock sweep and is trailing along right behind it, cleaning buffers
> immediately after their usage_count falls to zero. All the rest of the
> buffer arena is either clean or has positive usage_count.

I've said before here that something has to fundamentally change with the
LRU writer for it to ever be really useful, because most of the time it's
executing over pages with a positive usage_count as you say here. One
idea I threw out before was to have it preemptively lower the usage counts
as it scans ahead of the sweep point and then add the pages to the free
list, which you rightly had some issues with. This suggestion of a change
so you'd expect it to follow right behind the sweep point sounds like a
better plan that should result in even fewer client back-end writes, and I
really like a plan that finally casts the LRU writer control parameter in
a MB/s context.

(Some pointers to your comments when we've gone over this neighborhood
before: http://archives.postgresql.org/pgsql-hackers/2007-03/msg00642.php
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00799.php )

I broke Itagaki-san's patch into two pieces when I was doing the review
cleanup on it specifically to make it easier to tinker with this part
without losing some of its other neat features. Heikki, did you do
anything with that LRU adjustment patch since I sent it out:
http://archives.postgresql.org/pgsql-patches/2007-05/msg00142.php

I already fixed the race condition bug you found in my version of the
code.

Unless someone else has a burning desire to implement Tom's idea faster
than me, I should be able to build this new implementation myself in the next
couple of days. I still have the test environment leftover from the last
time I worked on this code, and I think everybody else who could handle
this job has more important higher-level things they could be working on
instead.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-26 22:31:52
Message-ID: 537.1182897112@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> Unless someone else has a burning desire to implement Tom's idea faster
> than me, I should be able to build this new implementation myself in the next
> couple of days.

Sure, go for it. I'm going to work next on committing the LDC patch,
but I'll try to avoid modifying any of the code involved in the LRU
scan, so as to minimize merge problems for you. Now that we have a new
plan for this, I think we can just omit any of the parts of the LDC
patch that might have touched that code.

I realized on re-reading that I'd misstated the conditions slightly:
any time the cleaning scan falls behind the clock sweep at all (not
necessarily a whole lap) it should forcibly advance its pointer to the
current sweep position. This would mainly be relevant right at bgwriter
startup, when it's starting from the sweep position and trying to get
ahead; it might easily not be able to, until there's a lull in the
demand for new buffers. (So until that happens, the changed code would
work just the same as now: write the first lru_maxpages dirty buffers
in front of the sweep point.) The main point of this change is that when
there is a lull, the bgwriter will exploit it to get ahead, rather than
sitting on its thumbs as it does today ...
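
In terms of the sketch in my first message, that just weakens the jump
test from a whole-lap counter comparison to a comparison of absolute
positions (same illustrative names):

    /* Jump whenever the sweep is past us at all, not only a whole lap:
     * absolute position = wrap counter (high-order bits) * NBUFFERS + slot. */
    if ((int64_t) sc->completePasses * NBUFFERS + sc->nextVictimBuffer >
        (int64_t) scanPasses * NBUFFERS + scanBuf)
    {
        scanBuf = sc->nextVictimBuffer;
        scanPasses = sc->completePasses;
    }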

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Greg Smith <gsmith(at)gregsmith(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-26 22:40:29
Message-ID: 468195DD.7070900@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> I just had an epiphany, I think.
>
> As I wrote in the LDC discussion,
> http://archives.postgresql.org/pgsql-patches/2007-06/msg00294.php
> if the bgwriter's LRU-cleaning scan has advanced ahead of freelist.c's
> clock sweep pointer, then any buffers between them are either clean,
> or are pinned and/or have usage_count > 0 (in which case the bgwriter
> wouldn't bother to clean them, and freelist.c wouldn't consider them
> candidates for re-use). And *this invariant is not destroyed by the
> activities of other backends*. A backend cannot dirty a page without
> raising its usage_count from zero, and there are no race cases because
> the transition states will be pinned.
>
> This means that there is absolutely no point in having the bgwriter
> re-start its LRU scan from the clock sweep position each time, as
> it currently does. Any pages it revisits are not going to need
> cleaning. We might as well have it progress forward from where it
> stopped before.

All true so far.

Note that Itagaki-san's patch changes that, though. With the patch, the
LRU scan doesn't look for bgwriter_lru_maxpages dirty buffers to write.
Instead, it checks that there are N (where N varies based on history)
clean buffers with usage_count=0 in front of the clock sweep. If there
aren't, it writes dirty buffers until there are again.
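
Roughly, in the same illustrative style as the sketch upthread
(hypothetical helpers, not the patch's actual code):

    #include <stdbool.h>

    #define NBUFFERS 4096                    /* stand-in for NBuffers */

    extern bool buffer_is_dirty(int buf);
    extern bool buffer_is_pinned(int buf);
    extern int  buffer_usage_count(int buf);
    extern void sync_one_buffer(int buf);

    /* Keep 'target' reusable buffers in front of the sweep: scan ahead,
     * writing out dirty zero-usage buffers, until enough clean unpinned
     * usage_count == 0 buffers have been seen or the scan limit is hit. */
    static void
    lru_keep_n_clean(int sweep_pos, int target, int max_scan)
    {
        int clean_seen = 0;
        int buf = sweep_pos;

        for (int scanned = 0; clean_seen < target && scanned < max_scan; scanned++)
        {
            if (buffer_usage_count(buf) == 0 && !buffer_is_pinned(buf))
            {
                if (buffer_is_dirty(buf))
                    sync_one_buffer(buf);    /* make it reusable */
                clean_seen++;
            }
            buf = (buf + 1) % NBUFFERS;
        }
    }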

> In fact, the notion of the bgwriter's cleaning scan being "in front of"
> the clock sweep is entirely backward. It should try to be behind the
> sweep, ie, so far ahead that it's lapped the clock sweep and is trailing
> along right behind it, cleaning buffers immediately after their
> usage_count falls to zero. All the rest of the buffer arena is either
> clean or has positive usage_count.

Really? How much of the buffer cache do you think we should try to keep
clean? And how large a percentage of the buffer cache do you think has
usage_count=0 at any given point in time? I'm not sure myself, but as a
data point the usage counts on a quick DBT-2 test on my laptop look like
this:

 usagecount | count
------------+-------
          0 |  1107
          1 |  1459
          2 |   459
          3 |   235
          4 |   352
          5 |   481
            |     3

NBuffers = 4096.

That will vary widely depending on your workload, of course, but keeping
1/4 of the buffer cache clean seems like overkill to me. If any of those
buffers are re-dirtied after we write them, the write was a waste of time.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-26 22:58:36
Message-ID: 46819A1C.1020801@enterprisedb.com
Lists: pgsql-hackers

Greg Smith wrote:
> I broke Itagaki-san's patch into two pieces when I was doing the review
> cleanup on it specifically to make it easier to tinker with this part
> without losing some of its other neat features. Heikki, did you do
> anything with that LRU adjustment patch since I sent it out:
> http://archives.postgresql.org/pgsql-patches/2007-05/msg00142.php

I like the idea of breaking down the patch into two parts, though I
didn't like the bitmasked return code stuff in that first patch.

I haven't worked on that patch. I started looking at this, using
Itagaki's patch as the basis. In fact, as Tom posted his radical idea, I
was writing down my thoughts on the bgwriter patch:

I think regardless of the details of how bgwriter should work, the
design is going to have three parts:

Part 1: Keeping track of how many buffers have been requested by
backends since last bgwriter round.

Part 2: An algorithm to turn that number into the desired # of clean buffers
we should have in front of the clock hand. That could include storing
some historic data to use in the calculation.

Part 3: A way to check that we have that many clean buffers in front of
the clock hand. We needn't do that exactly; an approximation would be
enough.

Itagaki's attached patch implements part 1 in the obvious way. A trivial
implementation for part 2 is (desired # of clean buffers) = (buffers
requested since last round). For part 3, we start from the current clock
hand and scan until we've seen/cleaned enough unpinned buffers with
usage_count = 0, or until we reach bgwriter_lru_percent.
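
As a strawman for parts 1 and 2 together (all names hypothetical): backends
bump a shared counter on every buffer allocation, and each bgwriter round
folds that into a smoothed target, e.g.:

    #include <stdatomic.h>

    static atomic_uint buf_alloc_count;   /* part 1: ++ per buffer allocation */
    static double      smoothed_alloc;    /* part 2: running demand estimate */

    /* Once per bgwriter round: fold the allocations since the last round
     * into an exponentially smoothed estimate, and use that as the desired
     * number of clean buffers in front of the clock hand. */
    static int
    clean_buffer_target(void)
    {
        unsigned recent = atomic_exchange(&buf_alloc_count, 0);

        smoothed_alloc += ((double) recent - smoothed_alloc) / 16.0;
        return (int) smoothed_alloc;
    }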

I think we're good with part 1, but I'm sure everyone has their
favourite idea for 2 and 3. Let's hear them now.

> Unless someone else has a burning desire to implement Tom's idea faster
> than me, I should be able to build this new implementation myself in the next
> couple of days. I still have the test environment leftover from the
> last time I worked on this code, and I think everybody else who could
> handle this job has more important higher-level things they could be
> working on instead.

Oh, that would be great! Since you have the test environment ready, can
you try alternative patches as well, as they're proposed?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 02:31:50
Message-ID: 20070627110859.6410.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:

> Tom Lane wrote:
> > In fact, the notion of the bgwriter's cleaning scan being "in front of"
> > the clock sweep is entirely backward. It should try to be behind the
> > sweep, ie, so far ahead that it's lapped the clock sweep and is trailing
> > along right behind it, cleaning buffers immediately after their
> > usage_count falls to zero. All the rest of the buffer arena is either
> > clean or has positive usage_count.
>
> That will vary widely depending on your workload, of course, but keeping
> 1/4 of the buffer cache clean seems like overkill to me. If any of those
> buffers are re-dirtied after we write them, the write was a waste of time.

Agreed intuitively, but I don't know how often backends raise usage_count
from 0 to 1. If that rate is high, the backward bgwriter would not work well.
It seems likely to happen frequently when we use large shared buffers.

I read this as Tom changing the bgwriter LRU policy from "clean dirty pages
that will soon be recycled" to "clean dirty pages just when they turn out to
be less frequently used" -- is that right? I have another thought --
advancing the bgwriter's sweep start point a little ahead.

[buf]  0    lru    X      bgw-start                     N
       |-----|----------->|-----------------------------|

I think X=0 corresponds to the current behavior and X=N to the backward
bgwriter. Are there any other appropriate values for X? It might be good
to use statistics about buffer usage to adjust X at runtime.
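
A sketch of that knob, with X expressed as a fraction of NBuffers
(hypothetical code):

    #define NBUFFERS 4096   /* stand-in for NBuffers */

    /* x_frac = 0.0 reproduces the current behavior (start at the sweep);
     * x_frac = 1.0 is the backward bgwriter (a full lap ahead). */
    static int
    bgw_scan_start(int sweep_pos, double x_frac)
    {
        int offset = (int) (x_frac * NBUFFERS);

        return (sweep_pos + offset) % NBUFFERS;
    }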

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 04:48:25
Message-ID: Pine.GSO.4.64.0706270021100.10954@westnet.com
Lists: pgsql-hackers

On Tue, 26 Jun 2007, Heikki Linnakangas wrote:

> How much of the buffer cache do you think we should try to keep
> clean? And how large a percentage of the buffer cache do you think has
> usage_count=0 at any given point in time?

What I discovered is that most of the really bad checkpoint pause cases I
ran into involved most of the buffer cache being dirty while also having a
non-zero usage count, which left the background writer hard-pressed to
work usefully (the LRU writer couldn't do anything, and the all-scan was
writing wastefully). I was seeing >90% dirty+usage_count>0 in the really
ugly spots.

What I like about Tom's idea is that it will keep the LRU writer in the
best possible zone for that case (writing out madly right behind the LRU
sweeper as counts get to zero) while still being fine on the more normal
ones like you describe. In particular, it should cut down on how much
client backends write buffers in an overloaded case considerably.

> That will vary widely depending on your workload, of course, but keeping 1/4
> of the buffer cache clean seems like overkill to me.

What may need to happen here is to add Tom's approach, but perhaps
restrain it using the current auto-tuning LRU patch's method of estimating
how many clean buffers are needed in the near future. Particularly on
large buffer caches, the idea of getting so far ahead of the sweep that
you're looping all the way around and following right behind the clock
sweep point may be overkill, but I think it will help enormously on
smaller caches that are often very dirty.
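
In code terms the combined stop condition might look something like this
(hypothetical sketch):

    #include <stdbool.h>

    /* Stop when the I/O budget is spent, or we've caught the sweep from
     * behind (the lapping scheme), or we've already banked enough clean
     * buffers for the estimated near-term demand (the auto-tuning limit). */
    static bool
    lru_scan_done(int written, int lru_maxpages, bool caught_sweep,
                  int clean_ahead, int estimated_need)
    {
        return written >= lru_maxpages ||
               caught_sweep ||
               clean_ahead >= estimated_need;
    }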

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 04:57:30
Message-ID: Pine.GSO.4.64.0706270049000.10954@westnet.com
Lists: pgsql-hackers

On Wed, 27 Jun 2007, ITAGAKI Takahiro wrote:

> It might be good to use statistics information about buffer usage to
> modify X runtime.

I have a complete set of working code that tracks buffer usage statistics
as the background writer scans, so that it has an idea what % of the
buffer cache is dirty, how many pages have each of the various usage
counts, that sort of thing. The problem was that the existing BGW
mechanisms were so clumsy and inefficient that giving them more
information didn't make them usefully smarter. I'll revive that code
again if it looks like it may help here.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 05:14:08
Message-ID: Pine.GSO.4.64.0706270058480.10954@westnet.com
Lists: pgsql-hackers


On Tue, 26 Jun 2007, Heikki Linnakangas wrote:

> I haven't worked on [Greg's] patch. I started looking at this, using
> Itagaki's patch as the basis.

The main focus of how I reworked things was to integrate the whole thing
into the pg_stat_bgwriter mechanism. I thought that made the performance
testing a lot easier to quantify; the original patch pushed out debug info
into the logs, which wasn't as helpful to me. I didn't do much with the
actual approach; my version was still following Itagaki's basic insight
into the problem. I did change the smoothing method some, but as you say
that's up for grabs anyway.

> Since you have the test environment ready, can you try alternative
> patches as well as they're proposed?

The real upper limit on how much testing I can do is my home server's
capabilities, which for example aren't robust enough disk-wise to run
things like DBT2 on the scale I know you normally work on. I've got a disk
for the database, one for the WAL, 256MB of cache on the controller, and a
single dual-core processor; can't fit too many warehouses here.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 07:21:04
Message-ID: 874pktualb.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Greg Smith" <gsmith(at)gregsmith(dot)com> writes:

> On Tue, 26 Jun 2007, Heikki Linnakangas wrote:
>
>> How much of the buffer cache do you think we should try to keep clean? And
>> how large a percentage of the buffer cache do you think has usage_count=0 at
>> any given point in time?
>
> What I discovered is that most of the really bad checkpoint pause cases I ran
> into involved most of the buffer cache being dirty while also having a non-zero
> usage count, which left the background writer hard-pressed to work usefully
> (the LRU writer couldn't do anything, and the all-scan was writing wastefully).
> I was seeing >90% dirty+usage_count>0 in the really ugly spots.

You keep describing this as ugly but it sounds like a really good situation to
me. The higher that percentage the better your cache hit ratio is. If 80%
of the buffer cache had usage_count 0, that would indicate about an average
cache hit ratio. And with a cache hit ratio of zero, you would find as
little as 50% of the buffers with usage_count>0.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 14:38:54
Message-ID: 17390.1182955134@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> What may need to happen here is to add Tom's approach, but perhaps
> restrain it using the current auto-tuning LRU patch's method of estimating
> how many clean buffers are needed in the near future. Particularly on
> large buffer caches, the idea of getting so far ahead of the sweep that
> you're looping all the way around and following right behind the clock
> sweep point may be overkill, but I think it will help enormously on
> smaller caches that are often very dirty.

I don't really see why it's "overkill". My assumption is that it won't
really be hard to lap the clock sweep during startup --- most likely,
on its first iteration the bgwriter will see all of the cache as not a
candidate for writing (invalid, or at worst just touched) and will be
caught up before any real load materializes. So the question is not whether
it can get into that state, but whether it can stay there under load.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 15:04:50
Message-ID: 87y7i5xwtp.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I don't really see why it's "overkill".

Well I think it may be overkill in that we'll be writing out buffers that
still have a decent chance of being hit again. Effectively what we'll be doing
in the approximated LRU queue is writing out any buffer that reaches the 80%
point down the list. Even if it later gets hit and pulled up to the head
again.

I suppose that's not wrong though, the whole idea of the clock sweep is that
that's precisely the level of precision to which it makes sense to approximate
the LRU. Ie, that any point in the top 20% is equivalent to any other and when
we use a buffer we want to promote it to somewhere "near" the head but any
point in the top 20% is good enough. Then any point in the last 20% should be
effectively "good enough" too be considered a target buffer to clean as well.

If we find it's overkill then what we should consider doing is raising
BM_MAX_USAGE_COUNT. That's effectively tuning the percentage of the LRU chain
that we try to keep clean.
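
For reference, the victim test in the clock sweep is in spirit like this
(simplified, not the literal freelist.c code); each full pass decays
usage_count by one, so raising the cap lengthens how many passes a hot
buffer survives:

    #include <stdbool.h>

    #define BM_MAX_USAGE_COUNT 5    /* the cap in buf_internals.h */

    typedef struct
    {
        int usage_count;            /* 0 .. BM_MAX_USAGE_COUNT */
        int refcount;               /* pin count */
    } BufferDescSketch;

    /* One clock-sweep step over a candidate buffer: a buffer last hit at
     * the cap survives BM_MAX_USAGE_COUNT passes before becoming a victim. */
    static bool
    is_sweep_victim(BufferDescSketch *buf)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;     /* decay; not a victim this pass */
            return false;
        }
        return buf->refcount == 0;  /* cold and unpinned: take it */
    }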

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 15:17:42
Message-ID: 19037.1182957462@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> If we find it's overkill then what we should consider doing is raising
> BM_MAX_USAGE_COUNT. That's effectively tuning the percentage of the LRU chain
> that we try to keep clean.

Yeah, I don't believe anyone has tried to do performance testing for
different values of BM_MAX_USAGE_COUNT. It would be interesting to
try that after all the dust settles.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-27 21:57:47
Message-ID: Pine.GSO.4.64.0706271754130.15663@westnet.com
Lists: pgsql-hackers

On Wed, 27 Jun 2007, Gregory Stark wrote:

>> I was seeing >90% dirty+usage_count>0 in the really ugly spots.
>
> You keep describing this as ugly but it sounds like a really good situation to
> me. The higher that percentage the better your cache hit ratio is.

If your entire buffer cache is mostly filled with dirty buffers with high
usage counts, you are in for a long wait when you need new buffers
allocated, and your next checkpoint is going to be traumatic. That's all
I'm pointing out as a problem with that situation.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-28 04:19:05
Message-ID: 20070628130743.69C5.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> wrote:

> If your entire buffer cache is mostly filled with dirty buffers with high
> usage counts, you are in for a long wait when you need new buffers
> allocated and your next checkpoint is going to be traumatic.

Do you need to increase shared_buffers in such a case?

I think that condition (most buffers having high usage counts) is very
undesirable for us, and close to out-of-memory. We should deal with such
cases, of course, but isn't making more room in shared_buffers a more
effective solution?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-28 12:55:50
Message-ID: Pine.GSO.4.64.0706280852500.6275@westnet.com
Lists: pgsql-hackers

On Thu, 28 Jun 2007, ITAGAKI Takahiro wrote:

> Do you need to increase shared_buffers in such case?

If you have something going wild creating dirty buffers with a high usage
count faster than they are being written to disk, increasing the size of
the shared_buffers cache can just make the problem worse--now you have an
even bigger pile of dirty mess to shovel at checkpoint time. The existing
background writers are particularly unsuited to helping out in this
situation; I think the new planned implementation will be much better.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-29 13:07:34
Message-ID: FF080A7B-9AFE-4E7E-96FF-D7048048D70D@decibel.org
Lists: pgsql-hackers

On Jun 26, 2007, at 11:57 PM, Greg Smith wrote:
> On Wed, 27 Jun 2007, ITAGAKI Takahiro wrote:
>
>> It might be good to use statistics information about buffer usage
>> to modify X runtime.
>
> I have a complete set of working code that tracks buffer usage
> statistics as the background writer scans, so that it has an idea
> what % of the buffer cache is dirty, how many pages have each of
> the various usage counts, that sort of thing. The problem was that
> the existing BGW mechanisms were so clumsy and inefficient that
> giving them more information didn't make them usefully smarter.
> I'll revive that code again if it looks like it may help here.

Even if it's not used by bgwriter for self-tuning, having that
information available would be very useful for anyone trying to
hand-tune the system.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-29 13:13:11
Message-ID: 3852A1F4-459A-4FAF-8897-400EF02692D1@decibel.org
Lists: pgsql-hackers

On Jun 28, 2007, at 7:55 AM, Greg Smith wrote:
> On Thu, 28 Jun 2007, ITAGAKI Takahiro wrote:
>> Do you need to increase shared_buffers in such case?
>
> If you have something going wild creating dirty buffers with a high
> usage count faster than they are being written to disk, increasing
> the size of the shared_buffers cache can just make the problem
> worse--now you have an even bigger pile of dirty mess to shovel at
> checkpoint time. The existing background writers are particularly
> unsuited to helping out in this situation; I think the new planned
> implementation will be much better.

Is this still a serious issue with LDC? I share Greg Stark's concern
that we're going to end up wasting a lot of writes.

Perhaps part of the problem is that we're using a single count to
track buffer usage; perhaps we need separate counts for reads vs writes?
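
Just to make that concrete, one hypothetical shape for it (not anything
in the tree):

    #include <stdint.h>

    /* Hypothetical split of the single usage_count: track read recency and
     * write recency separately, so the cleaning scan could favour buffers
     * that are still read-hot but are no longer being re-dirtied. */
    typedef struct
    {
        uint8_t read_usage;     /* bumped on buffer hits */
        uint8_t write_usage;    /* bumped each time the page is dirtied */
    } SplitUsageCount;
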
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jim Nasby" <decibel(at)decibel(dot)org>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-29 19:04:11
Message-ID: 87abuimvkk.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Jim Nasby" <decibel(at)decibel(dot)org> writes:

> Is this still a serious issue with LDC? I share Greg Stark's concern that we're
> going to end up wasting a lot of writes.

I think that's Greg Smith's concern. I do think it's something that needs to
be measured and watched for. It'll take some serious thought just to figure
out what we need to measure.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-30 02:28:15
Message-ID: Pine.GSO.4.64.0706292145090.7521@westnet.com
Lists: pgsql-hackers

On Fri, 29 Jun 2007, Jim Nasby wrote:

> On Jun 26, 2007, at 11:57 PM, Greg Smith wrote:
>> I have a complete set of working code that tracks buffer usage
>> statistics...
>
> Even if it's not used by bgwriter for self-tuning, having that information
> available would be very useful for anyone trying to hand-tune the system.

The stats information that's in pg_stat_bgwriter combined with an
occasional snapshot of the current pg_buffercache (now with usage
counts!) is just as useful. Right before freeze, I made sure everything I
was using for hand-tuning in this area made it into one of those. Really
all I do is collect that data as I happen to be scanning the buffer cache
anyway.

The way I'm keeping track of things internally is more intrusive to
collect than something I'd like to be turned on by default just for
information, and exposing what it knows to user-space isn't done yet. I
was hoping to figure out a way to use it to help justify its overhead
before bothering to optimize and report on it. The only reason I
mentioned the code at all is because I didn't want anybody else to waste
time writing that particular routine when I already have something that
works for this purpose sitting around.

> Is this still a serious issue with LDC?

Part of the reason I'm bugged about this area is that the scenario I'm
bringing up--lots of dirty and high usage buffers in a pattern the BGW
isn't good at writing, causing buffer pool allocations to be slow--has the
potential to get even worse with LDC. Right now, if you're in this
particular failure mode, you can be "saved" by the next checkpoint because
it is going to flush all the dirty buffers out as fast as possible and
then you get to start over with a fairly clean slate. Once that stops
happening, I've observed the potential to run into this sort of breakdown
increase.

> I share Greg Stark's concern that we're going to end up wasting a lot of
> writes.

I don't think the goal is to write buffers significantly faster than
necessary to support new allocations; the idea is just to stop
scanning the same section more than once when it's not possible for
it to find new things to do there. Right now there are substantial wasted
CPU/locking resources if you try to tune the LRU writer up for a heavy
load (by doing things like increasing the percentage), as it just
keeps scanning the same high-usage count buffers over and over. With the
LRU now running during LDC, my gut feeling is its efficiency is even more
important now than it used to be. If it's wasteful of resources, that's
now even going to impact checkpoints, where before the two never happened
at the same time.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-09-26 08:31:38
Message-ID: 200709260831.l8Q8Vck10500@momjian.us
Lists: pgsql-hackers


This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Gregory Stark wrote:
>
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > I don't really see why it's "overkill".
>
> Well I think it may be overkill in that we'll be writing out buffers that
> still have a decent chance of being hit again. Effectively what we'll be doing
> in the approximated LRU queue is writing out any buffer that reaches the 80%
> point down the list. Even if it later gets hit and pulled up to the head
> again.
>
> I suppose that's not wrong though, the whole idea of the clock sweep is that
> that's precisely the level of precision to which it makes sense to approximate
> the LRU. Ie, that any point in the top 20% is equivalent to any other and when
> we use a buffer we want to promote it to somewhere "near" the head but any
> point in the top 20% is good enough. Then any point in the last 20% should be
> effectively "good enough" too be considered a target buffer to clean as well.
>
> If we find it's overkill then what we should consider doing is raising
> BM_MAX_USAGE_COUNT. That's effectively tuning the percentage of the LRU chain
> that we try to keep clean.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2008-03-11 20:46:35
Message-ID: 200803112046.m2BKkZM00474@momjian.us
Lists: pgsql-hackers


Added to TODO:

* Consider whether increasing BM_MAX_USAGE_COUNT improves performance

http://archives.postgresql.org/pgsql-hackers/2007-06/msg01007.php

---------------------------------------------------------------------------

Gregory Stark wrote:
>
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > I don't really see why it's "overkill".
>
> Well I think it may be overkill in that we'll be writing out buffers that
> still have a decent chance of being hit again. Effectively what we'll be doing
> in the approximated LRU queue is writing out any buffer that reaches the 80%
> point down the list. Even if it later gets hit and pulled up to the head
> again.
>
> I suppose that's not wrong though, the whole idea of the clock sweep is that
> that's precisely the level of precision to which it makes sense to approximate
> the LRU. Ie, that any point in the top 20% is equivalent to any other and when
> we use a buffer we want to promote it to somewhere "near" the head but any
> point in the top 20% is good enough. Then any point in the last 20% should be
> effectively "good enough" too be considered a target buffer to clean as well.
>
> If we find it's overkill then what we should consider doing is raising
> BM_MAX_USAGE_COUNT. That's effectively tuning the percentage of the LRU chain
> that we try to keep clean.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +