Re: Background LRU Writer/free list

Lists: pgsql-hackers
From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Background LRU Writer/free list
Date: 2007-04-18 13:09:11
Message-ID: Pine.GSO.4.64.0704180820220.14766@westnet.com

I'm mostly done with my review of the "Automatic adjustment of
bgwriter_lru_maxpages" patch. In addition to issues already brought up
with that code, there are some small things that need to be done to merge
it with the recent pg_stat_bgwriter patch, and I have some concerns about
its unbounded scanning of the buffer pool; I'll write that up in more
detail or just submit an improved patch as I get time this week.

But there's a fundamental question that has been bugging me, and I think
it impacts the direction that code should take. Unless I'm missing
something in my reading, buffers written out by the LRU writer aren't ever
put onto the free list. I assume this is to avoid prematurely
removing buffers that contain useful data. In cases where a substantial
percentage of the buffer cache is dirty, the LRU writer has to scan a
significant portion of the pool looking for one of the rare clean buffers,
then write it out. When a client goes to grab a free buffer afterward, it
has to scan the same section of the pool to find the now clean buffer,
which seems redundant.

With the new patch, the LRU writer is fairly well bounded in that it
doesn't write out more than it thinks it will need; you shouldn't get into
a situation where many more pages are written than will be used in the
near future. Given that mindset, shouldn't pages the LRU scan writes just
get moved onto the free list?

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-04-18 17:01:04
Message-ID: 20070418170104.GT72669@nasby.net

On Wed, Apr 18, 2007 at 09:09:11AM -0400, Greg Smith wrote:
> I'm mostly done with my review of the "Automatic adjustment of
> bgwriter_lru_maxpages" patch. In addition to issues already brought up
> with that code, there are some small things that need to be done to merge
> it with the recent pg_stat_bgwriter patch, and I have some concerns about
> its unbounded scanning of the buffer pool; I'll write that up in more
> detail or just submit an improved patch as I get time this week.
>
> But there's a fundamental question that has been bugging me, and I think
> it impacts the direction that code should take. Unless I'm missing
> something in my reading, buffers written out by the LRU writer aren't ever
> put onto the free list. I assume this is to avoid prematurely
> removing buffers that contain useful data. In cases where a substantial
> percentage of the buffer cache is dirty, the LRU writer has to scan a
> significant portion of the pool looking for one of the rare clean buffers,
> then write it out. When a client goes to grab a free buffer afterward, it
> has to scan the same section of the pool to find the now clean buffer,
> which seems redundant.
>
> With the new patch, the LRU writer is fairly well bounded in that it
> doesn't write out more than it thinks it will need; you shouldn't get into
> a situation where many more pages are written than will be used in the
> near future. Given that mindset, shouldn't pages the LRU scan writes just
> get moved onto the free list?

I've wondered the same thing myself.

If we're worried about freeing pages that we might want back, we could
change the code so that ReadBuffer would also look at the free list if
it couldn't find a page before going to the OS for it.

So if you make this change will BgBufferSync start incrementing
StrategyControl->nextVictimBuffer and decrementing buf->usage_count like
StrategyGetBuffer does now?
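
For readers unfamiliar with the sweep being referred to, here is a minimal sketch of a clock-style victim search. It is only loosely modeled on what StrategyGetBuffer is described as doing above (advance nextVictimBuffer, decrement usage_count); the ToyBuffer and ToyPool types and the toy_clock_sweep name are hypothetical stand-ins, not PostgreSQL's structures or code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for buffer descriptors; these are NOT PostgreSQL's structs. */
typedef struct {
    int usage_count;  /* decremented each time the sweep passes an unpinned buffer */
    int pinned;       /* nonzero while some backend is using the buffer */
} ToyBuffer;

typedef struct {
    ToyBuffer *buffers;
    int        nbuffers;
    int        next_victim;  /* plays the role of nextVictimBuffer */
} ToyPool;

/*
 * Advance the clock hand until an unpinned buffer with usage_count == 0
 * is found. Returns its index, or -1 if a full lap sees only pinned buffers.
 */
int toy_clock_sweep(ToyPool *pool)
{
    assert(pool != NULL && pool->nbuffers > 0);
    int trycounter = pool->nbuffers;

    for (;;) {
        int idx = pool->next_victim;
        ToyBuffer *buf = &pool->buffers[idx];

        /* The hand always moves on, whether or not this buffer is usable. */
        pool->next_victim = (idx + 1) % pool->nbuffers;

        if (!buf->pinned) {
            if (buf->usage_count == 0)
                return idx;                /* reusable victim found */
            buf->usage_count--;            /* age the buffer and keep going */
            trycounter = pool->nbuffers;   /* progress was made, reset the lap count */
        } else if (--trycounter == 0) {
            return -1;                     /* nothing but pinned buffers */
        }
    }
}
```

If BgBufferSync ran a loop like this itself, every pass would age buffers in the same way a client allocation does, which is the behavior change the question is asking about.
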
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Background LRU Writer/free list
Date: 2007-04-18 17:33:40
Message-ID: 87ps61poy3.fsf@oxford.xeocode.com


"Greg Smith" <gsmith(at)gregsmith(dot)com> writes:

> I'm mostly done with my review of the "Automatic adjustment of
> bgwriter_lru_maxpages" patch. In addition to issues already brought up with
> that code, there are some small things that need to be done to merge it with
> the recent pg_stat_bgwriter patch, and I have some concerns about its unbounded
> scanning of the buffer pool; I'll write that up in more detail or just submit
> an improved patch as I get time this week.

I had a thought on this. Instead of sleeping for a constant amount of time and
then estimating the number of pages needed for that constant amount of time
perhaps what bgwriter should be doing is sleeping for a variable amount of
time and estimating the length of time it needs to sleep to arrive at a
constant number of pages being needed.

The reason I think this may be better is that "what percentage of the shared
buffers the bgwriter allows to get old between wakeups" seems more likely to
be a universal constant that people won't have to adjust than "fixed time
interval between bgwriter cleanup operations".

Just a thought.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-04-18 17:53:06
Message-ID: 5838.1176918786@sss.pgh.pa.us

Greg Smith <gsmith(at)gregsmith(dot)com> writes:
> With the new patch, the LRU writer is fairly well bounded in that it
> doesn't write out more than it thinks it will need; you shouldn't get into
> a situation where many more pages are written than will be used in the
> near future. Given that mindset, shouldn't pages the LRU scan writes just
> get moved onto the free list?

This just seems like a really bad idea: throwing away data we might
want. Furthermore, if the page was dirty, then it's probably been
accessed more recently than adjacent pages that are clean, so
preferentially zapping just-written pages seems backwards.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-04-18 18:02:33
Message-ID: 5962.1176919353@sss.pgh.pa.us

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> I had a thought on this. Instead of sleeping for a constant amount of time and
> then estimating the number of pages needed for that constant amount of time
> perhaps what bgwriter should be doing is sleeping for a variable amount of
> time and estimating the length of time it needs to sleep to arrive at a
> constant number of pages being needed.

That's an interesting idea, but a possible problem with it is that we
can't vary the granularity of a sleep time as finely as we can vary the
number of buffers processed per iteration. Assuming that the system's
tick rate is the typical 100Hz, we have only 10ms resolution on sleep
times.
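
To make the granularity concern concrete, here is a minimal sketch of the "sleep until a constant number of pages is needed" calculation with the result rounded up to a timer tick. The function names, the max_ms clamp, and the sample rates in the note below are assumptions for illustration; only the 10ms tick comes from the discussion.

```c
#include <assert.h>

/*
 * Hypothetical helper: how long to sleep so that roughly target_pages
 * buffers are needed at wake-up, given an estimated allocation rate.
 * The result is rounded up to a whole number of ticks and clamped to max_ms.
 */
long sleep_ms_for_target(long target_pages, double pages_per_sec,
                         long tick_ms, long max_ms)
{
    assert(target_pages > 0 && pages_per_sec > 0.0);
    assert(tick_ms > 0 && max_ms >= tick_ms);

    double ideal_ms = 1000.0 * (double) target_pages / pages_per_sec;
    if (ideal_ms >= (double) max_ms)
        return max_ms;                    /* very idle system: cap the sleep */

    long ticks = (long) (ideal_ms / (double) tick_ms);
    if ((double) ticks * (double) tick_ms < ideal_ms)
        ticks++;                          /* round up to the next whole tick */
    return ticks * tick_ms;
}

/* Pages expected to be needed during a sleep of sleep_ms at the given rate. */
long pages_needed_during(long sleep_ms, double pages_per_sec)
{
    assert(sleep_ms >= 0 && pages_per_sec >= 0.0);
    return (long) (pages_per_sec * (double) sleep_ms / 1000.0);
}
```

With a target of 100 pages and a 10ms tick, a rate of 4000 pages/s gives an ideal sleep of 25ms that rounds to 30ms. At 50000 pages/s the ideal sleep is 2ms, but the shortest possible sleep is still one tick, during which about 500 pages are needed rather than 100. Past that point the only remaining knob is the number of buffers processed per iteration.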

> The reason I think this may be better is that "what percentage of the shared
> buffers the bgwriter allows to get old between wakeups" seems more likely to
> be a universal constant that people won't have to adjust than "fixed time
> interval between bgwriter cleanup operations".

Why? What you're really trying to determine, I think, is the I/O load
imposed by the bgwriter, and pages-per-second seems a pretty natural
way to think about that; percentage of shared buffers not so much.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Background LRU Writer/free list
Date: 2007-04-18 18:28:03
Message-ID: 87hcrdpmfg.fsf@oxford.xeocode.com

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Why? What you're really trying to determine, I think, is the I/O load
> imposed by the bgwriter, and pages-per-second seems a pretty natural
> way to think about that; percentage of shared buffers not so much.

What I'm saying is that pages/s will vary from system to system. Busier
systems will have higher i/o rates. So the DBA on a system with a higher
rate will want to adjust the bgwriter sleep time lower than the DBA on a
system where bgwriter isn't doing much work.

In particular I'm worried about what happens on a very busy cpu-bound system
where adjusting the sleep times would result in it deciding to not sleep at
all. On such a system sleeping for even 10ms might be too long. But we
probably don't want to make the default even as low as 10ms.

Anyways, if we have a working patch that works the other way around we could
experiment with that and see if there are actual situations where sleeping for
0ms is necessary. Perhaps a mixture of the two approaches will be necessary
anyways because of the granularity issue.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-04-19 02:57:08
Message-ID: Pine.GSO.4.64.0704182048190.18788@westnet.com

On Wed, 18 Apr 2007, Tom Lane wrote:

> Furthermore, if the page was dirty, then it's probably been accessed
> more recently than adjacent pages that are clean, so preferentially
> zapping just-written pages seems backwards.

The LRU background writer only writes out pages that have a usage_count of
0, so they can't have been accessed too recently. Assuming the buffer
allocation rate continues its historical trend, these are the pages that
are going to be written out and then allocated for something new one way
or another in the next interval; the content is expected to be lost
shortly no matter what.

As for preferring dirty pages over clean ones, on a re-read my question
wasn't as clear as I wanted it to be. I think that clean pages near the
strategy point should also be moved to the free list by the background
writer. You know clients are expected to require x buffers in the next y
ms based on the history of the server (the new piece of information
provided by the patch in the queue), and the LRU background writer is
working in advance to make them available. If you're doing all that,
doesn't it make sense to finish the job by putting the pages on the free
list, where the clients can grab them without running their own scan over
the buffer cache?
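
A minimal sketch of that idea, under stated assumptions: the background writer scans a bounded distance ahead of the strategy point, "writes" any dirty buffer with a usage_count of 0, and pushes every such reusable buffer onto a free list that clients pop from. All names here (ToyBuf, ToyPool, toy_bgwriter_pass, toy_get_free_buffer) are hypothetical, the write is simulated by clearing a flag, and locking is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NBUFFERS    8
#define TOY_FREE_END    (-1)   /* end-of-list marker */
#define TOY_NOT_IN_LIST (-2)   /* buffer is not on the free list */

typedef struct {
    int usage_count;
    int dirty;
    int free_next;     /* index of the next free buffer, or one of the markers */
} ToyBuf;

typedef struct {
    ToyBuf bufs[TOY_NBUFFERS];
    int    next_victim;   /* the "strategy point" */
    int    first_free;    /* head of the free list */
} ToyPool;

void toy_pool_init(ToyPool *p)
{
    assert(p != NULL);
    for (int i = 0; i < TOY_NBUFFERS; i++) {
        p->bufs[i].usage_count = 0;
        p->bufs[i].dirty = 0;
        p->bufs[i].free_next = TOY_NOT_IN_LIST;
    }
    p->next_victim = 0;
    p->first_free = TOY_FREE_END;
}

/*
 * Writer side: look at up to scan_limit buffers starting at the strategy
 * point and put up to `needed` reusable ones on the free list.
 * Returns how many buffers were added.
 */
int toy_bgwriter_pass(ToyPool *p, int needed, int scan_limit)
{
    assert(p != NULL && needed >= 0 && scan_limit >= 0);
    int freed = 0;
    int idx = p->next_victim;

    if (scan_limit > TOY_NBUFFERS)
        scan_limit = TOY_NBUFFERS;        /* never scan more than one lap */

    for (int i = 0; i < scan_limit && freed < needed; i++) {
        ToyBuf *b = &p->bufs[idx];
        if (b->usage_count == 0 && b->free_next == TOY_NOT_IN_LIST) {
            b->dirty = 0;                  /* stands in for writing the page out */
            b->free_next = p->first_free;  /* push onto the free list */
            p->first_free = idx;
            freed++;
        }
        idx = (idx + 1) % TOY_NBUFFERS;
    }
    return freed;
}

/*
 * Client side: pop a buffer without scanning the pool. Buffers touched since
 * they were listed are dropped from the list and skipped. Returns -1 when the
 * list is empty, in which case the caller would fall back to its own scan.
 */
int toy_get_free_buffer(ToyPool *p)
{
    assert(p != NULL);
    while (p->first_free != TOY_FREE_END) {
        int idx = p->first_free;
        ToyBuf *b = &p->bufs[idx];
        p->first_free = b->free_next;
        b->free_next = TOY_NOT_IN_LIST;
        if (b->usage_count == 0)
            return idx;
    }
    return -1;
}
```

The recheck of usage_count on the client side is one possible answer to the worry about throwing away data that turns out to be wanted: a listed buffer that gets used again is simply taken off the list instead of being recycled. Whether the writer should also advance the strategy point is left open here, as it is in the discussion.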

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-04-19 03:04:04
Message-ID: Pine.GSO.4.64.0704182259270.7075@westnet.com

On Wed, 18 Apr 2007, Jim C. Nasby wrote:

> So if you make this change will BgBufferSync start incrementing
> StrategyControl->nextVictimBuffer and decrementing buf->usage_count like
> StrategyGetBuffer does now?

Something will need to keep advancing nextVictimBuffer. I haven't really
finished the implementation yet; I just wanted to get an idea if this
was even feasible, or if there was some larger issue that made the whole
idea moot.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-04-19 04:10:45
Message-ID: Pine.GSO.4.64.0704182304290.7075@westnet.com

On Wed, 18 Apr 2007, Gregory Stark wrote:

> In particular I'm worried about what happens on a very busy cpu-bound
> system where adjusting the sleep times would result in it deciding to
> not sleep at all. On such a system sleeping for even 10ms might be too
> long... Anyways, if we have a working patch that works the other way
> around we could experiment with that and see if there are actual
> situations where sleeping for 0ms is necessary.

I've been waiting for 8.3 to settle down before packaging the prototype
auto-tuning background writer concept I'm working on (you can peek at the
code at http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c ),
which already implements some of the ideas you're talking about in your
messages today. I estimate how much of the buffer pool is dirty, use that
to compute an expected I/O rate, and try to adjust parameters to meet a
quality of service guarantee for how often the entire buffer pool is
scanned. This is one of those problems that gets more difficult the more
you dig into it; with all that done I still feel like I'm only halfway
finished, and several parts worked radically differently in reality than I
expected them to.

If you're allowing the background writer to write 1000 pages at a clip,
that's 8MB each interval. Doing that every 200ms makes for an I/O rate of
40MB/s. In a system that cares about data integrity, you'll exceed the
ability of the WAL to sustain page writes (which limits how fast you can
dirty pages) long before the interval approaches 0ms. What I do in my
code is set the interval to 200ms, compute what the maximum pages to write
must be, and if it's >1000 then I reduce the interval. I've tested
dumping into a fairly fast disk array with tons of cache and I've never
been able to get useful throughput below an 80ms interval; the OS just
clamps down and makes you wait for I/O instead regardless of how little
you intended to sleep. Eventually, it's got to hit disk, and you can only
buffer for so long before that starts to slow you down.
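
The arithmetic and the interval adjustment above can be sketched as follows. This is an illustration under stated assumptions, not the code from the linked bufmgr.c: the names are hypothetical, 1MB is treated as 1000kB to match the round figures, the interval is shrunk proportionally (the message does not say how it is reduced), and the 80ms floor is an assumption taken from the observation about useful throughput.

```c
#include <assert.h>

#define PAGE_KB              8     /* 8kB pages: 1000 pages is about 8MB */
#define DEFAULT_INTERVAL_MS  200
#define MAX_PAGES_PER_ROUND  1000
#define MIN_INTERVAL_MS      80    /* assumption: no useful gain was seen below this */

/* I/O rate in kB/s when writing pages_per_round pages every interval_ms. */
long io_rate_kb_per_sec(long pages_per_round, long interval_ms)
{
    assert(pages_per_round >= 0 && interval_ms > 0);
    return pages_per_round * PAGE_KB * 1000 / interval_ms;
}

typedef struct {
    long interval_ms;   /* how long to sleep between rounds */
    long max_pages;     /* most pages to write in one round */
} WriterPlan;

/*
 * Start from the default interval, compute the pages per round needed to
 * sustain pages_per_sec, and shorten the interval if that exceeds the cap.
 */
WriterPlan plan_round(long pages_per_sec)
{
    assert(pages_per_sec >= 0);
    WriterPlan plan;

    plan.interval_ms = DEFAULT_INTERVAL_MS;
    plan.max_pages = (pages_per_sec * plan.interval_ms + 999) / 1000;

    if (plan.max_pages > MAX_PAGES_PER_ROUND) {
        /* Interval at which exactly the cap is written each round. */
        plan.interval_ms = 1000L * MAX_PAGES_PER_ROUND / pages_per_sec;
        if (plan.interval_ms < MIN_INTERVAL_MS)
            plan.interval_ms = MIN_INTERVAL_MS;

        plan.max_pages = (pages_per_sec * plan.interval_ms + 999) / 1000;
        if (plan.max_pages > MAX_PAGES_PER_ROUND)
            plan.max_pages = MAX_PAGES_PER_ROUND;  /* demand cannot be met */
    }
    return plan;
}
```

io_rate_kb_per_sec(1000, 200) gives 40000kB/s, the 40MB/s figure above. A demand of 10000 pages/s would need 2000 pages per 200ms round, so plan_round shortens the interval to 100ms at 1000 pages; at 20000 pages/s it bottoms out at 80ms and 1000 pages, meaning the writer falls behind.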

Anyway, this is a tangent discussion. The LRU patch that's in the queue
doesn't really care if it runs with a short interval or a long one,
because it automatically scales how much work it does according to how
much time passed. I think that may only be a bit of tweaking away from a
solid solution. Tuning the all scan, which is what you're talking about
when you speak in terms of the statistics about the overall buffer pool,
is a much harder job.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-04-27 01:28:14
Message-ID: 200704270128.l3R1SEe26265@momjian.us


This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2007-09-26 08:26:30
Message-ID: 200709260826.l8Q8QUF06377@momjian.us


This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Background LRU Writer/free list
Date: 2008-03-11 15:44:00
Message-ID: 200803111544.m2BFi0D06645@momjian.us


Added to TODO:

* Consider adding buffers the BGW finds reusable to the free list

http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php

* Automatically tune bgwriter_delay based on activity rather than using a
fixed interval

http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +