From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-03-23 20:25:27
Message-ID: CAMkU=1x6F9Syts67bH68iX3hOLkt+YR=ijq+M5qSCBQrbNCSiA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 22, 2013 at 4:06 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:

> Not yet, I figured this might be a problem and am designing test cases
> for the same. I would be glad for some help there please.
>

Perhaps this isn't the help you were looking for, but I spent a long time
looking into this a few years ago. Then I stopped and decided to work on
other things. I would recommend you do so too.

If I have to struggle to come up with an artificial test case that shows
that there is a problem, then why should I believe that there actually is a
problem? If you take a well-known problem (like, say, bad performance at
shared_buffers > 8GB (or even lower, on Windows)) and create an artificial
test case to exercise and investigate that, that is one thing. But why
invent pathological test cases with no known correspondence to reality?
There are plenty of real problems to work on, and some of them are just as
intellectually interesting as the artificial problems are.

My conclusions were:

1) If everything fits in shared_buffers, then the replacement policy
doesn't matter.

2) If shared_buffers is much smaller than RAM (the most common case, I
believe), then what mostly matters is your OS's replacement policy, not
pgsql's. Not much a pgsql hacker can do about this, other than turn into a
kernel hacker.

3) If little of the highly-used data fits in RAM, then any non-absurd
replacement policy is about as good as any other non-absurd one.

4) If most, but not quite all, of the highly-used data fits in shared_buffers
and shared_buffers takes most of RAM (or at least, most of RAM not already
needed for other things like work_mem and executables), then the
replacement policy matters a lot. But different policies suit different
work-loads, and there is little reason to think we can come up with a way
to choose between them. (Also, in these conditions, performance is very
chaotic. You can run the same algorithm for a long time, and it can
suddenly switch from good to bad or the other way around, and then stay in
that new mode for a long time). Also, even if you come up with a good
algorithm, if you make the data set 20% smaller or 20% larger, it is no
longer a good algorithm.
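The sensitivity to data-set size described above can be illustrated with a
toy simulation (a sketch only, nothing like PostgreSQL's actual code): under
strict LRU, a cyclic scan goes from near-perfect hit rates to total
thrashing the moment the data set outgrows the cache by even one page.

```python
# Toy illustration (not a benchmark): hit rate of a strict-LRU cache
# serving repeated sequential passes over ndata pages. The cliff between
# "barely fits" and "barely doesn't fit" is total.
from collections import OrderedDict

def lru_hit_rate(cache_size, ndata, npasses=10):
    cache = OrderedDict()           # insertion order == recency order
    hits = accesses = 0
    for _ in range(npasses):
        for page in range(ndata):
            accesses += 1
            if page in cache:
                hits += 1
                cache.move_to_end(page)          # mark most recently used
            else:
                if len(cache) >= cache_size:
                    cache.popitem(last=False)    # evict least recently used
                cache[page] = True
    return hits / accesses
```

With a 100-entry cache, 100 pages yield only cold-start misses, while 101
pages miss on every single access forever; real workloads are less extreme,
but the boundary region behaves just as discontinuously.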

5) Having buffers enter with usage_count=0 rather than 1 would probably be
slightly better most of the time under conditions described in 4, but there
is no way to get enough evidence of this over enough conditions to justify
making a change. And besides, how often do people run with shared_buffers
being most of RAM, and the hot data just barely not fitting in it?
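For readers less familiar with the mechanism under discussion: PostgreSQL
evicts shared buffers with a clock sweep over per-buffer usage counts. The
sketch below is a simplified model, not the real C implementation; its
entry_count parameter is hypothetical and stands in for the
usage_count=0-versus-1 question above.

```python
# Simplified clock-sweep buffer pool (illustrative only). Each resident
# page has a usage_count; a hit increments it (capped), and the sweep
# decrements counts until it finds a zero-count victim to evict.
class ClockSweepPool:
    def __init__(self, nbuffers, entry_count=1, max_count=5):
        self.nbuffers = nbuffers
        self.entry_count = entry_count   # usage_count given to new pages
        self.max_count = max_count       # cap, akin to BM_MAX_USAGE_COUNT
        self.pages = {}                  # page id -> usage_count
        self.slots = []                  # resident pages in clock order
        self.hand = 0
        self.hits = self.misses = 0

    def access(self, page):
        if page in self.pages:
            self.hits += 1
            self.pages[page] = min(self.pages[page] + 1, self.max_count)
            return
        self.misses += 1
        if len(self.slots) < self.nbuffers:
            self.slots.append(page)      # pool not yet full: just add
        else:
            # Sweep: decrement counts until some buffer reaches zero.
            while True:
                victim = self.slots[self.hand]
                if self.pages[victim] == 0:
                    del self.pages[victim]
                    self.slots[self.hand] = page
                    self.hand = (self.hand + 1) % self.nbuffers
                    break
                self.pages[victim] -= 1
                self.hand = (self.hand + 1) % self.nbuffers
        self.pages[page] = self.entry_count
```

Running the same access trace with entry_count=0 versus entry_count=1 is
one way to see why the choice only matters when eviction pressure is
constant but the hot set almost fits.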

If you want some known problems that are in this general area, we have:

1) If all data fits in RAM but not shared_buffers, and you have a very
large number of CPUs and a read-only or read-mostly workload,
then BufFreelistLock can be a major bottleneck. (But, on an Amazon
high-CPU instance, I did not see this very much. I suspect the degree of
problem depends a lot on whether you have a lot of sockets with a few CPUs
each, versus one chip with many CPUs). This is very easy to come up with
model cases for; pgbench -S -c8 -j8, for example, can often show it.

2) A major reason that people run with shared_buffers much lower than RAM
is that performance seems to suffer with shared_buffers > 8GB under
write-heavy workloads, even with spread-out checkpoints. This is
frequently reported as a real world problem, but as far as I know has never
been reduced to a simple reproducible test case. (Although there was a
recent thread, maybe "High CPU usage / load average after upgrading to
Ubuntu 12.04", that I thought might be relevant to this. I haven't had the
time to seriously study the thread, or the hardware to investigate it
myself.)

Cheers,

Jeff
