Re: Page replacement algorithm in buffer cache

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-04-04 01:37:33
Message-ID: CA+CSw_tYFdEpXG=0W8G9tudoKwd+qkJvmdoiQ+euCbTP03mx0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 4, 2013 at 3:41 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I think the original vision of the clock sweep algorithm included the
> idea that different backends could be running the sweep over different
> parts of the buffer ring concurrently. If we could get rid of the need
> to serialize that activity, it'd pretty much eliminate the bottleneck
> I should think. The problem is how to manage it to ensure that (a)
> backends aren't actually contending on the same buffers as they do this,
> and (b) there's a reasonably fair rate of usage_count decrements for
> each buffer, rather than possibly everybody ganging up on the same area
> sometimes. Thoughts?

Fairness and avoiding each others implies that some coordination is
required. One wild idea I had is to partition buffers to N partitions
and have one clock sweep each, protected by a lwlock.

To reduce contention, the clocksweep runners use something like
sched_getcpu() to determine which partition to use to find their
buffer. Using the CPU to determine the partition makes it necessary
for the process to be scheduled out in the critical section for some
other backend to contend on the lock. And if some backend does contend
on it, it is likely to reside on the same CPU and by sleeping will
make room for the lockholder to run.

To ensure fairness for buffers, every time one of the clocksweeps
wraps around a global offset counter is incremented. This re-assigns
all cpus/backends to the next partition, sort of like the mad hatters
tea party.

The scenario that I'm most worried about here is what happens when a
process holding the clocksweep lock is migrated to another CPU and
then scheduled out. The processes on the original CPU will sooner or
later block behind the lock while the processes on the CPU where the
lock holder now waits keep hogging the CPU. This will create an
imbalance that the scheduler might try to correct, possibly creating a
nasty feedback loop. It could be that in practice the scenario works
out to be too far fetched to matter, but who knows.

I don't have a slightest idea yet how the background writer would
function in this environment. But if redesigning the bgwriter
mechanism was on the table I thought I would throw the idea out here.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-04-04 01:39:00 Re: corrupt pages detected by enabling checksums
Previous Message Daniel Farina 2013-04-04 01:36:00 Re: Interesting post-mortem on a near disaster with git