Re: Bug: Buffer cache is not scan resistant

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Bug: Buffer cache is not scan resistant
Date: 2007-03-04 19:11:26
Message-ID: C21059DE.2862E%llonergan@greenplum.com
Lists: pgsql-hackers

I'm putting this out there before we publish a fix so that we can discuss
how best to fix it.

Doug and Sherry recently found the source of an important performance issue
with the Postgres shared buffer cache.

The issue is summarized like this: the buffer cache in PGSQL is not "scan
resistant" as advertised. A sequential scan of a table larger than cache
will pollute the buffer cache in almost all circumstances.

Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500
(thumper-3) with 4 cores where "bigtable" is a table 2x the size of RAM and
"memtable" is a table that fits into I/O cache:

With our default setting of shared_buffers (16MB):

Operation           memtable     bigtable
------------------------------------------
SELECT COUNT(*)     1221 MB/s     973 MB/s
VACUUM              1709 MB/s    1206 MB/s

We had observed that VACUUM would perform better when run right after a
SELECT. In the above example, VACUUM read from disk at 1608 MB/s in that
case, compared to the normal rate of 1206 MB/s.

We verified this behavior on Postgres 8.2 as well. The buffer selection
algorithm is choosing buffer pages scattered throughout the buffer cache in
almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2
cache. Doug found that we weren't using the Postgres buffer cache the way
we expected; instead, we were loading the scanned data from disk into the
cache even though there was no possibility of reusing it. Besides pushing
other, possibly useful pages out of the cache, this also invalidates the L2
cache for the remainder of the executor path that uses the data.

To prove that the buffer cache was the source of the problem, we dropped the
shared buffer size to fit into the L2 cache (1MB per Opteron core), setting
shared_buffers to 680KB, and this is what we saw:

Operation           memtable     bigtable
------------------------------------------
SELECT COUNT(*)     1320 MB/s    1059 MB/s
VACUUM              3033 MB/s    1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm? Ideally,
only one page of the buffer cache would be used when the table to be
scanned is (significantly?) larger than the buffer cache.
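
To give the proposal a concrete shape, here is a minimal sketch in C of a
small, fixed ring of buffer slots that a large scan keeps recycling. The
names and the ring size are made up and purely illustrative - this is a
picture of the idea, not how bufmgr.c is structured today.

/* Illustrative sketch only: a tiny private ring of shared-buffer slots
 * that a large seqscan cycles through, leaving the rest of the buffer
 * cache untouched.  Names and sizes are hypothetical. */
#define SCAN_RING_SLOTS 32              /* e.g. 256KB worth of 8KB pages */

typedef struct ScanRing
{
    int slots[SCAN_RING_SLOTS];         /* buffer ids owned by this scan */
    int nfilled;                        /* slots populated so far */
    int next;                           /* next slot to recycle */
} ScanRing;

/* Pick the buffer slot to load the next heap page into.  The first few
 * pages claim fresh buffers; after that the scan keeps reusing the same
 * handful of slots instead of sweeping through all of shared_buffers. */
static int
scan_ring_next_slot(ScanRing *ring, int fresh_buffer_id)
{
    if (ring->nfilled < SCAN_RING_SLOTS)
        return ring->slots[ring->nfilled++] = fresh_buffer_id;

    int slot = ring->slots[ring->next];
    ring->next = (ring->next + 1) % SCAN_RING_SLOTS;
    return slot;
}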

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 01:36:43
Message-ID: 27852.1173058603@sss.pgh.pa.us
Lists: pgsql-hackers

"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
> The issue is summarized like this: the buffer cache in PGSQL is not "scan
> resistant" as advertised.

Sure it is. As near as I can tell, your real complaint is that the
bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
which is hardly surprising considering it doesn't know the size of L2
cache. That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan:
won't reading a new disk page into memory via DMA cause that memory to
get flushed from the processor cache anyway? I wonder whether your
numbers are explained by some other consideration than you think.

regards, tom lane


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 05:03:54
Message-ID: 45EBA4BA.7070507@paradise.net.nz
Lists: pgsql-hackers

Tom Lane wrote:
> "Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
>> The issue is summarized like this: the buffer cache in PGSQL is not "scan
>> resistant" as advertised.
>
> Sure it is. As near as I can tell, your real complaint is that the
> bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
> which is hardly surprising considering it doesn't know the size of L2
> cache. That's not a consideration that we've ever taken into account.
>

To add a little to this - forgetting the scan resistant point for the
moment... cranking down shared_buffers to be smaller than the L2 cache
seems to help *any* sequential scan immensely, even on quite modest HW:

e.g.: PIII 1.26GHz, 512KB L2 cache, 2GB RAM,

SELECT count(*) FROM lineitem (which is about 11GB) performance:

Shared_buffers   Elapsed
--------------   -------
400MB            101 s
128KB             74 s

When I've profiled this activity, I've seen a lot of time spent
searching for/allocating a new buffer for each page being fetched.
Obviously having less of them to search through will help, but having
less than the L2 cache-size worth of 'em seems to help a whole lot!

Cheers

Mark


From: Gavin Sherry <swm(at)alcove(dot)com(dot)au>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 05:16:44
Message-ID: Pine.LNX.4.58.0703051614070.19071@linuxworld.com.au
Lists: pgsql-hackers

On Mon, 5 Mar 2007, Mark Kirkwood wrote:

> To add a little to this - forgetting the scan resistant point for the
> moment... cranking down shared_buffers to be smaller than the L2 cache
> seems to help *any* sequential scan immensely, even on quite modest HW:
>
> e.g: PIII 1.26Ghz 512Kb L2 cache, 2G ram,
>
> SELECT count(*) FROM lineitem (which is about 11GB) performance:
>
> Shared_buffers Elapsed
> -------------- -------
> 400MB 101 s
> 128KB 74 s
>
> When I've profiled this activity, I've seen a lot of time spent
> searching for/allocating a new buffer for each page being fetched.
> Obviously having less of them to search through will help, but having
> less than the L2 cache-size worth of 'em seems to help a whole lot!

Could you demonstrate that point by showing us timings for shared_buffers
sizes from 512K up to, say, 2 MB? The two numbers you give there might
just have to do with managing a large buffer.

Thanks,

Gavin


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 05:52:53
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF802CFC0AF@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

Gavin, Mark,

> Could you demonstrate that point by showing us timings for
> shared_buffers sizes from 512K up to, say, 2 MB? The two
> numbers you give there might just have to do with managing a
> large buffer.

I suggest two experiments that we've already done:
1) Increase shared_buffers to double the L2 cache size; you should see
that the behavior reverts to the "slow" performance and stays constant at
larger sizes.

2) Instrument the calls to BufferGetPage() (a macro) and note that the
buffer block numbers returned increase sequentially during scans of
tables larger than the buffer cache (a rough sketch of such instrumentation
follows).
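
A rough sketch of the kind of instrumentation meant in (2) - hypothetical
code, not an actual patch; the helper name and the logging are made up:

/* Hypothetical tracing wrapper (not the real BufferGetPage() macro): log
 * which shared-buffer slot each page access uses, so you can see whether a
 * seqscan reuses a handful of slots or marches through every one of them. */
#include "postgres.h"
#include "storage/bufmgr.h"

static long page_fetches = 0;

static inline Page
BufferGetPageTraced(Buffer buffer)
{
    if (BufferIsValid(buffer) && !BufferIsLocal(buffer))
        elog(DEBUG1, "fetch %ld used shared buffer slot %d",
             ++page_fetches, buffer - 1);   /* shared buffer ids are 1-based */
    return BufferGetPage(buffer);
}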

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 06:25:42
Message-ID: 3658.1173075942@sss.pgh.pa.us
Lists: pgsql-hackers

Gavin Sherry <swm(at)alcove(dot)com(dot)au> writes:
> Could you demonstrate that point by showing us timings for shared_buffers
> sizes from 512K up to, say, 2 MB? The two numbers you give there might
> just have to do with managing a large buffer.

Using PG CVS HEAD on 64-bit Intel Xeon (1MB L2 cache), Fedora Core 5,
I don't measure any noticeable difference in seqscan speed for
shared_buffers set to 32MB or 256kB. I note that the code would
not let me choose the latter setting without a large decrease in
max_connections, which might be expected to cause some performance
changes in itself.

Now this may only prove that the disk subsystem on this machine is
too cheap to let the system show any CPU-related issues. I'm seeing
a scan rate of about 43MB/sec for both count(*) and plain ol' "wc",
which is a factor of 4 or so less than Mark's numbers suggest...
but "top" shows CPU usage of less than 5%, so even with a 4x faster
disk I'd not really expect that CPU speed would become interesting.

(This is indeed a milestone, btw, because it wasn't so long ago that
count(*) was nowhere near disk speed.)

regards, tom lane


From: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 08:17:06
Message-ID: 6A435FB4-81CF-49A2-9001-E8D34581CDE3@pointblue.com.pl
Lists: pgsql-hackers


On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
> That's not a consideration that we've ever taken into account.
>
> I'm also less than convinced that it'd be helpful for a big seqscan:
> won't reading a new disk page into memory via DMA cause that memory to
> get flushed from the processor cache anyway?

Nope. DMA is writing directly into main memory. If the area was in
the L2/L1 cache, it will get invalidated. But if it isn't there, it
is okay.

--
Grzegorz Jaskiewicz
gj(at)pointblue(dot)com(dot)pl


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
Cc: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 08:24:11
Message-ID: 8324.1173083051@sss.pgh.pa.us
Lists: pgsql-hackers

Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl> writes:
> On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
>> I'm also less than convinced that it'd be helpful for a big seqscan:
>> won't reading a new disk page into memory via DMA cause that memory to
>> get flushed from the processor cache anyway?

> Nope. DMA is writing directly into main memory. If the area was in
> the L2/L1 cache, it will get invalidated. But if it isn't there, it
> is okay.

So either way, it isn't in processor cache after the read. So how can
there be any performance benefit?

regards, tom lane


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 08:28:49
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF802CFC0C0@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

> So either way, it isn't in processor cache after the read.
> So how can there be any performance benefit?

It's the copy from the kernel IO cache to the buffer cache that is L2
sensitive. When the shared buffer cache is polluted, it thrashes the L2
cache. When the set of pages being written to in the kernel->user space
copies fits in L2, the L2 lines are "written through" (see page 264 of the
link below for the write-combining features of the Opteron, for example)
and the writes to main memory are deferred.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

- Luke


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 08:30:13
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF802CFC0C2@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

Hi Tom,

> Now this may only prove that the disk subsystem on this
> machine is too cheap to let the system show any CPU-related
> issues.

Try it with a warm IO cache. As I posted before, we see double the
performance of a VACUUM from a table in IO cache when the shared buffer
cache isn't being polluted. The speed with large buffer cache should be
about 450 MB/s and the speed with a buffer cache smaller than L2 should
be about 800 MB/s.

The real issue here isn't the L2 behavior, though that's important when
trying to reach very high IO speeds; the issue is that we're seeing the
buffer cache pollution in the first place. When we instrument the blocks
selected by the buffer page selection algorithm, we see that they iterate
sequentially, filling the shared buffer cache. That's the source of the
problem here.

Do we have a regression test somewhere for this?

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
Cc: "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 08:45:17
Message-ID: 8532.1173084317@sss.pgh.pa.us
Lists: pgsql-hackers

"Luke Lonergan" <LLonergan(at)greenplum(dot)com> writes:
>> So either way, it isn't in processor cache after the read.
>> So how can there be any performance benefit?

> It's the copy from kernel IO cache to the buffer cache that is L2
> sensitive. When the shared buffer cache is polluted, it thrashes the L2
> cache. When the number of pages being written to in the kernel->user
> space writes fits in L2, then the L2 lines are "written through" (see
> the link below on page 264 for the write combining features of the
> opteron for example) and the writes to main memory are deferred.

That makes absolutely zero sense. The data coming from the disk was
certainly not in processor cache to start with, and I hope you're not
suggesting that it matters whether the *target* page of a memcpy was
already in processor cache. If the latter, it is not our bug to fix.

> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

Even granting that your conclusions are accurate, we are not in the
business of optimizing Postgres for a single CPU architecture.

regards, tom lane


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 08:51:38
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF802CFC0C6@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers


Hi Tom,

> Even granting that your conclusions are accurate, we are not
> in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point:

The Postgres shared buffer cache algorithm appears to have a bug. When
there is a sequential scan the blocks are filling the entire shared
buffer cache. This should be "fixed".

My proposal for a fix: ensure that when relations larger (much larger?)
than buffer cache are scanned, they are mapped to a single page in the
shared buffer cache.

- Luke


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 09:09:32
Message-ID: 45EBDE4C.9030709@enterprisedb.com
Lists: pgsql-hackers

Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.

It's not that simple. Using the whole buffer cache for a single seqscan
is ok, if there's currently no better use for the buffer cache. Running
a single select will indeed use the whole cache, but if you run any
other smaller queries, the pages they need should stay in cache and the
seqscan will loop through the other buffers.

In fact, the pages that are left in the cache after the seqscan finishes
would be useful for the next seqscan of the same table if we were smart
enough to read those pages first. That'd make a big difference for
seqscanning a table that's say 1.5x your RAM size. Hmm, I wonder if
Jeff's sync seqscan patch addresses that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 09:10:24
Message-ID: 1173085824.3279.1.camel@localhost.localdomain
Lists: pgsql-hackers

On one fine day, Monday, 2007-03-05 at 03:51, Luke Lonergan wrote:
> Hi Tom,
>
> > Even granting that your conclusions are accurate, we are not
> > in the business of optimizing Postgres for a single CPU architecture.
>
> I think you're missing my/our point:
>
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.

How will this approach play together with the synchronized scan patches?

Or should synchronized scans rely on the system's cache only?

> - Luke
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
Cc: "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 09:15:45
Message-ID: 9824.1173086145@sss.pgh.pa.us
Lists: pgsql-hackers

"Luke Lonergan" <LLonergan(at)greenplum(dot)com> writes:
> I think you're missing my/our point:

> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".

No, this is not a bug; it is operating as designed. The point of the
current bufmgr algorithm is to replace the page least recently used,
and that's what it's doing.

If you want to lobby for changing the algorithm, then you need to
explain why one test case on one platform justifies de-optimizing
for a lot of other cases. In almost any concurrent-access situation
I think that what you are suggesting would be a dead loss --- for
instance we might as well forget about Jeff Davis' synchronized-scan
work.

In any case, I'm still not convinced that you've identified the problem
correctly, because your explanation makes no sense to me. How can the
processor's L2 cache improve access to data that it hasn't got yet?

regards, tom lane


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 09:20:18
Message-ID: 82fy8k59nx.fsf@mid.bfk.de
Lists: pgsql-hackers

* Tom Lane:

> That makes absolutely zero sense. The data coming from the disk was
> certainly not in processor cache to start with, and I hope you're not
> suggesting that it matters whether the *target* page of a memcpy was
> already in processor cache. If the latter, it is not our bug to fix.

Uhm, if it's not in the cache, you typically need to evict some cache
lines to make room for the data, so I'd expect an indirect performance
hit. I could be mistaken, though.

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 09:41:49
Message-ID: 1173087709.3279.14.camel@localhost.localdomain
Lists: pgsql-hackers

On one fine day, Monday, 2007-03-05 at 04:15, Tom Lane wrote:
> "Luke Lonergan" <LLonergan(at)greenplum(dot)com> writes:
> > I think you're missing my/our point:
>
> > The Postgres shared buffer cache algorithm appears to have a bug. When
> > there is a sequential scan the blocks are filling the entire shared
> > buffer cache. This should be "fixed".
>
> No, this is not a bug; it is operating as designed.

Maybe he means that there is an oversight (aka "bug") in the design ;)

> The point of the
> current bufmgr algorithm is to replace the page least recently used,
> and that's what it's doing.
>
> If you want to lobby for changing the algorithm, then you need to
> explain why one test case on one platform justifies de-optimizing
> for a lot of other cases.

If you know beforehand that you will definitely overflow the cache and not
reuse it anytime soon, then it seems quite reasonable to not even start
polluting the cache. Especially if you get a noticeable boost in
performance while doing so.

> In almost any concurrent-access situation
> I think that what you are suggesting would be a dead loss

Only if the concurrent access pattern is over data mostly fitting in the
buffer cache. If we can avoid polluting the buffer cache with data we know
we will use only once, more useful data will be available.

> --- for
> instance we might as well forget about Jeff Davis' synchronized-scan
> work.

Depends on the ratio of system cache to shared buffer cache. I don't think
Jeff's patch is anywhere near the point where it needs to start worrying
about data swapping between the system cache and shared buffers, or about
L2 cache usage.

> In any case, I'm still not convinced that you've identified the problem
> correctly, because your explanation makes no sense to me. How can the
> processor's L2 cache improve access to data that it hasn't got yet?
>
> regards, tom lane
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 09:58:55
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF802CFC0D6@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

> > The Postgres shared buffer cache algorithm appears to have a bug.
> > When there is a sequential scan the blocks are filling the entire
> > shared buffer cache. This should be "fixed".
>
> No, this is not a bug; it is operating as designed. The
> point of the current bufmgr algorithm is to replace the page
> least recently used, and that's what it's doing.

At least we've established that for certain.

> If you want to lobby for changing the algorithm, then you
> need to explain why one test case on one platform justifies
> de-optimizing for a lot of other cases. In almost any
> concurrent-access situation I think that what you are
> suggesting would be a dead loss --- for instance we might as
> well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.

> In any case, I'm still not convinced that you've identified
> the problem correctly, because your explanation makes no
> sense to me. How can the processor's L2 cache improve access
> to data that it hasn't got yet?

The evidence seems to clearly indicate reduced memory writing due to an
L2 related effect. The actual data shows a dramatic reduction in main
memory writing when the destination of the written data fits in the L2
cache.

I'll try to fit a hypothesis to explain it. Assume you've got a warm IO
cache in the OS.

The heapscan algorithm now works like this:
0) select a destination user buffer
1) uiomove->kcopy memory from the IO cache to the user buffer
1A) step 1: read from kernel space
1B) step 2: write to user space
2) the user buffer is accessed many times by the executor nodes above
Repeat

There are two situations we are evaluating: one where the addresses of
the user buffer are scattered over a space larger than the size of L2
(caseA) and one where they are confined to the size of L2 (caseB). Note
that we could also consider another situation where the addresses are
scattered over a space smaller than the TLB entries mapped by the L2
cache (512 max) and larger than the size of L2, but we've tried that and
it proved uninteresting.

For both cases step 1A is the same: each 8KB block copied in (1) is read
from the IO cache through 128 L2 lines (64B each), evicting the previous
data there.

In step 1B for caseA the destination for the writes is mostly an address
not currently mapped into L2 cache, so 128 victim L2 lines are found
(LRU), stored into, and writes are flushed to main memory.

In step 1B for caseB, the destination for the writes is located in L2
already. The 128 L2 lines are stored into, and the write to main memory
is delayed under the assumption that these lines are "hot" as they were
already in L2.

I don't know enough to be sure this is the right answer, but it does fit
the experimental data.
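
To make the hypothesis testable outside the database, a small stand-alone
microbenchmark along these lines would do (an illustrative sketch, not
something we ran; the sizes are arbitrary): copy 8KB pages from a large warm
source into a destination that either fits in L2 (caseB) or overflows it
(caseA), and compare the elapsed times.

/* Stand-in for the kernel->user page copy: time copying into a small
 * destination that fits in L2 versus a large one that does not.
 * Illustrative only; error checking is omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ  8192
#define N_COPIES (256L * 1024)              /* about 2GB worth of page copies */

static double
copy_pages(const char *src, size_t src_sz, char *dst, size_t dst_sz)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N_COPIES; i++)
    {
        const char *from = src + (size_t) (i % (long) (src_sz / PAGE_SZ)) * PAGE_SZ;
        char       *to   = dst + (size_t) (i % (long) (dst_sz / PAGE_SZ)) * PAGE_SZ;

        memcpy(to, from, PAGE_SZ);          /* the "uiomove/kcopy" step */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(void)
{
    size_t  src_sz = 256UL << 20;           /* warm "IO cache" stand-in */
    size_t  caseB  = 512UL << 10;           /* destination fits in L2 */
    size_t  caseA  = 64UL << 20;            /* destination overflows L2 */
    char   *src    = malloc(src_sz);
    char   *small  = malloc(caseB);
    char   *large  = malloc(caseA);

    memset(src, 1, src_sz);                 /* touch the source so it is warm */
    printf("caseB (fits in L2):   %.2f s\n", copy_pages(src, src_sz, small, caseB));
    printf("caseA (overflows L2): %.2f s\n", copy_pages(src, src_sz, large, caseA));
    return 0;
}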

- Luke


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Gavin Sherry <swm(at)alcove(dot)com(dot)au>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 10:00:13
Message-ID: 45EBEA2D.1010200@paradise.net.nz
Lists: pgsql-hackers

Gavin Sherry wrote:
> On Mon, 5 Mar 2007, Mark Kirkwood wrote:
>
>> To add a little to this - forgetting the scan resistant point for the
>> moment... cranking down shared_buffers to be smaller than the L2 cache
>> seems to help *any* sequential scan immensely, even on quite modest HW:
>>
> (snipped)
>> When I've profiled this activity, I've seen a lot of time spent
>> searching for/allocating a new buffer for each page being fetched.
>> Obviously having less of them to search through will help, but having
>> less than the L2 cache-size worth of 'em seems to help a whole lot!
>
> Could you demonstrate that point by showing us timings for shared_buffers
> sizes from 512K up to, say, 2 MB? The two numbers you give there might
> just have to do with managing a large buffer.

Yeah - good point:

PIII 1.26GHz, 512KB L2 cache, 2GB RAM

Test is elapsed time for: SELECT count(*) FROM lineitem

lineitem has 1535724 pages (11997 MB)

Shared Buffers   Elapsed   IO rate (from vmstat)
--------------   -------   ---------------------
400MB            101 s     122 MB/s

2MB              100 s
1MB               97 s
768KB             93 s
512KB             86 s
256KB             77 s
128KB             74 s     166 MB/s

I've added the observed IO rate for the two extreme cases (the rest can
be pretty much deduced via interpolation).

Note that the system will do about 220 MB/s with the now (in)famous dd
test, so we have a bit of headroom (not too bad for a PIII).

Cheers

Mark


From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 10:04:09
Message-ID: C3E62232E3BCF24CBA20D72BFDCB6BF802CFC0D7@MI8NYCMAIL08.Mi8.com
Lists: pgsql-hackers

Hi Mark,

> lineitem has 1535724 pages (11997 MB)
>
> Shared Buffers Elapsed IO rate (from vmstat)
> -------------- ------- ---------------------
> 400MB 101 s 122 MB/s
>
> 2MB 100 s
> 1MB 97 s
> 768KB 93 s
> 512KB 86 s
> 256KB 77 s
> 128KB 74 s 166 MB/s
>
> I've added the observed IO rate for the two extreme cases
> (the rest can be pretty much deduced via interpolation).
>
> Note that the system will do about 220 MB/s with the now
> (in)famous dd test, so we have a bit of headroom (not too bad
> for a PIII).

What's really interesting: try this with a table that fits into I/O
cache (say half your system memory), and run VACUUM on the table. That
way the effect will stand out more dramatically.

- Luke


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 10:10:47
Message-ID: 87slckdmqg.fsf@stark.xeocode.com
Lists: pgsql-hackers

"Luke Lonergan" <LLonergan(at)greenplum(dot)com> writes:

> The evidence seems to clearly indicate reduced memory writing due to an
> L2 related effect.

You might try using valgrind's cachegrind tool which I understand can actually
emulate various processors' cache to show how efficiently code uses it. I
haven't done much with it though so I don't know how applicable it would be to
a large-scale effect like this.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 16:53:36
Message-ID: 19602.1173113616@sss.pgh.pa.us
Lists: pgsql-hackers

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Shared Buffers Elapsed IO rate (from vmstat)
> -------------- ------- ---------------------
> 400MB 101 s 122 MB/s
> 2MB 100 s
> 1MB 97 s
> 768KB 93 s
> 512KB 86 s
> 256KB 77 s
> 128KB 74 s 166 MB/s

Hm, that seems to blow the "it's an L2 cache effect" theory out of the
water. If it were a cache effect then there should be a performance
cliff at the point where the cache size is exceeded. I see no such
cliff, in fact the middle part of the curve is darn near a straight
line on a log scale ...

So I'm back to asking what we're really measuring here. Buffer manager
inefficiency of some sort, but what? Have you tried oprofile?

regards, tom lane


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 17:22:29
Message-ID: C21191D5.2875A%llonergan@greenplum.com
Lists: pgsql-hackers

Hi Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water. If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded. I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...
>
> So I'm back to asking what we're really measuring here. Buffer manager
> inefficiency of some sort, but what? Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:
cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all
four). Below are the traces of L2 fills for both cases: in the no-pollution
case we fill 27 million lines, in the pollution case we fill 44 million lines.

VACUUM orders (no buffer pollution):
51.006 1 tick 2754293
51.006 2 tick 3159565
51.006 3 tick 2971929
51.007 0 tick 3577487
52.006 1 tick 4214179
52.006 3 tick 3650193
52.006 2 tick 3905828
52.007 0 tick 3465261
53.006 1 tick 1818766
53.006 3 tick 1546018
53.006 2 tick 1709385
53.007 0 tick 1483371

And here is the case with buffer pollution:
VACUUM orders (with buffer pollution)
22.006 0 tick 1576114
22.006 1 tick 1542604
22.006 2 tick 1987366
22.006 3 tick 1784567
23.006 3 tick 2706059
23.006 2 tick 2362048
23.006 0 tick 2190719
23.006 1 tick 2088827
24.006 0 tick 2247473
24.006 1 tick 2153850
24.006 2 tick 2422730
24.006 3 tick 2758795
25.006 0 tick 2419436
25.006 1 tick 2229602
25.006 2 tick 2619333
25.006 3 tick 2712332
26.006 1 tick 1827923
26.006 2 tick 1886556
26.006 3 tick 2909746
26.006 0 tick 1467164


From: "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 17:22:57
Message-ID: 45EC51F1.7050602@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
>
>> Shared Buffers Elapsed IO rate (from vmstat)
>> -------------- ------- ---------------------
>> 400MB 101 s 122 MB/s
>> 2MB 100 s
>> 1MB 97 s
>> 768KB 93 s
>> 512KB 86 s
>> 256KB 77 s
>> 128KB 74 s 166 MB/s
>>
>
> So I'm back to asking what we're really measuring here. Buffer manager
> inefficiency of some sort, but what? Have you tried oprofile?
>
Isn't the size of the shared buffer pool itself acting as a performance
penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
passes over the buffers before the usage_count of any buffer is reduced
to zero and the buffer is chosen as the replacement victim.

There is no real advantage to having a larger shared buffer pool in this
particular test. A heap buffer is hardly accessed again once the seqscan
passes over it. Can we try a smaller value for BM_MAX_USAGE_COUNT and see
if that has any positive impact for a large shared pool in this case?

Thanks,
Pavan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 17:44:25
Message-ID: 20200.1173116665@sss.pgh.pa.us
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(at)enterprisedb(dot)com> writes:
> Isn't the size of the shared buffer pool itself acting as a performance
> penalty in this case ? May be StrategyGetBuffer() needs to make multiple
> passes over the buffers before the usage_count of any buffer is reduced
> to zero and the buffer is chosen as replacement victim.

I read that and thought you were onto something, but it's not acting
quite the way I expect. I made a quick hack in StrategyGetBuffer() to
count the number of buffers it looks at before finding a victim.
Running it with just 32 buffers on a large count(*), the behavior after
the initial startup transient is quite odd:

got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
[... the same 32-buffer cycle repeats identically for the rest of the run ...]

Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
do either, so I'm a bit at a loss what's going on --- but in any case,
it doesn't look like we are cycling through the entire buffer space
for each fetch.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 18:20:26
Message-ID: 200703051020.26226.josh@agliodbs.com
Lists: pgsql-hackers

Tom,

> Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
> do either, so I'm a bit at a loss what's going on --- but in any case,
> it doesn't look like we are cycling through the entire buffer space
> for each fetch.

I'd be happy to DTrace it, but I'm a little lost as to where to look in the
kernel. I'll see if I can find someone who knows more about memory
management than me (that ought to be easy).

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 18:21:05
Message-ID: C2119F91.28776%llonergan@greenplum.com
Lists: pgsql-hackers

Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water. If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded. I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...

Here's that cliff you were looking for:

Size of Orders table: 7178MB
Blocksize: 8KB

Shared_buffers   Select Count   Vacuum
(KB)             (s)            (s)
=======================================
  248            5.52           2.46
  368            4.77           2.40
  552            5.82           2.40
  824            6.20           2.43
 1232            5.60           3.59
 1848            6.02           3.14
 2768            5.53           4.56

All of these were run three times and the *lowest* time reported. Also, the
behavior of "fast VACUUM after SELECT" begins abruptly at 1232KB of
shared_buffers.

These are Opterons with 2MB of L2 cache shared between two cores.

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 18:24:45
Message-ID: 20614.1173119085@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> "Pavan Deolasee" <pavan(at)enterprisedb(dot)com> writes:
>> Isn't the size of the shared buffer pool itself acting as a performance
>> penalty in this case ? May be StrategyGetBuffer() needs to make multiple
>> passes over the buffers before the usage_count of any buffer is reduced
>> to zero and the buffer is chosen as replacement victim.

> I read that and thought you were onto something, but it's not acting
> quite the way I expect. I made a quick hack in StrategyGetBuffer() to
> count the number of buffers it looks at before finding a victim.
> ...
> Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
> do either, so I'm a bit at a loss what's going on --- but in any case,
> it doesn't look like we are cycling through the entire buffer space
> for each fetch.

Nope, Pavan's nailed it: the problem is that after using a buffer, the
seqscan leaves it with usage_count = 1, which means it has to be passed
over once by the clock sweep before it can be re-used. I was misled in
the 32-buffer case because catalog accesses during startup had left the
buffer state pretty confused, so that there was no long stretch before
hitting something available. With a large number of buffers, the
behavior is that the seqscan fills all of shared memory with buffers
having usage_count 1. Once the clock sweep returns to the first of
these buffers, it will have to pass over all of them, reducing all of
their counts to 0, before it returns to the first one and finds it now
usable. Subsequent tries find a buffer immediately, of course, until we
have again filled shared_buffers with usage_count 1 everywhere. So the
problem is not so much the clock sweep overhead as that it's paid in a
very nonuniform fashion: with N buffers you pay O(N) once every N reads
and O(1) the rest of the time. This is no doubt slowing things down
enough to delay that one read, instead of leaving it nicely I/O bound
all the time. Mark, can you detect "hiccups" in the read rate using
your setup?

I seem to recall that we've previously discussed the idea of letting the
clock sweep decrement the usage_count before testing for 0, so that a
buffer could be reused on the first sweep after it was initially used,
but that we rejected it as being a bad idea. But at least with large
shared_buffers it doesn't sound like such a bad idea.

Another issue nearby to this is whether to avoid selecting buffers that
are dirty --- IIRC someone brought that up again recently. Maybe
predecrement for clean buffers, postdecrement for dirty ones would be a
cute compromise.
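
For concreteness, a rough sketch of a sweep loop doing that (hypothetical
code, not the real StrategyGetBuffer(); locking, pinning details and the
freelist are ignored):

/* Sketch of a clock sweep that predecrements usage_count for clean buffers
 * and postdecrements it for dirty ones.  Illustrative only. */
#include <stdbool.h>

typedef struct
{
    int  buf_id;
    int  refcount;          /* pin count */
    int  usage_count;
    bool dirty;
} SketchBufferDesc;

static int
sweep_for_victim(SketchBufferDesc *bufs, int nbuffers, int *hand)
{
    for (;;)                /* assumes at least one unpinned buffer exists */
    {
        SketchBufferDesc *buf = &bufs[*hand];

        *hand = (*hand + 1) % nbuffers;
        if (buf->refcount > 0)
            continue;                       /* pinned: never evict */

        if (!buf->dirty)
        {
            /* clean: decrement first, so a page a seqscan touched once
             * (usage_count == 1) can be reclaimed on the very next sweep */
            if (buf->usage_count > 0)
                buf->usage_count--;
            if (buf->usage_count == 0)
                return buf->buf_id;
        }
        else
        {
            /* dirty: test first, decrement after, buying the bgwriter one
             * extra sweep to write the page out before it is evicted */
            if (buf->usage_count == 0)
                return buf->buf_id;
            buf->usage_count--;
        }
    }
}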

regards, tom lane


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 18:44:51
Message-ID: C211A523.28788%llonergan@greenplum.com
Lists: pgsql-hackers

Here are four more points on the curve - I'd use a "Dirac delta function" for
your curve fit ;-)

Shared_buffers   Select Count   Vacuum
(KB)             (s)            (s)
=======================================
  248            5.52           2.46
  368            4.77           2.40
  552            5.82           2.40
  824            6.20           2.43
 1232            5.60           3.59
 1848            6.02           3.14
 2768            5.53           4.56
 5536            6.05           3.95
 8304            5.80           4.37
12456            5.86           4.12
18680            5.83           4.10
28016            6.11           4.46

WRT what you found on the selection algorithm, it might also explain the L2
effects I think.

I'm also still of the opinion that polluting the shared buffer cache for a
seq scan does not make sense.

- Luke

On 3/5/07 10:21 AM, "Luke Lonergan" <llonergan(at)greenplum(dot)com> wrote:

> Tom,
>
> On 3/5/07 8:53 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
>> water. If it were a cache effect then there should be a performance
>> cliff at the point where the cache size is exceeded. I see no such
>> cliff, in fact the middle part of the curve is darn near a straight
>> line on a log scale ...
>
> Here's that cliff you were looking for:
>
> Size of Orders table: 7178MB
> Blocksize: 8KB
>
> Shared_buffers Select Count Vacuum
> (KB) (s) (s)
> =======================================
> 248 5.52 2.46
> 368 4.77 2.40
> 552 5.82 2.40
> 824 6.20 2.43
> 1232 5.60 3.59
> 1848 6.02 3.14
> 2768 5.53 4.56
>
> All of these were run three times and the *lowest* time reported. Also, the
> behavior of "fast VACUUM after SELECT" begins abruptly at 1232KB of
> shared_buffers.
>
> These are Opterons with 2MB of L2 cache shared between two cores.
>
> - Luke


From: "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 18:46:15
Message-ID: 45EC6577.7050402@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
>
> Nope, Pavan's nailed it: the problem is that after using a buffer, the
> seqscan leaves it with usage_count = 1, which means it has to be passed
> over once by the clock sweep before it can be re-used. I was misled in
> the 32-buffer case because catalog accesses during startup had left the
> buffer state pretty confused, so that there was no long stretch before
> hitting something available. With a large number of buffers, the
> behavior is that the seqscan fills all of shared memory with buffers
> having usage_count 1. Once the clock sweep returns to the first of
> these buffers, it will have to pass over all of them, reducing all of
> their counts to 0, before it returns to the first one and finds it now
> usable. Subsequent tries find a buffer immediately, of course, until we
> have again filled shared_buffers with usage_count 1 everywhere. So the
> problem is not so much the clock sweep overhead as that it's paid in a
> very nonuniform fashion: with N buffers you pay O(N) once every N reads
> and O(1) the rest of the time. This is no doubt slowing things down
> enough to delay that one read, instead of leaving it nicely I/O bound
> all the time. Mark, can you detect "hiccups" in the read rate using
> your setup?
>
>

Cool. You posted the same analysis before I could hit the "send" button :)

I am wondering whether a seqscan would set the usage_count to 1 or to a
higher value. usage_count is incremented while unpinning the buffer. Even
if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
every time the seqscan returns a tuple? If that's the case, the overhead
would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea. But at least with large
> shared_buffers it doesn't sound like such a bad idea.
>
>
How about smaller value for BM_MAX_USAGE_COUNT ?

> Another issue nearby to this is whether to avoid selecting buffers that
> are dirty --- IIRC someone brought that up again recently. Maybe
> predecrement for clean buffers, postdecrement for dirty ones would be a
> cute compromise.
>
Can we use a 2-bit counter where the higher bit is set if the buffer is
dirty and the lower bit is set whenever the buffer is used? The clock-sweep
would then decrement this counter and choose a victim with counter value 0.
ISTM that we should optimize for the large shared buffer pool case, because
that will become more common in the coming days; RAM is getting cheaper
every day.
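
For illustration, here is a minimal standalone sketch of that victim-selection
rule (invented names and structures, not the actual bufmgr code; locking is
ignored). Dirty buffers take one extra pass of the hand to reach zero:

#define NBUFFERS 1024

/* Hypothetical per-buffer 2-bit counter: bit 1 = dirty, bit 0 = recently used */
static unsigned char buf_flags[NBUFFERS];
static int clock_hand = 0;

/* Callers set the low bit on access and the high bit when dirtying a page. */
void buffer_touched(int buf) { buf_flags[buf] |= 0x01; }
void buffer_dirtied(int buf) { buf_flags[buf] |= 0x02; }

/*
 * Clock sweep: decrement the 2-bit counter of the buffer under the hand and
 * take the first buffer whose counter has already reached 0.
 */
int
clocksweep_victim(void)
{
    for (;;)
    {
        int     victim = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;

        if (buf_flags[victim] == 0)
            return victim;          /* clean and not recently used */
        buf_flags[victim]--;        /* age it; dirty buffers survive longer */
    }
}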

Thanks,
Pavan


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 18:46:16
Message-ID: 200703051046.16281.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different
clock-sweep numbers, including ones with higher usage_count for indexes and
starting numbers of 0 for large seq scans as well as vacuums. However, we
didn't have any way to prove that any of these complex algorithms would
result in higher performance, so we went with the simplest formula, with the
idea of tinkering with it when we had more data. So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP
benchmarks, so we'd want to be careful not to gain performance in one area at
the expense of another. In TPCE testing, we've been able to increase
shared_buffers to 10GB with beneficial performance effect (numbers posted
when I have them) and even found that "taking over RAM" with the
shared_buffers (ala Oracle) gave us equivalent performance to using the FS
cache. (yes, this means with a little I/O management engineering we could
contemplate discarding use of the FS cache for a net performance gain. Maybe
for 8.4)

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 18:55:46
Message-ID: 20896.1173120946@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(at)enterprisedb(dot)com> writes:
> I am wondering whether a seqscan would set the usage_count to 1 or to a
> higher value. usage_count is incremented while unpinning the buffer. Even
> if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
> every time the seqscan returns a tuple? If that's the case, the overhead
> would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

No, it's only once per page. There's a good deal of PrivateRefCount
thrashing that goes on while examining the individual tuples, but the
shared state only changes when we leave the page, because we hold pin
continuously on the current page of a seqscan. If you don't believe
me, insert some debug printouts for yourself.

> How about smaller value for BM_MAX_USAGE_COUNT ?

This is not relevant to the problem: we are concerned about usage count
1 versus 0, not the other end of the range.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 19:02:32
Message-ID: 87lkibcy47.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea. But at least with large
> shared_buffers it doesn't sound like such a bad idea.

I seem to recall the classic clock sweep algorithm was to just use a single
bit. Either a buffer was touched recently or it wasn't.

I also vaguely recall a refinement involving keeping a bitmask and shifting it
right each time the clock hand comes around, so you have a precise history of
which recent clock sweeps the buffer was used in and which it wasn't.

I think the coarseness of not caring how heavily a buffer was used is a key
part of the algorithm. By caring only that it was used at least once during
the sweep, not whether it was used lightly or heavily, it avoids sticking with
incorrect deductions about things like sequential scans, even when multiple
sequential scans pass by. As soon as they stop touching the buffer, the
algorithm reacts immediately and discards it.
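
As a rough, purely illustrative sketch of that single-bit/bitmask scheme
(invented names, not PostgreSQL code): a reference sets the top bit of a
per-buffer history byte, each pass of the hand shifts the history right, and
any buffer whose history is empty is a candidate victim.

#include <stdint.h>

#define NBUFFERS 1024

static uint8_t ref_history[NBUFFERS];  /* bit 7 = referenced since last sweep */
static int     hand = 0;

/* Record that a buffer was touched. */
void
buffer_referenced(int buf)
{
    ref_history[buf] |= 0x80;
}

/*
 * One step of the clock hand: age the buffer's history by one sweep and
 * return it as a victim only if it has not been touched in the last eight
 * sweeps; otherwise return -1 and let the caller advance to the next buffer.
 */
int
sweep_step(void)
{
    int     buf = hand;

    hand = (hand + 1) % NBUFFERS;
    ref_history[buf] >>= 1;                 /* push history one sweep older */
    return (ref_history[buf] == 0) ? buf : -1;
}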

I would check out my OS book from school but it's on another continent :(

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 19:34:50
Message-ID: 1173123291.3760.2017.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
> Tom,
>
> > I seem to recall that we've previously discussed the idea of letting the
> > clock sweep decrement the usage_count before testing for 0, so that a
> > buffer could be reused on the first sweep after it was initially used,
> > but that we rejected it as being a bad idea. But at least with large
> > shared_buffers it doesn't sound like such a bad idea.

> Note, though, that the current algorithm is working very, very well for OLTP
> benchmarks, so we'd want to be careful not to gain performance in one area at
> the expense of another.

Agreed.

What we should also add to the analysis is that this effect only occurs
when a uniform workload such as SeqScan, VACUUM or COPY is the only thing
running. When you have lots of indexed access, the scan workloads don't
cause as much cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it
with a hint StrategyHintCyclicBufferReuse() in a similar manner to the
hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping
data within the L2 cache for DSS and bulk workloads.
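
A rough standalone sketch of such a ring (invented names, not the eventual
patch; the real code would consult the shared buffer descriptors and handle
pinning and locking):

#define RING_SIZE 32

typedef struct
{
    int     buf_id[RING_SIZE];  /* buffers owned by this scan, -1 = empty */
    int     current;            /* next slot to reuse */
} ScanRing;

/* Hypothetical bufmgr hooks. */
extern int  get_usage_count(int buf_id);
extern int  allocate_buffer_normally(void);

/*
 * Return a buffer for the next page of a SeqScan/VACUUM/COPY.  Reuse our own
 * previous buffer if nobody else has touched it (usage_count still <= 1);
 * otherwise leave it behind and grab a fresh buffer from the shared pool.
 */
int
ring_get_buffer(ScanRing *ring)
{
    int     slot = ring->current;
    int     buf = ring->buf_id[slot];

    ring->current = (slot + 1) % RING_SIZE;

    if (buf >= 0 && get_usage_count(buf) <= 1)
        return buf;                         /* cycle within the ring */

    buf = allocate_buffer_normally();       /* someone else wants the old one */
    ring->buf_id[slot] = buf;
    return buf;
}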

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 19:41:45
Message-ID: 21341.1173123705@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Itakgaki-san and I were discussing in January the idea of cache-looping,
> whereby a process begins to reuse its own buffers in a ring of ~32
> buffers. When we cycle back round, if usage_count==1 then we assume that
> we can reuse that buffer. This avoids cache swamping for read and write
> workloads, plus avoids too-frequent WAL writing for VACUUM.

> This would maintain the beneficial behaviour for OLTP,

Justify that claim. It sounds to me like this would act very nearly the
same as having shared_buffers == 32 ...

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 20:06:18
Message-ID: 1173125179.3760.2045.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > Itakgaki-san and I were discussing in January the idea of cache-looping,
> > whereby a process begins to reuse its own buffers in a ring of ~32
> > buffers. When we cycle back round, if usage_count==1 then we assume that
> > we can reuse that buffer. This avoids cache swamping for read and write
> > workloads, plus avoids too-frequent WAL writing for VACUUM.
>
> > This would maintain the beneficial behaviour for OLTP,
>
> Justify that claim. It sounds to me like this would act very nearly the
> same as having shared_buffers == 32 ...

Sure. We wouldn't set the hint for IndexScans or Inserts, only for
SeqScans, VACUUM and COPY.

So OLTP-only workloads would be entirely unaffected. In the presence of
a mixed workload the scan tasks would have only a limited effect on the
cache, maintaining performance for the response-time-critical tasks. So
it's an OLTP benefit because of cache protection and WAL-flush reduction
during VACUUM.

As we've seen, the scan tasks look like they'll go faster with this.

The assumption that we can reuse the buffer if usage_count<=1 seems
valid. If another user had requested the block, then the usage_count
would be > 1, unless the buffer has been used, unpinned and then a cycle
of the buffer cache had spun round, all within the time taken to process
32 blocks sequentially. We do have to reuse one of the buffers, so
cyclical reuse seems like a better bet most of the time than more
arbitrary block reuse, as we see in a larger cache.

Best way is to prove it though. Seems like not too much work to have a
private ring data structure when the hint is enabled. The extra
bookkeeping is easily going to be outweighed by the reduction in mem->L2
cache fetches. I'll do it tomorrow, if no other volunteers.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Luke Lonergan <LLonergan(at)greenplum(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 20:11:37
Message-ID: 1173125497.13722.315.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 03:51 -0500, Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.
>

I don't see why we should strictly limit sequential scans to a single
buffer. I assume you mean one buffer per scan, but that raises these
two questions:

(1) What happens when there are more seq scans than cold buffers
available?
(2) What happens when two sequential scans need the same page, do we
double-up?

Also, the first time we access any heap page of a big table, we are very
unsure whether we will access it again, regardless of whether it's part
of a seq scan or not.

In our current system of 4 LRU lists (depending on how many times a
buffer has been referenced), we could start "more likely" pages (e.g.
system catalogs, index pages) in a higher list, and heap reads from big
tables in the lowest possible list. Assuming, of course, that this has
any benefit (frequently accessed cache pages are likely to move up in
the lists very quickly anyway).

But I don't think we should eliminate caching of heap pages in big
tables altogether. A few buffers might be polluted during the scan,
but most of the time they will be replacing other low-priority pages
(perhaps from another seq scan) and probably be replaced again quickly.
If that's not happening, and it's polluting frequently-accessed pages, I
agree that's a bug.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 20:12:34
Message-ID: 1173125554.13722.318.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 11:10 +0200, Hannu Krosing wrote:
> > My proposal for a fix: ensure that when relations larger (much larger?)
> > than buffer cache are scanned, they are mapped to a single page in the
> > shared buffer cache.
>
> How will this approach play together with synchronized scan patches ?
>

Thanks for considering my patch in this discussion. I will test by
turning shared_buffers down as low as I can, and see if that makes a big
difference.

> Or should synchronized scan rely on systems cache only ?
>

I don't know what the performance impact of that will be; still good
compared to reading from disk, but I assume much more CPU time.

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 20:18:47
Message-ID: 21794.1173125927@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Best way is to prove it though. Seems like not too much work to have a
> private ring data structure when the hint is enabled. The extra
> bookkeeping is easily going to be outweighed by the reduction in mem->L2
> cache fetches. I'll do it tomorrow, if no other volunteers.

[ shrug... ] No one has yet proven to my satisfaction that L2 cache has
anything to do with this. The notion that you can read a new disk page
into a shared buffer and have that buffer still be live in the processor
cache is so obviously bogus that I think there must be some other effect
at work.

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Luke Lonergan <LLonergan(at)greenplum(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 20:19:51
Message-ID: 1173125991.13722.327.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 09:09 +0000, Heikki Linnakangas wrote:
> In fact, the pages that are left in the cache after the seqscan finishes
> would be useful for the next seqscan of the same table if we were smart
> enough to read those pages first. That'd make a big difference for
> seqscanning a table that's say 1.5x your RAM size. Hmm, I wonder if
> Jeff's sync seqscan patch addresses that.
>

Absolutely. I've got a parameter in my patch "sync_scan_offset" that
starts a seq scan N pages before the position of the last seq scan
running on that table (or a current seq scan if there's still a scan
going).

If the last scan is not still in progress, the pages are less likely to
be in cache. If the pages are in cache, great; if not, it doesn't matter
where we start anyway.

If the last scan is still in progress, those recently-read pages are
very likely to be in cache (shared buffers or OS buffer cache).
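
As a rough illustration of the start-point arithmetic only (simplified, with
invented names; the actual patch keeps per-table scan positions in shared
memory): start sync_scan_offset blocks behind the last reported position,
wrapping around the relation.

#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Choose the block at which a new scan of the relation should start:
 * sync_scan_offset blocks behind the position most recently reported by
 * another scan of the same table, wrapping around at block 0.
 */
BlockNumber
choose_start_block(BlockNumber last_reported_pos,
                   BlockNumber rel_nblocks,
                   BlockNumber sync_scan_offset)
{
    if (rel_nblocks == 0 || sync_scan_offset >= rel_nblocks)
        return 0;               /* margin is bigger than the whole table */

    if (last_reported_pos >= sync_scan_offset)
        return last_reported_pos - sync_scan_offset;
    return rel_nblocks - (sync_scan_offset - last_reported_pos);
}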

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 20:30:28
Message-ID: 21976.1173126628@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> Absolutely. I've got a parameter in my patch "sync_scan_offset" that
> starts a seq scan N pages before the position of the last seq scan
> running on that table (or a current seq scan if there's still a scan
> going).

Strikes me that expressing that parameter as a percentage of
shared_buffers might make it less in need of manual tuning ...

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 20:45:04
Message-ID: 1173127504.13722.335.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > Absolutely. I've got a parameter in my patch "sync_scan_offset" that
> > starts a seq scan N pages before the position of the last seq scan
> > running on that table (or a current seq scan if there's still a scan
> > going).
>
> Strikes me that expressing that parameter as a percentage of
> shared_buffers might make it less in need of manual tuning ...
>

The original patch was a percentage of effective_cache_size, because in
theory it may be helpful to have this parameter larger than shared
buffers. Synchronized Scanning can take advantage of the OS buffer cache
as well.

Someone convinced me to change it to be an independent variable.

I don't have a strong opinion, but now I have three different opinions:
(1) Independent parameter
(2) Percentage of shared_buffers
(3) Percentage of effective_cache_size

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 21:03:10
Message-ID: 22436.1173128590@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
>> Strikes me that expressing that parameter as a percentage of
>> shared_buffers might make it less in need of manual tuning ...

> The original patch was a percentage of effective_cache_size, because in
> theory it may be helpful to have this parameter larger than shared
> buffers. Synchronized Scanning can take advantage of OS buffer cache as
> well.

I didn't say you couldn't allow it to be more than 100% ;-). But basing
it on effective_cache_size strikes me as a bad idea because that parameter
is seldom better than a wild guess. shared_buffers at least means
something.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 21:03:48
Message-ID: 45EC85B4.2060009@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis wrote:
> On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
>> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>>> Absolutely. I've got a parameter in my patch "sync_scan_offset" that
>>> starts a seq scan N pages before the position of the last seq scan
>>> running on that table (or a current seq scan if there's still a scan
>>> going).
>> Strikes me that expressing that parameter as a percentage of
>> shared_buffers might make it less in need of manual tuning ...
>>
>
> The original patch was a percentage of effective_cache_size, because in
> theory it may be helpful to have this parameter larger than shared
> buffers. Synchronized Scanning can take advantage of OS buffer cache as
> well.
>
> Someone convinced me to change it to be an independent variable.
>
> I don't have a strong opinion, but now I have three different opinions:
> (1) Independent parameter
> (2) Percentage of shared_buffers
> (3) Percentage of effective_cache_size

Another approach I proposed back in December is to not have a variable
like that at all, but scan the buffer cache for pages belonging to the
table you're scanning to initialize the scan. Scanning all the
BufferDescs is a fairly CPU and lock heavy operation, but it might be ok
given that we're talking about large I/O bound sequential scans. It
would require no DBA tuning and would work more robustly in varying
conditions. I'm not sure where you would continue after scanning the
in-cache pages. At the highest in-cache block number, perhaps.
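
A rough sketch of what that initialization pass could look like (pseudo-C with
invented accessors rather than the real BufferDesc API; buffer-header locking
is glossed over):

#include <stdbool.h>

/* Hypothetical accessors into the shared buffer descriptors. */
extern int      num_shared_buffers(void);
extern bool     buffer_is_valid(int buf_id);
extern unsigned buffer_relation_oid(int buf_id);
extern unsigned buffer_block_number(int buf_id);

/*
 * Set a bit in the caller-supplied bitmap (sized to cover the relation)
 * for every block of the target relation currently in shared buffers,
 * and return the highest such block number so the scan knows where to
 * continue on disk afterwards.
 */
unsigned
collect_cached_blocks(unsigned target_rel_oid, unsigned char *bitmap)
{
    unsigned    highest = 0;
    int         i;

    for (i = 0; i < num_shared_buffers(); i++)
    {
        unsigned    blkno;

        if (!buffer_is_valid(i) || buffer_relation_oid(i) != target_rel_oid)
            continue;

        blkno = buffer_block_number(i);
        bitmap[blkno / 8] |= (unsigned char) (1 << (blkno % 8));
        if (blkno > highest)
            highest = blkno;
    }
    return highest;
}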

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavan Deolasee <pavan(at)enterprisedb(dot)com>, Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 21:28:21
Message-ID: 45EC8B75.9040806@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> So the
> problem is not so much the clock sweep overhead as that it's paid in a
> very nonuniform fashion: with N buffers you pay O(N) once every N reads
> and O(1) the rest of the time. This is no doubt slowing things down
> enough to delay that one read, instead of leaving it nicely I/O bound
> all the time. Mark, can you detect "hiccups" in the read rate using
> your setup?
>

I think so, here's the vmstat output for 400MB of shared_buffers during
the scan:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b  swpd   free  buff   cache    si  so     bi  bo    in    cs  us sy id wa
 1  0   764  51772     0 1990688     0   0 120422   2  1546  1755  16 37 46  1
 1  0   764  53640     0 1988792     0   0 120422   2  1544  1446  14 40 46  1
 1  0   788  54900     0 1987564     0   0 116746  15  1470  3067  15 39 44  2
 1  0   788  52800     0 1989552     0   0 119199  20  1488  2216  14 37 47  1
 1  0   788  52372     0 1990000     0   0 122880   7  1532  1203  15 39 45  1
 1  0   788  54592     0 1987872     0   5 124928   5  1557  1058  17 38 46  0
 2  0   788  54052     0 1987836     0   0 118787   0  1500  2469  16 36 47  1
 1  0   788  52552     0 1989892     0   0 120419   0  1506  2531  15 36 48  1
 1  0   788  53452     0 1989356     0   0 119195   2  1501  1698  15 37 47  1
 1  0   788  52680     0 1989796     0   0 120424   2  1521  1610  16 37 47  1

Cheers

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Pavan Deolasee <pavan(at)enterprisedb(dot)com>, Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 21:36:39
Message-ID: 22773.1173130599@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Tom Lane wrote:
>> Mark, can you detect "hiccups" in the read rate using
>> your setup?

> I think so, here's the vmstat output for 400MB of shared_buffers during
> the scan:

Hm, not really a smoking gun there. But just for grins, would you try
this patch and see if the numbers change?

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 785 bytes

From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavan Deolasee <pavan(at)enterprisedb(dot)com>, Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 22:07:51
Message-ID: 45EC94B7.3080303@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>
> Hm, not really a smoking gun there. But just for grins, would you try
> this patch and see if the numbers change?
>

Applied to 8.2.3 (don't have lineitem loaded in HEAD yet) - no change
that I can see:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b  swpd   free  buff   cache    si  so     bi  bo    in    cs  us sy id wa
 1  0   764  55300     0 1988032     0   0 118797  32  1532  1775  15 39 44  2
 1  0   764  54476     0 1989720     0   0 115507   9  1456  3970  15 39 45  1
 1  0   788  54540     0 1989592     0   0 121651   0  1508  3221  16 37 47  0
 1  0   788  52808     0 1991320     0   0 124109   0  1532  1236  15 38 46  0
 1  0   788  52504     0 1991784     0   0 124518   0  1547  1005  16 39 45  0
 2  0   788  54544     0 1989740     0   5 117965   5  1491  2012  15 36 47  2
 1  0   788  53596     0 1991184     0   0 120424   0  1504  1910  16 37 46  1

Elapsed time is exactly the same (101 s). Is it expected that HEAD would
behave differently?

Cheers

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Pavan Deolasee <pavan(at)enterprisedb(dot)com>, Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 22:19:58
Message-ID: 23504.1173133198@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Elapsed time is exactly the same (101 s). Is it expected that HEAD would
> behave differently?

Offhand I don't think so. But what I wanted to see was the curve of
elapsed time vs shared_buffers?

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 22:43:02
Message-ID: 1173134582.13722.385.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 21:03 +0000, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a variable
> like that at all, but scan the buffer cache for pages belonging to the
> table you're scanning to initialize the scan. Scanning all the
> BufferDescs is a fairly CPU and lock heavy operation, but it might be ok
> given that we're talking about large I/O bound sequential scans. It
> would require no DBA tuning and would work more robustly in varying
> conditions. I'm not sure where you would continue after scanning the
> in-cache pages. At the highest in-cache block number, perhaps.
>

I assume you're referring to this:

"each backend keeps a bitmap of pages it has processed during the scan,
and read the pages in the order they're available in cache."

which I think is a great idea. However, I was unable to devise a good
answer to all these questions at once:

* How do we attempt to maintain sequential reads on the underlying I/O
layer?

* My current implementation takes advantage of the OS buffer cache, how
could we maintain that advantage from PostgreSQL-specific cache logic?

* How do I test to see whether it actually helps in a realistic
scenario? It seems like it would help the most when scans are
progressing at different rates, but how often do people have CPU-bound
queries on tables that don't fit into physical memory (and how long
would it take for me to benchmark such a query)?

It seems like your idea is more analytical, and my current
implementation is more guesswork. I like the analytical approach, but I
don't know that we have enough information to pull it off because we're
missing what's in the OS buffer cache. The OS buffer cache is crucial to
Synchronized Scanning, because shared buffers are evicted based on a
more complex set of circumstances, whereas the OS buffer cache is
usually LRU and forms a nicer "cache trail" (upon which Synchronized
Scanning is largely based).

If you have some tests you'd like me to run, I'm planning to do some
benchmarks this week and next. I can see if my current patch holds up
under the scenarios you're worried about.

Regards,
Jeff Davis


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Pavan Deolasee <pavan(at)enterprisedb(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-05 23:54:15
Message-ID: 45ECADA7.1050605@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>>> Itakgaki-san and I were discussing in January the idea of cache-looping,
>>> whereby a process begins to reuse its own buffers in a ring of ~32
>>> buffers. When we cycle back round, if usage_count==1 then we assume that
>>> we can reuse that buffer. This avoids cache swamping for read and write
>>> workloads, plus avoids too-frequent WAL writing for VACUUM.
>>> This would maintain the beneficial behaviour for OLTP,
>> Justify that claim. It sounds to me like this would act very nearly the
>> same as having shared_buffers == 32 ...
>
> Sure. We wouldn't set the hint for IndexScans or Inserts, only for
> SeqScans, VACUUM and COPY.
>
> So OLTP-only workloads would be entirely unaffected. In the presence of
> a mixed workload the scan tasks would have only a limited effect on the
> cache, maintaining performance for the response time critical tasks. So
> its an OLTP benefit because of cache protection and WAL-flush reduction
> during VACUUM.
>
> As we've seen, the scan tasks look like they'll go faster with this.

But it would break the idea of letting a second seqscan follow in the
first's hot cache trail, no?

greetings, Florian Pflug


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavan Deolasee <pavan(at)enterprisedb(dot)com>, Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 00:06:46
Message-ID: 45ECB096.2050709@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
>
> But what I wanted to see was the curve of
> elapsed time vs shared_buffers?
>

Of course! (Let's just write that off to me being pre-coffee...)

With the patch applied:

Shared Buffers Elapsed vmstat IO rate
-------------- ------- --------------
400MB 101 s 122 MB/s
2MB 101 s
1MB 97 s
768KB 94 s
512KB 84 s
256KB 79 s
128KB 75 s 166 MB/s

Looks *very* similar.

Cheers

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Pavan Deolasee <pavan(at)enterprisedb(dot)com>, Gavin Sherry <swm(at)alcove(dot)com(dot)au>, Luke Lonergan <llonergan(at)greenplum(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 00:43:07
Message-ID: 24581.1173141787@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself. I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so). Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can. Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
count
----------
10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior. (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
samples| %|
------------------
141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26368 13.7984 /home/tgl/testversion/bin/postgres
12765 6.6799 /libata
2238 1.1711 /lib64/libc-2.4.so
1112 0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
samples| %|
------------------
113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26284 16.3070 /home/tgl/testversion/bin/postgres
12004 7.4475 /libata
2093 1.2985 /lib64/libc-2.4.so
996 0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples % symbol name
57779 40.9591 copy_user_generic
18175 12.8841 __delay
3994 2.8313 _raw_spin_lock
2388 1.6928 put_page
2184 1.5482 mwait_idle
2083 1.4766 _raw_write_unlock
1909 1.3533 _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples % symbol name
37437 33.0783 copy_user_generic
17891 15.8080 __delay
3372 2.9794 _raw_spin_lock
2218 1.9598 mwait_idle
2067 1.8263 _raw_write_unlock
1837 1.6231 _raw_write_lock
1531 1.3527 put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache. This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.
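
For what it's worth, the effect can be probed crudely from user space with
something like the sketch below (illustrative only: the numbers will vary by
hardware and compiler, and say nothing definitive about copy_user_generic
itself). It copies the same 8K source repeatedly into one destination that
stays cache-resident, and then into a destination that strides through a
region far larger than L2:

/* gcc -O2 -o copybench copybench.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CHUNK   8192                    /* one "page" per copy */
#define BIG     (256 * 1024 * 1024)     /* much larger than any L2 cache */
#define ITERS   (BIG / CHUNK)

static double
now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int
main(void)
{
    char       *src = malloc(CHUNK);
    char       *hot = malloc(CHUNK);
    char       *cold = malloc(BIG);
    double      t0, t1, t2;
    long        i;
    volatile char sink;

    if (!src || !hot || !cold)
        return 1;
    memset(src, 'x', CHUNK);
    memset(cold, 0, BIG);               /* fault the pages in up front */

    t0 = now_sec();
    for (i = 0; i < ITERS; i++)         /* destination stays in cache */
        memcpy(hot, src, CHUNK);
    t1 = now_sec();
    for (i = 0; i < ITERS; i++)         /* destination line is never reused */
        memcpy(cold + i * CHUNK, src, CHUNK);
    t2 = now_sec();

    sink = hot[0] + cold[BIG - 1];      /* keep the stores live */
    (void) sink;
    printf("hot dest:  %.3f s\ncold dest: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}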

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior? A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist. In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited. But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage. In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.

So I now concede Luke's argument that this behavior is related to L2
cache usage. But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM. I'm very worried about
pessimizing other cases if we do. ISTM there's a fairly clear case that
this might be fixable at the kernel level. Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway. Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 01:31:46
Message-ID: 874pozcg3h.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I don't see any good reason why overwriting a whole cache line oughtn't be
> the same speed either way.

I can think of a couple of theories, but I don't know if they're reasonable.
The one that comes to mind is the inter-processor cache coherency protocol:
when writing to a cache line the processor already owns, maybe it can skip
having to check whether another processor owns that cache line?

What happens if VACUUM comes across buffers that *are* already in the buffer
cache? Does it throw those on the freelist too? That seems like it would be
dangerous if they were in the buffer cache for a reason.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 02:39:27
Message-ID: 25465.1173148767@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> What happens if VACUUM comes across buffers that *are* already in the buffer
> cache. Does it throw those on the freelist too?

Not unless they have usage_count 0, in which case they'd be subject to
recycling by the next clock sweep anyway.

regards, tom lane


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 04:02:27
Message-ID: 39501601-F6B1-4EA4-8B4B-FA1FB73A657F@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a
> variable like that at all, but scan the buffer cache for pages
> belonging to the table you're scanning to initialize the scan.
> Scanning all the BufferDescs is a fairly CPU and lock heavy
> operation, but it might be ok given that we're talking about large
> I/O bound sequential scans. It would require no DBA tuning and
> would work more robustly in varying conditions. I'm not sure where
> you would continue after scanning the in-cache pages. At the
> highest in-cache block number, perhaps.

If there was some way to do that, it'd be what I'd vote for.

Given the partitioning of the buffer lock that Tom did it might not
be that horrible for many cases, either, since you'd only need to
scan through one partition.

We also don't need an exact count, either. Perhaps there's some way
we could keep a counter or something...
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 04:11:14
Message-ID: 1D354595-03AE-49D7-980D-9A87988416E4@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote:
> Tom,
>
>> I seem to recall that we've previously discussed the idea of letting
>> the clock sweep decrement the usage_count before testing for 0, so
>> that a buffer could be reused on the first sweep after it was
>> initially used, but that we rejected it as being a bad idea. But at
>> least with large shared_buffers it doesn't sound like such a bad idea.
>
> We did discuss a number of formulas for setting buffers with different
> clock-sweep numbers, including ones with higher usage_count for indexes
> and starting numbers of 0 for large seq scans as well as vacuums.
> However, we didn't have any way to prove that any of these complex
> algorithms would result in higher performance, so we went with the
> simplest formula, with the idea of tinkering with it when we had more
> data. So maybe now's the time.
>
> Note, though, that the current algorithm is working very, very well for
> OLTP benchmarks, so we'd want to be careful not to gain performance in
> one area at the expense of another. In TPCE testing, we've been able to
> increase shared_buffers to 10GB with beneficial performance effect
> (numbers posted when I have them) and even found that "taking over RAM"
> with the shared_buffers (ala Oracle) gave us equivalent performance to
> using the FS cache. (yes, this means with a little I/O management
> engineering we could contemplate discarding use of the FS cache for a
> net performance gain. Maybe for 8.4)

An idea I've been thinking about would be to have the bgwriter or
some other background process actually try and keep the free list
populated, so that backends needing to grab a page would be much more
likely to find one there (and not have to wait to scan through the
entire buffer pool, perhaps multiple times).

My thought is to keep track of how many page requests occurred during
a given interval, and use that value (probably averaged over time) to
determine how many pages we'd like to see on the free list. The
background process would then run through the buffers decrementing
usage counts until it found enough for the free list. Before putting
a buffer on the 'free list', it would write the buffer out; I'm not
sure if it would make sense to de-associate the buffer with whatever
it had been storing or not, though. If we don't do that, that would
mean that we could pull pages back off the free list if we wanted to.
That would be helpful if the background process got a bit over-zealous.
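
A back-of-the-envelope sketch of what that background loop might look like
(invented hooks, no locking, and it assumes the target is well below the
total number of buffers):

#include <stdbool.h>

/* Hypothetical hooks into the buffer manager. */
extern int   num_shared_buffers(void);
extern int   get_usage_count(int buf_id);
extern void  decrement_usage_count(int buf_id);
extern bool  buffer_is_dirty(int buf_id);
extern void  write_buffer(int buf_id);
extern void  push_freelist(int buf_id);
extern int   freelist_length(void);

static int  sweep_pos = 0;

/*
 * Run periodically by a background process: keep roughly 'target' buffers
 * on the free list, writing dirty ones out before listing them so that
 * backends can grab a clean buffer without doing the I/O themselves.
 */
void
refill_freelist(int target)
{
    while (freelist_length() < target)
    {
        int     buf = sweep_pos;

        sweep_pos = (sweep_pos + 1) % num_shared_buffers();

        if (get_usage_count(buf) > 0)
        {
            decrement_usage_count(buf);
            continue;
        }
        if (buffer_is_dirty(buf))
            write_buffer(buf);
        push_freelist(buf);
    }
}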
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <decibel(at)decibel(dot)org>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 07:17:46
Message-ID: 28102.1173165466@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim Nasby <decibel(at)decibel(dot)org> writes:
> An idea I've been thinking about would be to have the bgwriter or
> some other background process actually try and keep the free list
> populated,

The bgwriter already tries to keep pages "just in front" of the clock
sweep pointer clean.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 08:14:54
Message-ID: 1173168895.3760.2054.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:

> But it would break the idea of letting a second seqscan follow in the
> first's hot cache trail, no?

No, but it would make it somewhat harder to achieve without direct
synchronization between scans. It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the
buffer strategy though. We really want both, not just one or the other.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Jim Nasby <decibel(at)decibel(dot)org>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 17:56:03
Message-ID: 1173203763.13722.414.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > Another approach I proposed back in December is to not have a
> > variable like that at all, but scan the buffer cache for pages
> > belonging to the table you're scanning to initialize the scan.
> > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > operation, but it might be ok given that we're talking about large
> > I/O bound sequential scans. It would require no DBA tuning and
> > would work more robustly in varying conditions. I'm not sure where
> > you would continue after scanning the in-cache pages. At the
> > highest in-cache block number, perhaps.
>
> If there was some way to do that, it'd be what I'd vote for.
>

I still don't know how to make this take advantage of the OS buffer
cache.

However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?

> Given the partitioning of the buffer lock that Tom did it might not
> be that horrible for many cases, either, since you'd only need to
> scan through one partition.
>
> We also don't need an exact count, either. Perhaps there's some way
> we could keep a counter or something...

Exact count of what? The pages already in cache?

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jim Nasby <decibel(at)decibel(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 17:59:32
Message-ID: 6492.1173203972@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go.

I think that's a good way to guarantee that you'll not finish in time
for 8.3. Heikki's idea is just at the handwaving stage at this point,
and I'm not even convinced that it will offer any win. (Pages in
cache will be picked up by a seqscan already.)

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jim Nasby <decibel(at)decibel(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 18:09:42
Message-ID: 1173204582.13722.418.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > If I were to implement this idea, I think Heikki's bitmap of pages
> > already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)
>

I agree that it's a good idea to stick with the current implementation,
which, as far as I can see, is meeting all of my performance goals.

Regards,
Jeff Davis


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jim Nasby <decibel(at)decibel(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 18:29:17
Message-ID: 45EDB2FD.4070705@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis wrote:
> On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
>> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
>>> Another approach I proposed back in December is to not have a
>>> variable like that at all, but scan the buffer cache for pages
>>> belonging to the table you're scanning to initialize the scan.
>>> Scanning all the BufferDescs is a fairly CPU and lock heavy
>>> operation, but it might be ok given that we're talking about large
>>> I/O bound sequential scans. It would require no DBA tuning and
>>> would work more robustly in varying conditions. I'm not sure where
>>> you would continue after scanning the in-cache pages. At the
>>> highest in-cache block number, perhaps.
>> If there was some way to do that, it'd be what I'd vote for.
>>
>
> I still don't know how to make this take advantage of the OS buffer
> cache.

Yep, I don't see any way to do that. I think we could live with that,
though. If we went with the sync_scan_offset approach, you'd have to
leave a lot of safety margin in that as well.

> However, no DBA tuning is a huge advantage, I agree with that.
>
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go. Can you guys give me some pointers about
> how to walk through the shared buffers, reading the pages that I need,
> while being sure not to read a page that's been evicted, and also not
> potentially causing a performance regression somewhere else?

You could take a look at BufferSync, for example. It walks through the
buffer cache, syncing all dirty buffers.

FWIW, I've attached a function I wrote some time ago when I was playing
with the same idea for vacuums. A call to the new function loops through
the buffer cache and returns the next buffer that belongs to a certain
relation. I'm not sure that it's correct and safe, and there aren't many
comments, but it should work if you want to play with it...
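
To give the flavor in-line (this is not the attached patch; the field and
macro names below are assumptions based on the 8.2-era buf_internals.h and
should be checked against the real headers):

/*
 * Schematic only: walk the shared buffer pool and report which blocks of
 * the given relation are currently cached.  Holding the buffer header
 * spinlock keeps the tag from changing while we look at it, but a buffer
 * can of course still be evicted right after we move on.
 */
static void
report_cached_blocks(RelFileNode rnode)
{
    int     i;

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = &BufferDescriptors[i];

        LockBufHdr(bufHdr);

        if ((bufHdr->flags & BM_VALID) &&
            RelFileNodeEquals(bufHdr->tag.rnode, rnode))
            elog(DEBUG1, "block %u of relation is in shared buffers",
                 bufHdr->tag.blockNum);

        UnlockBufHdr(bufHdr);
    }
}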

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
readanybuffer.patch text/x-patch 1.4 KB

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 18:47:35
Message-ID: 45EDB747.90003@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
>> If I were to implement this idea, I think Heikki's bitmap of pages
>> already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)

The scenario that I'm worried about is that you have a table that's
slightly larger than RAM. If you issue many seqscans on that table, one
at a time, every seqscan will have to read the whole table from disk,
even though say 90% of it is in cache when the scan starts.

This can be alleviated by using a large enough sync_scan_offset, but a
single setting like that is tricky to tune, especially if your workload
is not completely constant. Tune it too low, and you don't get much
benefit, tune it too high and your scans diverge and you lose all benefit.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-07 00:28:03
Message-ID: 1173227283.13722.492.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-06 at 18:47 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> >> If I were to implement this idea, I think Heikki's bitmap of pages
> >> already read is the way to go.
> >
> > I think that's a good way to guarantee that you'll not finish in time
> > for 8.3. Heikki's idea is just at the handwaving stage at this point,
> > and I'm not even convinced that it will offer any win. (Pages in
> > cache will be picked up by a seqscan already.)
>
> The scenario that I'm worried about is that you have a table that's
> slightly larger than RAM. If you issue many seqscans on that table, one
> at a time, every seqscan will have to read the whole table from disk,
> even though say 90% of it is in cache when the scan starts.
>

If you're issuing sequential scans one at a time, that 90% of the table
that was cached is probably not cached any more, unless the scans are
close together in time without overlapping (serial sequential scans).
And the problem you describe is no worse than current behavior, where
you have exactly the same problem.

> This can be alleviated by using a large enough sync_scan_offset, but a
> single setting like that is tricky to tune, especially if your workload
> is not completely constant. Tune it too low, and you don't get much
> benefit, tune it too high and your scans diverge and you lose all benefit.
>

I see why you don't want to manually tune this setting; however, it's
really not that tricky. You can be quite conservative and still use a
good fraction of your physical memory. I will come up with some numbers
and see how much we have to gain.

Regards,
Jeff Davis


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-07 00:38:44
Message-ID: AF45B115-9BF5-47F6-8C05-B2459EE84D4F@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:
> Jim Nasby <decibel(at)decibel(dot)org> writes:
>> An idea I've been thinking about would be to have the bgwriter or
>> some other background process actually try and keep the free list
>> populated,
>
> The bgwriter already tries to keep pages "just in front" of the clock
> sweep pointer clean.

True, but that still means that each backend has to run the clock
sweep. AFAICT that's something that backends will serialize on (due
to BufFreelistLock), so it would be best to make StrategyGetBuffer as
fast as possible. It certainly seems like grabbing a buffer off the
free list is going to be a lot faster than running the clock sweep.
That's why I think it'd be better to have the bgwriter run the clock
sweep and put enough buffers on the free list to try and keep up with
demand.
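
Schematically (hypothetical names only -- this is not the real freelist.c),
the two paths look something like the sketch below; the free-list grab is a
constant-time pop, while the sweep may have to touch many buffers before it
finds a victim:

/*
 * Illustrative sketch, not PostgreSQL source.  A backend first tries the
 * free list; only when that is empty does it advance the clock sweep,
 * decrementing usage counts until it finds an unpinned, cold buffer.
 */
static int
get_victim_buffer(SweepState *state)       /* hypothetical state struct */
{
    if (state->firstFreeBuffer >= 0)        /* cheap path: free list */
    {
        int     buf = state->firstFreeBuffer;

        state->firstFreeBuffer = state->freeNext[buf];
        return buf;
    }

    for (;;)                                /* expensive path: clock sweep */
    {
        int     buf = state->nextVictimBuffer;

        state->nextVictimBuffer = (buf + 1) % state->nBuffers;

        if (state->usageCount[buf] > 0)
            state->usageCount[buf]--;       /* spare it for another lap */
        else if (state->refCount[buf] == 0)
            return buf;                     /* unpinned and cold: evict it */
    }
}
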
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-07 00:43:43
Message-ID: 4E324E75-2AF9-4F2F-A9A9-88248AEF205A@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
>> We also don't need an exact count, either. Perhaps there's some way
>> we could keep a counter or something...
>
> Exact count of what? The pages already in cache?

Yes. The idea being if you see there's 10k pages in cache, you can
likely start 9k pages behind the current scan point and still pick
everything up.

But this is nowhere near as useful as the bitmap idea, so I'd only
look at it if it's impossible to make the bitmaps work. And like
others have said, that should wait until there's at least a first-
generation patch that's going to make it into 8.3.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Jim Nasby <decibel(at)decibel(dot)org>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-07 02:10:17
Message-ID: 1173233417.13722.544.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:
> On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
> >> We also don't need an exact count, either. Perhaps there's some way
> >> we could keep a counter or something...
> >
> > Exact count of what? The pages already in cache?
>
> Yes. The idea being if you see there's 10k pages in cache, you can
> likely start 9k pages behind the current scan point and still pick
> everything up.
>
> But this is nowhere near as useful as the bitmap idea, so I'd only
> look at it if it's impossible to make the bitmaps work. And like
> others have said, that should wait until there's at least a first-
> generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache
with the bitmap idea. What makes you think that my current
implementation is "nowhere near as useful as the bitmap idea"?

My current implementation is making use of OS buffers + shared memory;
the bitmap idea can only make use of shared memory, and is likely
throwing the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that
there's a contiguous cache trail in the shared buffers alone. Any
deviation from that -- which could be caused by PG's page replacement
algorithm, especially in combination with a varied load pattern -- would
negate any benefit from the bitmap idea. I feel much more confident that
there will exist a trail of pages that are cached in *either* the PG
shared buffers *or* the OS buffer cache. There may be holes/gaps in
either one, but it's much more likely that they combine into a
contiguous series of cached pages. Do you have an idea how I might test
this claim?

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jim Nasby <decibel(at)decibel(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-07 02:28:30
Message-ID: 1173234511.13722.560.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> Jeff Davis wrote:
> > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> >>> Another approach I proposed back in December is to not have a
> >>> variable like that at all, but scan the buffer cache for pages
> >>> belonging to the table you're scanning to initialize the scan.
> >>> Scanning all the BufferDescs is a fairly CPU and lock heavy
> >>> operation, but it might be ok given that we're talking about large
> >>> I/O bound sequential scans. It would require no DBA tuning and
> >>> would work more robustly in varying conditions. I'm not sure where
> >>> you would continue after scanning the in-cache pages. At the
> >>> highest in-cache block number, perhaps.
> >> If there was some way to do that, it'd be what I'd vote for.
> >>
> >
> > I still don't know how to make this take advantage of the OS buffer
> > cache.
>
> Yep, I don't see any way to do that. I think we could live with that,
> though. If we went with the sync_scan_offset approach, you'd have to
> leave a lot of safety margin in that as well.
>

Right, there would certainly have to be a safety margin with
sync_scan_offset. However, your plan only works when the shared buffers
are dominated by this sequential scan. Let's say you have 40% of
physical memory for shared buffers, and say that 50% are being used for
hot pages in other parts of the database. That means you have access to
only 20% of physical memory to optimize for this sequential scan, and
20% of the physical memory is basically unavailable (being used for
other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0
(meaning 1.0 x shared_buffers), giving you 40% of physical memory that
would be used for starting this sequential scan. In this case, that
should be a good margin of error, considering that as much as 80% of the
physical memory might actually be in cache (OS or PG cache).

This all needs to be backed up by testing, of course. I'm just
extrapolating some numbers that look vaguely reasonable to me.

Regards,
Jeff Davis


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Luke Lonergan <LLonergan(at)greenplum(dot)com>, Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Doug Rady <drady(at)greenplum(dot)com>, Sherry Moore <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-07 07:22:04
Message-ID: 1173252124.3708.30.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-06 at 18:28, Jeff Davis wrote:
> On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> > Jeff Davis wrote:
> > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > >>> Another approach I proposed back in December is to not have a
> > >>> variable like that at all, but scan the buffer cache for pages
> > >>> belonging to the table you're scanning to initialize the scan.
> > >>> Scanning all the BufferDescs is a fairly CPU and lock heavy
> > >>> operation, but it might be ok given that we're talking about large
> > >>> I/O bound sequential scans. It would require no DBA tuning and
> > >>> would work more robustly in varying conditions. I'm not sure where
> > >>> you would continue after scanning the in-cache pages. At the
> > >>> highest in-cache block number, perhaps.
> > >> If there was some way to do that, it'd be what I'd vote for.
> > >>
> > >
> > > I still don't know how to make this take advantage of the OS buffer
> > > cache.

Maybe it should not?

Mostly, the OS cache is useful only if it is much bigger than the shared
buffer cache. It may make sense to forget about the OS cache and just tell
those who can make use of sync scans to set most of their memory aside for
shared buffers.

Then we can make better predictions/lookups of how much of a table is
actually in memory.

Dual caching is usually not very beneficial anyway, not to mention the
difficulty of predicting any dual-caching effects.

> > Yep, I don't see any way to do that. I think we could live with that,
> > though. If we went with the sync_scan_offset approach, you'd have to
> > leave a lot of safety margin in that as well.
> >
>
> Right, there would certainly have to be a safety margin with
> sync_scan_offset. However, your plan only works when the shared buffers
> are dominated by this sequential scan. Let's say you have 40% of
> physical memory for shared buffers, and say that 50% are being used for
> hot pages in other parts of the database. That means you have access to
> only 20% of physical memory to optimize for this sequential scan, and
> 20% of the physical memory is basically unavailable (being used for
> other parts of the database).

The simplest thing, in case the table is much bigger than the buffer cache
usable for it, is to start the second scan at the point the first scan is
traversing *now*, and hope that the scans will stay together. Or start at
some fixed lag, which makes the first scan always the one issuing reads
while the second just free-rides on buffers already in cache. It may even
be a good idea to throttle the second scan to stay N pages behind, if the
OS readahead gets confused when the same file is read by multiple
processes.

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order, starting from near the point
where pages are still in the shared buffer cache, seems good mostly for the
case where the table is as big as, or just a little larger than, the cache.

> In my current implementation, you could set sync_scan_offset to 1.0
> (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
> would be used for starting this sequential scan. In this case, that
> should be a good margin of error, considering that as much as 80% of the
> physical memory might actually be in cache (OS or PG cache).
>
> This all needs to be backed up by testing, of course. I'm just
> extrapolating some numbers that look vaguely reasonable to me.

If there is an easy way to tell PG "give me this page only if it is in
shared cache already", then a good approach might be to start the 2nd scan
at the point where the 1st is now, and move in both directions
simultaneously, like this:

First scan is at page N.

Second scan:

M = N-1
WHILE NOT ALL PAGES ARE READ:
    IF PAGE N IS IN CACHE:                  -- FOLLOW FIRST READER
        READ PAGE N
        N++
    ELSE IF M >= 0 AND PAGE M IS IN CACHE:  -- READ OLDER CACHED PAGES
        READ PAGE M
        M--
    ELSE IF FIRST READER STILL GOING:       -- NO OLDER PAGES, WAIT FOR 1st
        WAIT FOR PAGE N TO BECOME AVAILABLE
        READ PAGE N
        N++
    ELSE:                                   -- BECOME 1st READER
        READ PAGE N
        N++
    PROCESS PAGE
    --
    IF N > PAGES_IN_TABLE: N = 0
    IF M < 0: M = PAGES_IN_TABLE

This should work reasonably well for LRU caches, and it may be made to
work with the clock sweep scheme if the sweep arranges the pages to purge
in file order.

If we could make the IF PAGE x IS IN CACHE part also know about the OS
cache, this could make use of the OS cache as well.

Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems ?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: "Marko Kreen" <markokr(at)gmail(dot)com>
To: "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Jim Nasby" <decibel(at)decibel(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "Grzegorz Jaskiewicz" <gj(at)pointblue(dot)com(dot)pl>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-07 08:49:09
Message-ID: e51f66da0703070049v25d6fbf0ge56ff12a77ec7451@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/7/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
> Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
> systems ?

Supposedly you could mmap() a file and then do mincore() on the
area to see which pages are cached.
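
For example, something along these lines (assuming the Linux/BSD mincore()
interface; the file path is just a placeholder):

/*
 * Sketch: map a file read-only and ask the kernel which of its pages are
 * resident in the OS page cache.  Assumes POSIX mmap() plus the Linux/BSD
 * mincore() interface; error handling kept minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/tmp/some.segment"; /* placeholder */
    int         fd = open(path, O_RDONLY);
    struct stat st;
    long        pagesize = sysconf(_SC_PAGESIZE);
    size_t      npages, i, resident = 0;
    void       *addr;
    unsigned char *vec;

    if (fd < 0 || fstat(fd, &st) < 0)
    {
        perror(path);
        return 1;
    }

    npages = (st.st_size + pagesize - 1) / pagesize;
    addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    vec = malloc(npages);
    if (addr == MAP_FAILED || vec == NULL ||
        mincore(addr, st.st_size, vec) != 0)
    {
        perror("mincore");
        return 1;
    }

    for (i = 0; i < npages; i++)
        if (vec[i] & 1)                 /* low bit set => page is resident */
            resident++;

    printf("%zu of %zu pages resident in OS cache\n", resident, npages);
    return 0;
}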

But you were talking about the Postgres cache before; there it should
be easily implementable.

--
marko