Re: 2nd Level Buffer Cache

From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 2nd Level Buffer Cache
Date: 2011-03-17 19:47:03
Message-ID: 201103172047.03556.rsmogura@softperience.eu
Lists: pgsql-hackers

Hi,

I have implemented an initial concept of a 2nd level cache. The idea is to
keep some segments of shared memory reserved for special buffers (e.g.
indexes) so that other operations cannot evict them. I added this
functionality to the nbtree index scan.

I tested this by doing an index scan, a sequential read, dropping the system
buffers, and then doing the index scan again. In a few places I saw
performance improvements, but I'm not sure whether this was just "random" or
a real improvement.

There are a few places left to optimize, and the patch still needs a lot of
work, but could you take a look and give your opinions?

Regards,
Radek

Attachment Content-Type Size
2nd_lvl_cache.diff text/x-patch 28.1 KB

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>, Rados*aw Smogura <rsmogura(at)softperience(dot)eu>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-17 21:02:18
Message-ID: 4D82308A020000250003BA50@gw.wicourts.gov
Lists: pgsql-hackers

Radosław Smogura <rsmogura(at)softperience(dot)eu> wrote:

> I have implemented an initial concept of a 2nd level cache. The idea
> is to keep some segments of shared memory reserved for special
> buffers (e.g. indexes) so that other operations cannot evict them. I
> added this functionality to the nbtree index scan.
>
> I tested this by doing an index scan, a sequential read, dropping the
> system buffers, and then doing the index scan again. In a few places
> I saw performance improvements, but I'm not sure whether this was
> just "random" or a real improvement.

I've often wondered about this. In a database I developed back in
the '80s it was clearly a win to have a special cache for index
entries and other special pages closer to the database than the
general cache. A couple things have changed since the '80s (I mean,
besides my waistline and hair color), and PostgreSQL has many
differences from that other database, so I haven't been sure it
would help as much, but I have wondered.

I can't really look at this for a couple weeks, but I'm definitely
interested. I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.

https://commitfest.postgresql.org/action/commitfest_view/open

> There are a few places left to optimize, and the patch still needs a
> lot of work, but could you take a look and give your opinions?

For something like this it makes perfect sense to show "proof of
concept" before trying to cover everything.

-Kevin


From: rsmogura <rsmogura(at)softperience(dot)eu>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 14:57:48
Message-ID: ef7df947a33eaa145f94eabe430f7074@mail.softperience.eu
Lists: pgsql-hackers

On Thu, 17 Mar 2011 16:02:18 -0500, Kevin Grittner wrote:
> Radosław Smogura <rsmogura(at)softperience(dot)eu> wrote:
>
>> I have implemented an initial concept of a 2nd level cache. The idea
>> is to keep some segments of shared memory reserved for special
>> buffers (e.g. indexes) so that other operations cannot evict them. I
>> added this functionality to the nbtree index scan.
>>
>> I tested this by doing an index scan, a sequential read, dropping the
>> system buffers, and then doing the index scan again. In a few places
>> I saw performance improvements, but I'm not sure whether this was
>> just "random" or a real improvement.
>
> I've often wondered about this. In a database I developed back in
> the '80s it was clearly a win to have a special cache for index
> entries and other special pages closer to the database than the
> general cache. A couple things have changed since the '80s (I mean,
> besides my waistline and hair color), and PostgreSQL has many
> differences from that other database, so I haven't been sure it
> would help as much, but I have wondered.
>
> I can't really look at this for a couple weeks, but I'm definitely
> interested. I suggest that you add this to the next CommitFest as a
> WIP patch, under the Performance category.
>
> https://commitfest.postgresql.org/action/commitfest_view/open
>
>> There are a few places left to optimize, and the patch still needs a
>> lot of work, but could you take a look and give your opinions?
>
> For something like this it makes perfect sense to show "proof of
> concept" before trying to cover everything.
>
> -Kevin

Yes, there is some change, and I looked at this more carefully, because my
performance results weren't what I expected. I found that PG uses a
BufferAccessStrategy for sequential scans, so my test query took only 32
buffers from the pool and didn't overwrite the index pool very much. This
BAS is really surprising. In any case, when I finish polishing I will send
a proper patch, with proof.

Actually, the idea of this patch was this: some operations require many
buffers, and PG uses the "clock sweep" to get the next free buffer, so it
may evict index buffers. From the point of view of good database design we
should be using indexes, so purging an index out of the cache will hurt
performance.
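
In rough, illustrative C (this is only the shape of the idea, not the
actual patch code; names and sizes are made up):

#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative sketch only: split the pool into a default region and a
 * region reserved for index pages, each swept by its own clock hand,
 * so ordinary reads cannot evict buffers living in the reserved region.
 */
#define NBUFFERS        1024
#define INDEX_POOL_SIZE 128

typedef struct { int usage_count; } BufDesc;

static BufDesc buffers[NBUFFERS];
static int index_hand = 0;                  /* sweeps [0, INDEX_POOL_SIZE)   */
static int default_hand = INDEX_POOL_SIZE;  /* sweeps [INDEX_POOL_SIZE, end) */

static int
sweep_region(int *hand, int lo, int hi)
{
    for (;;)
    {
        BufDesc *buf = &buffers[*hand];

        if (++(*hand) >= hi)
            *hand = lo;
        if (buf->usage_count == 0)
            return (int) (buf - buffers);   /* found a victim */
        buf->usage_count--;
    }
}

static int
get_victim_buffer(bool for_index_page)
{
    return for_index_page
        ? sweep_region(&index_hand, 0, INDEX_POOL_SIZE)
        : sweep_region(&default_hand, INDEX_POOL_SIZE, NBUFFERS);
}

int
main(void)
{
    printf("index victim: %d, heap victim: %d\n",
           get_victim_buffer(true), get_victim_buffer(false));
    return 0;
}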

As a side effect I saw that this 2nd level keeps the pg_* indexes in memory
too, so I am thinking of including a 3rd level cache for some pg_* tables.

Regards,
Radek


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "rsmogura" <rsmogura(at)softperience(dot)eu>
Cc: "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 15:14:47
Message-ID: 4D833097020000250003BAA3@gw.wicourts.gov
Lists: pgsql-hackers

rsmogura <rsmogura(at)softperience(dot)eu> wrote:

> Yes, there is some change, and I looked at this more carefully,
> because my performance results weren't what I expected. I found that
> PG uses a BufferAccessStrategy for sequential scans, so my test query
> took only 32 buffers from the pool and didn't overwrite the index
> pool very much. This BAS is really surprising. In any case, when I
> finish polishing I will send a proper patch, with proof.

Yeah, that heuristic makes this less critical, for sure.

> Actually, the idea of this patch was this: some operations require
> many buffers, and PG uses the "clock sweep" to get the next free
> buffer, so it may evict index buffers. From the point of view of good
> database design we should be using indexes, so purging an index out
> of the cache will hurt performance.
>
> As a side effect I saw that this 2nd level keeps the pg_* indexes in
> memory too, so I am thinking of including a 3rd level cache for some
> pg_* tables.

Well, the more complex you make it the more overhead there is, which
makes it harder to come out ahead. FWIW, in musing about it (as
recently as this week), my idea was to add another field which would
factor into the clock sweep calculations. For indexes, it might be
"levels above leaf pages". I haven't reviewed the code in depth to
know how to use it, this was just idle daydreaming based on that
prior experience. It's far from certain that the concept will
actually prove beneficial in PostgreSQL.
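
Just to illustrate where such a field could plug in (idle sketch only; the
btree_level field and the weighting are invented here, I haven't checked
any of this against the real buffer manager):

#include <stdio.h>

/*
 * Sketch: let a per-buffer "btree_level" (0 = leaf, higher = closer to
 * the root) raise the ceiling on usage_count, so inner index pages take
 * more clock sweeps to age out.  The flat cap of 5 matches PostgreSQL's
 * BM_MAX_USAGE_COUNT; everything else is made up.
 */
typedef struct { int usage_count; int btree_level; } BufDesc;

#define USAGE_CAP 5

static void
bump_usage(BufDesc *buf)
{
    int cap = USAGE_CAP + buf->btree_level;

    if (buf->usage_count < cap)
        buf->usage_count++;
}

int
main(void)
{
    BufDesc leaf = {5, 0};      /* leaf page, already at the flat cap */
    BufDesc root = {5, 2};      /* page two levels above the leaves   */
    int     i;

    for (i = 0; i < 4; i++)
    {
        bump_usage(&leaf);
        bump_usage(&root);
    }
    printf("leaf usage = %d, root usage = %d\n",
           leaf.usage_count, root.usage_count);
    return 0;
}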

Maybe the thing to focus on first is the oft-discussed "benchmark
farm" (similar to the "build farm"), with a good mix of loads, so
that the impact of changes can be better tracked for multiple
workloads on a variety of platforms and configurations. Without
something like that it is very hard to justify the added complexity
of an idea like this in terms of the performance benefit gained.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 16:19:47
Message-ID: AANLkTimCqGFjh+bhgyzGURai_e9cCGyzA0TPqsC+NSBx@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Maybe the thing to focus on first is the oft-discussed "benchmark
> farm" (similar to the "build farm"), with a good mix of loads, so
> that the impact of changes can be better tracked for multiple
> workloads on a variety of platforms and configurations.  Without
> something like that it is very hard to justify the added complexity
> of an idea like this in terms of the performance benefit gained.

A related area that could use some looking at is why performance tops
out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can
apparently handle much larger buffer pools without a performance
drop-off. There are some advantages to our reliance on the OS buffer
cache, to be sure, but as RAM continues to grow this might start to
get annoying. On a 4GB system you might have shared_buffers set to
25% of memory, but on a 64GB system it'll be a smaller percentage, and
as memory capacities continue to climb it'll be smaller still.
Unfortunately I don't have the hardware to investigate this, but it's
worth thinking about, especially if we're thinking of doing things
that add more caching.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: rsmogura <rsmogura(at)softperience(dot)eu>
Cc: Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 17:08:41
Message-ID: 1300467989-sup-3338@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from rsmogura's message of Fri Mar 18 11:57:48 -0300 2011:

> Actually, the idea of this patch was this: some operations require
> many buffers, and PG uses the "clock sweep" to get the next free
> buffer, so it may evict index buffers. From the point of view of good
> database design we should be using indexes, so purging an index out
> of the cache will hurt performance.

The BufferAccessStrategy stuff was written to solve this problem.

> As a side effect I saw that this 2nd level keeps the pg_* indexes in
> memory too, so I am thinking of including a 3rd level cache for some
> pg_* tables.

Keep in mind that there's already another layer of caching (see
syscache.c) for system catalogs on top of the buffer cache.
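
For reference, a lookup through that cache in backend code looks roughly
like this (headers and error handling trimmed; the helper function itself
is made up for illustration):

#include "postgres.h"
#include "access/htup.h"
#include "catalog/pg_class.h"
#include "utils/syscache.h"

/*
 * Hypothetical helper: fetch a relation's relkind via the syscache
 * instead of reading pg_class blocks through the buffer manager.
 */
static char
get_relkind(Oid relid)
{
    HeapTuple   tup;
    char        relkind;

    tup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for relation %u", relid);

    relkind = ((Form_pg_class) GETSTRUCT(tup))->relkind;
    ReleaseSysCache(tup);

    return relkind;
}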

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 17:17:37
Message-ID: 4D8393B1.2080305@agliodbs.com
Lists: pgsql-hackers

Radek,

> I have implemented an initial concept of a 2nd level cache. The idea
> is to keep some segments of shared memory reserved for special
> buffers (e.g. indexes) so that other operations cannot evict them. I
> added this functionality to the nbtree index scan.

The problem with any "special" buffering of database objects (other than
maybe the system catalogs) is that it improves one use case at the expense
of others. For example, special buffering of indexes would have a negative
effect on use cases which are primarily seq scans. Also, how would your
index buffer work for really huge indexes, like GiST and GIN indexes?

In general, I think that improving the efficiency/scalability of our
existing buffer system is probably going to bear a lot more fruit than
adding extra levels of buffering.

That being said, one may argue that the root pages of btree indexes are a
legitimate special case. However, it seems like clock-sweep would end
up keeping those in shared buffers all the time regardless.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 18:15:10
Message-ID: 9B2A55FF-2238-482F-848D-F19131883743@nasby.net
Lists: pgsql-hackers

On Mar 18, 2011, at 11:19 AM, Robert Haas wrote:
> On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> A related area that could use some looking at is why performance tops
> out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can
> apparently handle much larger buffer pools without a performance
> drop-off. There are some advantages to our reliance on the OS buffer
> cache, to be sure, but as RAM continues to grow this might start to
> get annoying. On a 4GB system you might have shared_buffers set to
> 25% of memory, but on a 64GB system it'll be a smaller percentage, and
> as memory capacities continue to climb it'll be smaller still.
> Unfortunately I don't have the hardware to investigate this, but it's
> worth thinking about, especially if we're thinking of doing things
> that add more caching.

+1

To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 21:13:43
Message-ID: AANLkTinTTJN8AW7F1oB3TefwQbt0fRxEb5M-ya7Ddgy=@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 2:15 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> +1
>
> To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...

The problem is that the OS doesn't offer any mechanism that would
allow us to obey the WAL-before-data rule.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 23:35:00
Message-ID: 201103190035.01030.rsmogura@softperience.eu
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> Thursday 17 March 2011 22:02:18
> Radosław Smogura <rsmogura(at)softperience(dot)eu> wrote:
> > I have implemented an initial concept of a 2nd level cache. The
> > idea is to keep some segments of shared memory reserved for special
> > buffers (e.g. indexes) so that other operations cannot evict them.
> > I added this functionality to the nbtree index scan.
> >
> > I tested this by doing an index scan, a sequential read, dropping
> > the system buffers, and then doing the index scan again. In a few
> > places I saw performance improvements, but I'm not sure whether
> > this was just "random" or a real improvement.
>
> I've often wondered about this. In a database I developed back in
> the '80s it was clearly a win to have a special cache for index
> entries and other special pages closer to the database than the
> general cache. A couple things have changed since the '80s (I mean,
> besides my waistline and hair color), and PostgreSQL has many
> differences from that other database, so I haven't been sure it
> would help as much, but I have wondered.
>
> I can't really look at this for a couple weeks, but I'm definitely
> interested. I suggest that you add this to the next CommitFest as a
> WIP patch, under the Performance category.
>
> https://commitfest.postgresql.org/action/commitfest_view/open
>
> > There are a few places left to optimize, and the patch still needs
> > a lot of work, but could you take a look and give your opinions?
>
> For something like this it makes perfect sense to show "proof of
> concept" before trying to cover everything.
>
> -Kevin

Here I attach the latest version of the patch, with a few performance
improvements (the code is still dirty), together with some reports from the
tests and my simple test scripts.

There is actually a small improvement without dropping the system caches,
and a bigger one with dropping them. I see a small performance decrease
compared to the original PG version with the same configuration (if we can
talk about measurement based on these tests), but an increase with the 2nd
level buffers... or maybe I compared the reports badly.

In the tests I tried to choose typical, simple queries.

Regards,
Radek

Attachment Content-Type Size
2nd_lvl_cache_20110318.diff.bz2 application/x-bzip 9.9 KB
test-scritps_20110319_0026.tar.bz2 application/x-bzip-compressed-tar 2.4 KB
reports_20110318.tar.bz2 application/x-bzip-compressed-tar 2.5 KB

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-18 23:55:29
Message-ID: 4D83F0F1.2020304@agliodbs.com
Lists: pgsql-hackers

On 3/18/11 11:15 AM, Jim Nasby wrote:
> To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...

As far as I know, no OS has a more sophisticated approach to eviction
than LRU. And clock-sweep is a significant improvement on performance
over LRU for frequently accessed database objects ... plus our
optimizations around not overwriting the whole cache for things like VACUUM.

2-level caches work well for a variety of applications.

Now, what would be *really* useful is some way to avoid all the data
copying we do between shared_buffers and the FS cache.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 10:24:22
Message-ID: AANLkTikM3qwZ7A1OCPEQjYJ20Ex8EX0_uzUnJKGK6nwC@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...

A lot of people have talked about it. You can find references to mmap
going at least as far back as 2001 or so. The problem is that it would
depend on the OS implementing things in a certain way and guaranteeing
things we don't think can be portably assumed. We would need to mlock
large amounts of address space which most OS's don't allow, and we
would need to at least mlock and munlock lots of small bits of memory
all over the place which would create lots and lots of mappings which
the kernel and hardware implementations would generally not
appreciate.

> As far as I know, no OS has a more sophisticated approach to eviction
> than LRU.  And clock-sweep is a significant improvement on performance
> over LRU for frequently accessed database objects ... plus our
> optimizations around not overwriting the whole cache for things like VACUUM.

The clock-sweep algorithm was standard OS design before you or I knew
how to type. I would expect any half-decent OS to have something at
least as good -- perhaps better because it can rely on hardware
features to handle things.

However the second point is the crux of the issue and of all similar
issues on where to draw the line between the OS and Postgres. The OS
knows better about the hardware characteristics and can better
optimize the overall system behaviour, but Postgres understands better
its own access patterns and can better optimize its behaviour whereas
the OS is stuck reverse-engineering what Postgres needs, usually from
simple heuristics.

>
> 2-level caches work well for a variety of applications.

I think 2-level caches with simple heuristics like "pin all the
indexes" is unlikely to be helpful. At least it won't optimize the
average case and I think that's been proven. It might be helpful for
optimizing the worst-case which would reduce the standard deviation.
Perhaps we're at the point now where that matters.

Where it might be helpful is as a more refined version of the
"sequential scans use limited set of buffers" patch. Instead of having
each sequential scan use a hard coded number of buffers, perhaps all
sequential scans should share a fraction of the global buffer pool
managed separately from the main pool. Though in my thought
experiments I don't see any real win here. In the current scheme if
there's any sign the buffer is useful it gets thrown from the
sequential scan's set of buffers to reuse anyways.
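
For reference, the per-scan behaviour being generalized here is roughly a
small ring cycled in place, something like this toy model (the real
strategy code lives in freelist.c and uses a 256 kB ring, i.e. 32 buffers,
for large sequential scans):

#include <stdio.h>

/*
 * Toy model of a ring strategy: a sequential scan cycles through a small
 * fixed set of buffer slots instead of sweeping the whole pool, so a big
 * scan only ever touches RING_SIZE buffers.
 */
#define RING_SIZE 32

typedef struct { int ring[RING_SIZE]; int current; } ScanRing;

static int
ring_next_buffer(ScanRing *strategy)
{
    int buf = strategy->ring[strategy->current];

    strategy->current = (strategy->current + 1) % RING_SIZE;
    return buf;
}

int
main(void)
{
    ScanRing r = { {0}, 0 };
    int      i;

    for (i = 0; i < RING_SIZE; i++)     /* pretend these were handed out */
        r.ring[i] = 100 + i;

    for (i = 0; i < 40; i++)            /* a 40-page scan just cycles the ring */
        printf("page %2d -> buffer %d\n", i, ring_next_buffer(&r));
    return 0;
}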

> Now, what would be *really* useful is some way to avoid all the data
> copying we do between shared_buffers and the FS cache.
>

Well the two options are mmap/mlock or directio. The former might be a
fun experiment but I expect any OS to fall over pretty quickly when
faced with thousands (or millions) of 8kB mappings. The latter would
need Postgres to do async i/o and hopefully a global view of its i/o
access patterns so it could do prefetching in a lot more cases.

--
greg


From: rsmogura <rsmogura(at)softperience(dot)eu>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 15:24:22
Message-ID: 8b6104d93d339f5d3755d68f759d713e@mail.softperience.eu
Lists: pgsql-hackers

On Mon, 21 Mar 2011 10:24:22 +0000, Greg Stark wrote:
> On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh(at)agliodbs(dot)com>
> wrote:
>>> To take the opposite approach... has anyone looked at having the OS
>>> just manage all caching for us? Something like MMAPed shared buffers?
>>> Even if we find the issue with large shared buffers, we still can't
>>> dedicate serious amounts of memory to them because of work_mem
>>> issues. Granted, that's something else on the TODO list, but it
>>> really seems like we're re-inventing the wheels that the OS has
>>> already created here...
>
> A lot of people have talked about it. You can find references to mmap
> going at least as far back as 2001 or so. The problem is that it
> would
> depend on the OS implementing things in a certain way and
> guaranteeing
> things we don't think can be portably assumed. We would need to mlock
> large amounts of address space which most OS's don't allow, and we
> would need to at least mlock and munlock lots of small bits of memory
> all over the place which would create lots and lots of mappings which
> the kernel and hardware implementations would generally not
> appreciate.
Actually, just out of curiosity, I did a test with mmap, and I got a 2%
boost on data reading, maybe because it skips the memcpy in fread. I'm
really curious how fast it will be, if at all, once I add some of the
necessary machinery, and how e.g. vacuum will behave.
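
The comparison had roughly this shape (a standalone sketch, not my actual
test code; the file name is a placeholder and error handling is minimal):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * read() copies every block into a private buffer, while a shared
 * read-only mmap() lets us touch the kernel's page-cache pages directly.
 */
int
main(void)
{
    int         fd = open("testfile.dat", O_RDONLY);
    struct stat st;
    char        buf[BLCKSZ];
    long        sum = 0;
    off_t       off;

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size < BLCKSZ)
        return 1;

    /* read() path: one extra copy per block into buf */
    while (read(fd, buf, BLCKSZ) == BLCKSZ)
        sum += buf[0];

    /* mmap() path: pages are faulted in on first touch, no copy */
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    if (base == MAP_FAILED)
        return 1;
    for (off = 0; off + BLCKSZ <= st.st_size; off += BLCKSZ)
        sum += base[off];
    munmap(base, st.st_size);

    printf("checksum: %ld\n", sum);
    close(fd);
    return 0;
}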

<snip>

>> 2-level caches work well for a variety of applications.
>
> I think 2-level caches with simple heuristics like "pin all the
> indexes" is unlikely to be helpful. At least it won't optimize the
> average case and I think that's been proven. It might be helpful for
> optimizing the worst-case which would reduce the standard deviation.
> Perhaps we're at the point now where that matters.
>
Actually, the 2nd level cache does not pin index buffers. In simple words,
it is just a set of reserved buffer ids to be used for index pages; all the
logic for pinning etc. stays the same, the difference being that
default-level operations never touch the 2nd level. I posted some reports
from my simple tests. When I was experimenting with the 2nd level cache I
saw that some operations may swap out system table buffers, too.

<snip>

Regards,
Radek


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 15:54:08
Message-ID: AANLkTikYce6vhevkzoNGF0KwRyLGL=h7XNcJma0ktbix@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 21, 2011 at 5:24 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...
>
> A lot of people have talked about it. You can find references to mmap
> going at least as far back as 2001 or so. The problem is that it would
> depend on the OS implementing things in a certain way and guaranteeing
> things we don't think can be portably assumed. We would need to mlock
> large amounts of address space which most OS's don't allow, and we
> would need to at least mlock and munlock lots of small bits of memory
> all over the place which would create lots and lots of mappings which
> the kernel and hardware implementations would generally not
> appreciate.
>
>> As far as I know, no OS has a more sophisticated approach to eviction
>> than LRU.  And clock-sweep is a significant improvement on performance
>> over LRU for frequently accessed database objects ... plus our
>> optimizations around not overwriting the whole cache for things like VACUUM.
>
> The clock-sweep algorithm was standard OS design before you or I knew
> how to type. I would expect any half-decent OS to have something at
> least as good -- perhaps better because it can rely on hardware
> features to handle things.
>
> However the second point is the crux of the issue and of all similar
> issues on where to draw the line between the OS and Postgres. The OS
> knows better about the hardware characteristics and can better
> optimize the overall system behaviour, but Postgres understands better
> its own access patterns and can better optimize its behaviour whereas
> the OS is stuck reverse-engineering what Postgres needs, usually from
> simple heuristics.
>
>>
>> 2-level caches work well for a variety of applications.
>
> I think 2-level caches with simple heuristics like "pin all the
> indexes" is unlikely to be helpful. At least it won't optimize the
> average case and I think that's been proven. It might be helpful for
> optimizing the worst-case which would reduce the standard deviation.
> Perhaps we're at the point now where that matters.
>
> Where it might be helpful is as a more refined version of the
> "sequential scans use limited set of buffers" patch. Instead of having
> each sequential scan use a hard coded number of buffers, perhaps all
> sequential scans should share a fraction of the global buffer pool
> managed separately from the main pool. Though in my thought
> experiments I don't see any real win here. In the current scheme if
> there's any sign the buffer is useful it gets thrown from the
> sequential scan's set of buffers to reuse anyways.
>
>> Now, what would be *really* useful is some way to avoid all the data
>> copying we do between shared_buffers and the FS cache.
>>
>
> Well the two options are mmap/mlock or directio. The former might be a
> fun experiment but I expect any OS to fall over pretty quickly when
> faced with thousands (or millions) of 8kB mappings. The latter would
> need Postgres to do async i/o and hopefully a global view of its i/o
> access patterns so it could do prefetching in a lot more cases.

Can't you make just one large mapping and lock it in 8k regions? I
thought the problems with mmap were not being able to detect other
processes (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.html),
compatibility issues (possibly obsolete), etc.

merlin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 16:00:42
Message-ID: 4D87762A.5050303@enterprisedb.com
Lists: pgsql-hackers

On 21.03.2011 17:54, Merlin Moncure wrote:
> Can't you make just one large mapping and lock it in 8k regions? I
> thought the problem with mmap was not being able to detect other
> processes (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.html)
> compatibility issues (possibly obsolete), etc.

That mail is about replacing SysV shared memory with mmap(). Detecting
other processes is a problem in that use, but that's not an issue with
using mmap() to replace shared buffers.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 16:47:21
Message-ID: 4D878119.9050700@agliodbs.com
Lists: pgsql-hackers

On 3/21/11 3:24 AM, Greg Stark wrote:
>> 2-level caches work well for a variety of applications.
>
> I think 2-level caches with simple heuristics like "pin all the
> indexes" is unlikely to be helpful. At least it won't optimize the
> average case and I think that's been proven. It might be helpful for
> optimizing the worst-case which would reduce the standard deviation.
> Perhaps we're at the point now where that matters.

You're missing my point ... Postgres already *has* a 2-level cache:
shared_buffers and the FS cache. Anything we add to that will be adding
levels.

We already did that, actually, when we implemented ARC: effectively gave
PostgreSQL a 3-level cache. The results were not very good, although
the algorithm could be at fault there.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 19:08:16
Message-ID: AANLkTikc7vetPLSdQqhijgV=Fi_LGWHmL-JfaYetPDbN@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> Can't you make just one large mapping and lock it in 8k regions? I
> thought the problem with mmap was not being able to detect other
> processes (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.html)
> compatibility issues (possibly obsolete), etc.

I was assuming that locking part of a mapping would force the kernel
to split the mapping. It has to record the locked state somewhere so
it needs a data structure that represents the size of the locked
section and that would, I assume, be the mapping.

It's possible the kernel would not in fact fall over too badly doing
this. At some point I'll go ahead and do experiments on it. It's a bit
fraught though, as the performance may depend on the memory
management features of the chipset.

That said, that's only part of the battle. On 32bit you can't map the
whole database as your database could easily be larger than your
address space. I have some ideas on how to tackle that but the
simplest test would be to just mmap 8kB chunks everywhere.

But it's worse than that. Since you're not responsible for flushing
blocks to disk any longer you need some way to *unlock* a block when
it's possible to be flushed. That means when you flush the xlog you
have to somehow find all the blocks that might no longer need to be
locked and atomically unlock them. That would require new
infrastructure we don't have though it might not be too hard.

What would be nice is a mlock_until() where you eventually issue a
call to tell the kernel what point in time you've reached and it
unlocks everything older than that time.
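
Mechanically, the per-block pinning would look something like this on
Linux (minimal sketch; an anonymous mapping stands in for a mapped
relation file, and it assumes mlock() really does keep the kernel from
writing the page back, which is the premise being discussed):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BLCKSZ   8192
#define NBLOCKS  1024

int
main(void)
{
    size_t  len = (size_t) NBLOCKS * BLCKSZ;
    char   *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED)
        return 1;

    char *blk = base + 7 * BLCKSZ;      /* pretend block 7 is being dirtied */

    /* pin the single 8 kB block before modifying it ... */
    if (mlock(blk, BLCKSZ) != 0)
        perror("mlock");

    memset(blk, 0x42, BLCKSZ);

    /* ... the WAL flush for its LSN would happen here ... */

    /* ... and only then let the kernel write it back or evict it */
    munlock(blk, BLCKSZ);
    munmap(base, len);
    return 0;
}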

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 19:17:32
Message-ID: AANLkTimjvp1c-NfsN6_=R1evUp5d-xH501ewd5D8U7uj@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 21, 2011 at 4:47 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> You're missing my point ... Postgres already *has* a 2-level cache:
> shared_buffers and the FS cache.  Anything we add to that will be adding
> levels.

I don't think those two levels are interesting -- they don't interact
cleverly at all.

I was assuming the two levels were segments of the shared buffers that
didn't interoperate at all. If you kick buffers from the higher level
cache into the lower level one then why not just increase the number
of clock sweeps before you flush a buffer and insert non-index pages
into a lower clock level instead of writing code for two levels?

I don't think it will outperform in general because LRU is provably
within some margin from optimal and the clock sweep is an approximate
LRU. The only place you're going to find wins is when you know
something extra about the *future* access pattern that the lru/clock
doesn't know based on the past behaviour. Just saying "indexes are
heavily used" or "system tables are heavily used" isn't really extra
information since the LRU can figure that out. Something like
"sequential scans of tables larger than shared buffers don't go back
and read old pages before they age out" is.

The other place you might win is if you have some queries that you
want to always be fast at the expense of slower queries. So your short
web queries that only need to touch a few small tables and system
tables can tag buffers that are higher priority and shouldn't be
swapped out to achieve a slightly higher hit rate on the global cache.

--
greg


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 19:33:50
Message-ID: 1300735965-sup-8640@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Josh Berkus's message of Mon Mar 21 13:47:21 -0300 2011:

> We already did that, actually, when we implemented ARC: effectively gave
> PostgreSQL a 3-level cache. The results were not very good, although
> the algorithm could be at fault there.

Was it really all that bad? IIRC we replaced ARC with the current clock
sweep due to patent concerns. (Maybe there were performance concerns as
well, I don't remember).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 19:45:54
Message-ID: 4D87AAF2.8040005@agliodbs.com
Lists: pgsql-hackers


> Was it really all that bad? IIRC we replaced ARC with the current clock
> sweep due to patent concerns. (Maybe there were performance concerns as
> well, I don't remember).

Yeah, that was why the patent was frustrating. Performance was poor and
we were planning on replacing ARC in 8.2 anyway. Instead we had to
backport it.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 19:58:16
Message-ID: AANLkTimCiwB-kxYqvTAy5-hCRwRjY1XQknfMkpMjhmdA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> Can't you make just one large mapping and lock it in 8k regions? I
>> thought the problem with mmap was not being able to detect other
>> processes (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.html)
>> compatibility issues (possibly obsolete), etc.
>
> I was assuming that locking part of a mapping would force the kernel
> to split the mapping. It has to record the locked state somewhere so
> it needs a data structure that represents the size of the locked
> section and that would, I assume, be the mapping.
>
> It's possible the kernel would not in fact fall over too badly doing
> this. At some point I'll go ahead and do experiments on it. It's a bit
> fraught though, as the performance may depend on the memory
> management features of the chipset.
>
> That said, that's only part of the battle. On 32bit you can't map the
> whole database as your database could easily be larger than your
> address space. I have some ideas on how to tackle that but the
> simplest test would be to just mmap 8kB chunks everywhere.

Even on 64 bit systems you only have a 48 bit address space, which is not
just a theoretical limitation. However, at least on linux you can map in
and map out pretty quickly (10 microseconds per map/unmap pair on my linux
vm) so that's not so big of a deal. Dealing with rapidly growing files is a
problem. That said, you are probably not going to want to reserve
multiple gigabytes in 8k non-contiguous chunks.

> But it's worse than that. Since you're not responsible for flushing
> blocks to disk any longer you need some way to *unlock* a block when
> it's possible to be flushed. That means when you flush the xlog you
> have to somehow find all the blocks that might no longer need to be
> locked and atomically unlock them. That would require new
> infrastructure we don't have though it might not be too hard.
>
> What would be nice is a mlock_until() where you eventually issue a
> call to tell the kernel what point in time you've reached and it
> unlocks everything older than that time.

I wonder if there is any reason to mlock at all...if you are going to
'do' mmap, can't you just hide under current lock architecture for
actual locking and do direct memory access without mlock?

merlin


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 22:05:12
Message-ID: 201103212305.12352.rsmogura@softperience.eu
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> Monday 21 March 2011 20:58:16
> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
wrote:
> >> Can't you make just one large mapping and lock it in 8k regions? I
> >> thought the problem with mmap was not being able to detect other
> >> processes
> >> (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.htm
> >> l) compatibility issues (possibly obsolete), etc.
> >
> > I was assuming that locking part of a mapping would force the kernel
> > to split the mapping. It has to record the locked state somewhere so
> > it needs a data structure that represents the size of the locked
> > section and that would, I assume, be the mapping.
> >
> > It's possible the kernel would not in fact fall over too badly doing
> > this. At some point I'll go ahead and do experiments on it. It's a bit
> > fraught though, as the performance may depend on the memory
> > management features of the chipset.
> >
> > That said, that's only part of the battle. On 32bit you can't map the
> > whole database as your database could easily be larger than your
> > address space. I have some ideas on how to tackle that but the
> > simplest test would be to just mmap 8kB chunks everywhere.
>
> Even on 64 bit systems you only have 48 bit address space which is not
> a theoretical limitation. However, at least on linux you can map in
> and map out pretty quick (10 microseconds paired on my linux vm) so
> that's not so big of a deal. Dealing with rapidly growing files is a
> problem. That said, probably you are not going to want to reserve
> multiple gigabytes in 8k non contiguous chunks.
>
> > But it's worse than that. Since you're not responsible for flushing
> > blocks to disk any longer you need some way to *unlock* a block when
> > it's possible to be flushed. That means when you flush the xlog you
> > have to somehow find all the blocks that might no longer need to be
> > locked and atomically unlock them. That would require new
> > infrastructure we don't have though it might not be too hard.
> >
> > What would be nice is a mlock_until() where you eventually issue a
> > call to tell the kernel what point in time you've reached and it
> > unlocks everything older than that time.
Sorry for being curious, but I think mlock is for preventing swapping, not
for preventing flushes.

> I wonder if there is any reason to mlock at all...if you are going to
> 'do' mmap, can't you just hide under current lock architecture for
> actual locking and do direct memory access without mlock?
>
> merlin

The mmap man page does not say anything about when flushing occurs when the
mapping is file-backed and shared, so flushes may happen whether intended or
not. What's more, from what I read, SysV shared memory is emulated with mmap
(and I think that mmap is backed by /dev/shm).

Radek


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 10:11:44
Message-ID: 4D8875E0.1000302@oss.ntt.co.jp
Lists: pgsql-hackers

Hi, hackers.

I am interested in this discussion!
So I surveyed the buffer algorithms currently used by other software, and I
would like to share the results (sorry, it is only a quick survey).

In my quick survey, CLOCK-PRO and LIRS are popular among current buffer
algorithms. Both algorithms come from the same author, Song Jiang.
CLOCK-PRO is an improved LIRS algorithm based on the CLOCK algorithm.

CLOCK-PRO is used by Apache Derby and NetBSD, and LIRS is used by MySQL.

The following is a short explanation of LIRS.

LRU uses a Recency metric, which is the number of other blocks accessed between the last reference to a block and the current time.

Strong points of LRU:
- Low overhead and a simple data structure
- The LRU assumption works well

Weak points of LRU:
- A recently used block will not necessarily be used again, or used soon
- The prediction is based on a single source of information

The LIRS algorithm uses the Recency metric together with an Inter-Reference Recency (IRR) metric, which is the number of other unique blocks accessed between two consecutive references to the block.
Priority in the LIRS algorithm is determined by IRR first and then Recency.
The IRR metric compensates for LRU's weak points.
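
A tiny worked example of the two metrics, computed by brute force (my own
illustration; the reference string and block numbers are arbitrary, and
"other blocks" are counted as distinct blocks, as in the paper):

#include <stdio.h>

#define NBLOCKS 8

int
main(void)
{
    int trace[] = {1, 2, 3, 1, 4, 2, 5, 1};
    int n = sizeof(trace) / sizeof(trace[0]);
    int last[NBLOCKS], prev[NBLOCKS];
    int b, i;

    for (b = 0; b < NBLOCKS; b++)
        last[b] = prev[b] = -1;

    /* remember the two most recent references to each block */
    for (i = 0; i < n; i++)
    {
        b = trace[i];
        prev[b] = last[b];
        last[b] = i;
    }

    for (b = 0; b < NBLOCKS; b++)
    {
        int seen_r[NBLOCKS] = {0};
        int seen_i[NBLOCKS] = {0};
        int recency = 0;
        int irr = -1;           /* -1 means "infinite": referenced at most once */

        if (last[b] < 0)
            continue;

        /* Recency: distinct other blocks accessed after the last reference */
        for (i = last[b] + 1; i < n; i++)
            if (trace[i] != b && !seen_r[trace[i]]++)
                recency++;

        /* IRR: distinct other blocks accessed between the last two references */
        if (prev[b] >= 0)
        {
            irr = 0;
            for (i = prev[b] + 1; i < last[b]; i++)
                if (trace[i] != b && !seen_i[trace[i]]++)
                    irr++;
        }

        printf("block %d: Recency = %d, IRR = %d\n", b, recency, irr);
    }
    return 0;
}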

The LIRS paper claims the following:
- LIRS has the same overhead as LRU.
- Experimental results indicate that LIRS achieves a higher buffer hit rate than LRU and other buffer algorithms.
  * Their experiments ran LIRS and the other algorithms inside the PostgreSQL buffer system.

The CLOCK-PRO paper indicates that CLOCK-PRO is superior to LIRS and other buffer algorithms (including ARC).

I think that PostgreSQL is a very powerful and reliable database!
So I hope that the PostgreSQL buffer system will become even more powerful and more intelligent.

Thanks.

[References]
- CLOCK-PRO: http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang05_CLOCK-Pro.pdf
- LIRS: http://dragonstar.ict.ac.cn/course_09/XD_Zhang/%286%29-LIRS-replacement.pdf
- Apache Derby (Google Summer of Code): http://www.eecg.toronto.edu/~gokul/derby/derby-report-aug-19-2006.pdf
- NetBSD source code: http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD
- MySQL source code: http://mysql.lamphost.net/sources/doxygen/mysql-5.1/structPgman_1_1Page__entry.html
- Song Jiang's homepage: http://www.ece.eng.wayne.edu/~sjiang/

--
Kondo Mitsumasa
NTT Corporation, NTT Open Source Software Center


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 15:24:54
Message-ID: AANLkTikh+XAs8Ph5RwtzJPLo_bVc8w5LwDDEStQjEQd5@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Maybe the thing to focus on first is the oft-discussed "benchmark
>> farm" (similar to the "build farm"), with a good mix of loads, so
>> that the impact of changes can be better tracked for multiple
>> workloads on a variety of platforms and configurations.  Without
>> something like that it is very hard to justify the added complexity
>> of an idea like this in terms of the performance benefit gained.
>
> A related area that could use some looking at is why performance tops
> out at shared_buffers ~8GB and starts to fall thereafter.

Under what circumstances does this happen? Can a simple pgbench -S
with a large scaling factor elicit this behavior?

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 16:47:37
Message-ID: AANLkTikOk_-dOD7hwYGwM_dW+bo5NUoi1TMwjaPWx+VD@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 8:14 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> rsmogura <rsmogura(at)softperience(dot)eu> wrote:
>
>> Yes, there is some change, and I looked at this more carefully,
>> because my performance results weren't what I expected. I found that
>> PG uses a BufferAccessStrategy for sequential scans, so my test query
>> took only 32 buffers from the pool and didn't overwrite the index
>> pool very much. This BAS is really surprising. In any case, when I
>> finish polishing I will send a proper patch, with proof.
>
> Yeah, that heuristic makes this less critical, for sure.
>
>> Actually, the idea of this patch was this: some operations require
>> many buffers, and PG uses the "clock sweep" to get the next free
>> buffer, so it may evict index buffers. From the point of view of good
>> database design we should be using indexes, so purging an index out
>> of the cache will hurt performance.
>>
>> As a side effect I saw that this 2nd level keeps the pg_* indexes in
>> memory too, so I am thinking of including a 3rd level cache for some
>> pg_* tables.
>
> Well, the more complex you make it the more overhead there is, which
> makes it harder to come out ahead.  FWIW, in musing about it (as
> recently as this week), my idea was to add another field which would
> factor into the clock sweep calculations.  For indexes, it might be
> "levels above leaf pages".

The high level blocks of frequently used indexes do a pretty good job
of keeping their usage counts high already, and so probably stay in
the buffer pool as it is. And to the extent they don't, promoting all
indexes (even infrequently used ones, which I think most databases
have) would probably not be the way to encourage the others.

I would be more interested in looking at the sweep algorithm itself.
One thing I noticed in simulating the clock sweep is that having pages
enter the buffer pool with a usage count of 1 might not be very
useful. That gives them 2 sweeps of the clock arm before getting
evicted, so they have an opportunity to get used again. But since all
the blocks they are competing against also do the same thing, that
just means the arm sweeps about twice as fast, so they don't really
get much more of an opportunity. The other thought was that each
buffer gets its usage incremented by 2 or 3 rather than 1 each time
it is found already in the cache.
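
The kind of throwaway simulation I mean looks roughly like this (the
workload and all numbers are invented; it only shows where the knob
would sit):

#include <stdio.h>
#include <stdlib.h>

#define NBUF 64
#define CAP  5

static int page_in[NBUF];       /* which page occupies each buffer, -1 = empty */
static int usage[NBUF];
static int hand;

/* plain clock sweep: decrement usage counts until a zero is found */
static int
sweep(void)
{
    for (;;)
    {
        int v = hand;

        hand = (hand + 1) % NBUF;
        if (usage[v] == 0)
            return v;
        usage[v]--;
    }
}

int
main(void)
{
    int  hit_bump = 1;          /* compare runs with 1 vs. 2 or 3 */
    long hits = 0;
    long accesses = 200000;
    long i;
    int  b;

    for (b = 0; b < NBUF; b++)
        page_in[b] = -1;

    srandom(42);
    for (i = 0; i < accesses; i++)
    {
        /* half the accesses hit 8 hot pages, half hit 100000 cold ones */
        int page = (random() % 2) ? (int) (random() % 8)
                                  : (int) (8 + random() % 100000);
        int found = -1;

        for (b = 0; b < NBUF; b++)
            if (page_in[b] == page) { found = b; break; }

        if (found >= 0)
        {
            hits++;
            usage[found] += hit_bump;
            if (usage[found] > CAP)
                usage[found] = CAP;
        }
        else
        {
            b = sweep();
            page_in[b] = page;
            usage[b] = 1;       /* new pages enter with usage_count 1 */
        }
    }
    printf("hit rate with bump = %d: %.1f%%\n",
           hit_bump, 100.0 * hits / accesses);
    return 0;
}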

> Maybe the thing to focus on first is the oft-discussed "benchmark
> farm" (similar to the "build farm"), with a good mix of loads, so
> that the impact of changes can be better tracked for multiple
> workloads on a variety of platforms and configurations.

Yeah, that sounds great. Even just having a centrally organized group
of scripts/programs that have a good mix of loads, without the
automated farm to go with it, would be a help.

Cheers,

Jeff


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 17:00:08
Message-ID: 4D88D598.9020802@dunslane.net
Lists: pgsql-hackers

On 03/22/2011 12:47 PM, Jeff Janes wrote:
>
>> Maybe the thing to focus on first is the oft-discussed "benchmark
>> farm" (similar to the "build farm"), with a good mix of loads, so
>> that the impact of changes can be better tracked for multiple
>> workloads on a variety of platforms and configurations.
> Yeah, that sounds great. Even just having a centrally organized group
> of scripts/programs that have a good mix of loads, without the
> automated farm to go with it, would be a help.
>
>

Part of the reason for releasing the buildfarm server code a few months
ago (see <https://github.com/PGBuildFarm/server-code>) was to encourage
development of a benchmark farm, among other offspring. But I haven't
seen such an animal emerging.

Someone just needs to sit down and do it and present us with a fait
accompli.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 19:53:08
Message-ID: AANLkTimhWo=rNeQ55OeXgVzder3-e8fMxHz7oP7n08UN@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
>> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>> Maybe the thing to focus on first is the oft-discussed "benchmark
>>> farm" (similar to the "build farm"), with a good mix of loads, so
>>> that the impact of changes can be better tracked for multiple
>>> workloads on a variety of platforms and configurations.  Without
>>> something like that it is very hard to justify the added complexity
>>> of an idea like this in terms of the performance benefit gained.
>>
>> A related area that could use some looking at is why performance tops
>> out at shared_buffers ~8GB and starts to fall thereafter.
>
> Under what circumstances does this happen?  Can a simple pgbench -S
> with a large scaling factor elicit this behavior?

To be honest, I'm mostly just reporting what I've heard Greg Smith say
on this topic. I don't have any machine with that kind of RAM.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Devrim GÜNDÜZ <devrim(at)gunduz(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Mark Wong <markwkm(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 19:55:32
Message-ID: 1300823732.14936.41.camel@lenovo01-laptop03.gunduz.org
Lists: pgsql-hackers

On Tue, 2011-03-22 at 15:53 -0400, Robert Haas wrote:
>
> To be honest, I'm mostly just reporting what I've heard Greg Smith say
> on this topic. I don't have any machine with that kind of RAM.

I thought we had a machine for hackers who want to do performance
testing. Mark?
--
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 21:28:02
Message-ID: 201103222228.03265.rsmogura@softperience.eu
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> Monday 21 March 2011 20:58:16
> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
wrote:
> >> Can't you make just one large mapping and lock it in 8k regions? I
> >> thought the problem with mmap was not being able to detect other
> >> processes
> >> (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.htm
> >> l) compatibility issues (possibly obsolete), etc.
> >
> > I was assuming that locking part of a mapping would force the kernel
> > to split the mapping. It has to record the locked state somewhere so
> > it needs a data structure that represents the size of the locked
> > section and that would, I assume, be the mapping.
> >
> > It's possible the kernel would not in fact fall over too badly doing
> > this. At some point I'll go ahead and do experiments on it. It's a bit
> > fraught though as it the performance may depend on the memory
> > management features of the chipset.
> >
> > That said, that's only part of the battle. On 32bit you can't map the
> > whole database as your database could easily be larger than your
> > address space. I have some ideas on how to tackle that but the
> > simplest test would be to just mmap 8kB chunks everywhere.
>
> Even on 64 bit systems you only have 48 bit address space which is not
> a theoretical limitation. However, at least on linux you can map in
> and map out pretty quick (10 microseconds paired on my linux vm) so
> that's not so big of a deal. Dealing with rapidly growing files is a
> problem. That said, probably you are not going to want to reserve
> multiple gigabytes in 8k non contiguous chunks.
>
> > But it's worse than that. Since you're not responsible for flushing
> > blocks to disk any longer you need some way to *unlock* a block when
> > it's possible to be flushed. That means when you flush the xlog you
> > have to somehow find all the blocks that might no longer need to be
> > locked and atomically unlock them. That would require new
> > infrastructure we don't have though it might not be too hard.
> >
> > What would be nice is a mlock_until() where you eventually issue a
> > call to tell the kernel what point in time you've reached and it
> > unlocks everything older than that time.
>
> I wonder if there is any reason to mlock at all...if you are going to
> 'do' mmap, can't you just hide under current lock architecture for
> actual locking and do direct memory access without mlock?
>
> merlin

Actually, after dealing with mmap and adding munmap, I found a crucial
reason not to use mmap: you need to munmap, and for me this takes a lot
of time. Even when I read with MAP_SHARED | PROT_READ, it looks like
Linux does a flush or something similar; the same happens with
MAP_FIXED, MAP_PRIVATE, etc.

Regards,
Radek


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Radosław Smogura <rsmogura(at)softperience(dot)eu>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-22 22:06:02
Message-ID: AANLkTi=Buitd-AhSKccQuqu8FPUodfO+HHFscbX4AeuR@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura
<rsmogura(at)softperience(dot)eu> wrote:
> Merlin Moncure <mmoncure(at)gmail(dot)com> Monday 21 March 2011 20:58:16
>> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> > On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
> wrote:
>> >> Can't you make just one large mapping and lock it in 8k regions? I
>> >> thought the problem with mmap was not being able to detect other
>> >> processes
>> >> (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.htm
>> >> l) compatibility issues (possibly obsolete), etc.
>> >
>> > I was assuming that locking part of a mapping would force the kernel
>> > to split the mapping. It has to record the locked state somewhere so
>> > it needs a data structure that represents the size of the locked
>> > section and that would, I assume, be the mapping.
>> >
>> > It's possible the kernel would not in fact fall over too badly doing
>> > this. At some point I'll go ahead and do experiments on it. It's a bit
>> > fraught though as it the performance may depend on the memory
>> > management features of the chipset.
>> >
>> > That said, that's only part of the battle. On 32bit you can't map the
>> > whole database as your database could easily be larger than your
>> > address space. I have some ideas on how to tackle that but the
>> > simplest test would be to just mmap 8kB chunks everywhere.
>>
>> Even on 64 bit systems you only have 48 bit address space which is not
>> a theoretical  limitation.  However, at least on linux you can map in
>> and map out pretty quick (10 microseconds paired on my linux vm) so
>> that's not so big of a deal.  Dealing with rapidly growing files is a
>> problem.  That said, probably you are not going to want to reserve
>> multiple gigabytes in 8k non contiguous chunks.
>>
>> > But it's worse than that. Since you're not responsible for flushing
>> > blocks to disk any longer you need some way to *unlock* a block when
>> > it's possible to be flushed. That means when you flush the xlog you
>> > have to somehow find all the blocks that might no longer need to be
>> > locked and atomically unlock them. That would require new
>> > infrastructure we don't have though it might not be too hard.
>> >
>> > What would be nice is a mlock_until() where you eventually issue a
>> > call to tell the kernel what point in time you've reached and it
>> > unlocks everything older than that time.
>>
>> I wonder if there is any reason to mlock at all...if you are going to
>> 'do' mmap, can't you just hide under current lock architecture for
>> actual locking and do direct memory access without mlock?
>>
>> merlin
>
> Actually after dealing with mmap and adding munmap I found crucial thing why
> to not use mmap:
> You need to munmap, and for me this takes much time, even if I read with
> SHARED | PROT_READ, it's looks like Linux do flush or something else, same as
> with MAP_FIXED, MAP_PRIVATE, etc.

Can you produce a small program demonstrating the problem? This is not
how things should work, AIUI.
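
For reference, a minimal standalone micro-benchmark along these lines
could look roughly like the sketch below (purely illustrative, not a
program from this thread): it times read() of 8kB blocks against
per-block mmap()/munmap() on an existing data file.

    /* mmap vs. read() micro-benchmark -- rough sketch only, not tuned */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static double
    elapsed(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int
    main(int argc, char **argv)
    {
        int         fd;
        struct stat st;
        char        buf[BLCKSZ];
        volatile char sink = 0;
        struct timespec t0, t1;
        off_t       nblocks, i;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
            return 1;
        nblocks = st.st_size / BLCKSZ;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < nblocks; i++)
        {
            pread(fd, buf, BLCKSZ, i * BLCKSZ);
            sink += buf[0];
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("read():      %.3f s\n", elapsed(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < nblocks; i++)
        {
            char *p = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, i * BLCKSZ);

            if (p == MAP_FAILED)
                return 1;
            sink += p[0];           /* fault the page in */
            munmap(p, BLCKSZ);      /* the step reported as expensive above */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("mmap/munmap: %.3f s\n", elapsed(t0, t1));

        close(fd);
        return 0;
    }

Running it against the same relation file before and after dropping the
OS cache would help show whether munmap itself, or page-cache
write-back, is where the time goes.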

I was thinking about playing with an mmap implementation of the clog
system -- it's perhaps a better fit. clog has a rigidly defined size
and very high performance requirements. It's also a much smaller change
than reimplementing heap buffering, and maybe not as affected by
munmap.

merlin


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 15:50:06
Message-ID: 201103231650.07007.rsmogura@softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> Monday 21 March 2011 20:58:16
> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
wrote:
> >> Can't you make just one large mapping and lock it in 8k regions? I
> >> thought the problem with mmap was not being able to detect other
> >> processes
> >> (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.htm
> >> l) compatibility issues (possibly obsolete), etc.
> >
> > I was assuming that locking part of a mapping would force the kernel
> > to split the mapping. It has to record the locked state somewhere so
> > it needs a data structure that represents the size of the locked
> > section and that would, I assume, be the mapping.
> >
> > It's possible the kernel would not in fact fall over too badly doing
> > this. At some point I'll go ahead and do experiments on it. It's a bit
> > fraught though as it the performance may depend on the memory
> > management features of the chipset.
> >
> > That said, that's only part of the battle. On 32bit you can't map the
> > whole database as your database could easily be larger than your
> > address space. I have some ideas on how to tackle that but the
> > simplest test would be to just mmap 8kB chunks everywhere.
>
> Even on 64 bit systems you only have 48 bit address space which is not
> a theoretical limitation. However, at least on linux you can map in
> and map out pretty quick (10 microseconds paired on my linux vm) so
> that's not so big of a deal. Dealing with rapidly growing files is a
> problem. That said, probably you are not going to want to reserve
> multiple gigabytes in 8k non contiguous chunks.
>
> > But it's worse than that. Since you're not responsible for flushing
> > blocks to disk any longer you need some way to *unlock* a block when
> > it's possible to be flushed. That means when you flush the xlog you
> > have to somehow find all the blocks that might no longer need to be
> > locked and atomically unlock them. That would require new
> > infrastructure we don't have though it might not be too hard.
> >
> > What would be nice is a mlock_until() where you eventually issue a
> > call to tell the kernel what point in time you've reached and it
> > unlocks everything older than that time.
>
> I wonder if there is any reason to mlock at all...if you are going to
> 'do' mmap, can't you just hide under current lock architecture for
> actual locking and do direct memory access without mlock?
>
> merlin
I can't reproduce this. A simple test shows a 2x faster read with mmap
than with read().

I'm sending what I've done with mmap so far (really ugly, but I'm in
the forest). It is a read-only solution, so init the database first
with some amount of data (I have about 300MB); the 2nd-level scripts
may do this for you.

This is what I found:
1. If I don't require the new mmap to land in the previous region (via
MAP_FIXED) and instead just munmap/mmap with each query, execution time
grows by about 10%.

2. Sometimes it is enough just to comment or uncomment something that
has no side effects on the code flow (in bufmgr.c: (un)comment some
unused if, put in a NULL that gets replaced), and e.g. query execution
time may grow 2x.

3. My initial solution was 2% faster, about 9ms when reading; now it's
10% slower, after making things more usable.

Regards,
Radek

Attachment Content-Type Size
pg_mmap_20110323.patch.bz2 application/x-bzip 13.3 KB

From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 15:52:25
Message-ID: 201103231652.25488.rsmogura@softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> Tuesday 22 March 2011 23:06:02
> On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura
>
> <rsmogura(at)softperience(dot)eu> wrote:
> > Merlin Moncure <mmoncure(at)gmail(dot)com> Monday 21 March 2011 20:58:16
> >
> >> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> >> > On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
> >
> > wrote:
> >> >> Can't you make just one large mapping and lock it in 8k regions? I
> >> >> thought the problem with mmap was not being able to detect other
> >> >> processes
> >> >> (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.h
> >> >> tm l) compatibility issues (possibly obsolete), etc.
> >> >
> >> > I was assuming that locking part of a mapping would force the kernel
> >> > to split the mapping. It has to record the locked state somewhere so
> >> > it needs a data structure that represents the size of the locked
> >> > section and that would, I assume, be the mapping.
> >> >
> >> > It's possible the kernel would not in fact fall over too badly doing
> >> > this. At some point I'll go ahead and do experiments on it. It's a bit
> >> > fraught though as it the performance may depend on the memory
> >> > management features of the chipset.
> >> >
> >> > That said, that's only part of the battle. On 32bit you can't map the
> >> > whole database as your database could easily be larger than your
> >> > address space. I have some ideas on how to tackle that but the
> >> > simplest test would be to just mmap 8kB chunks everywhere.
> >>
> >> Even on 64 bit systems you only have 48 bit address space which is not
> >> a theoretical limitation. However, at least on linux you can map in
> >> and map out pretty quick (10 microseconds paired on my linux vm) so
> >> that's not so big of a deal. Dealing with rapidly growing files is a
> >> problem. That said, probably you are not going to want to reserve
> >> multiple gigabytes in 8k non contiguous chunks.
> >>
> >> > But it's worse than that. Since you're not responsible for flushing
> >> > blocks to disk any longer you need some way to *unlock* a block when
> >> > it's possible to be flushed. That means when you flush the xlog you
> >> > have to somehow find all the blocks that might no longer need to be
> >> > locked and atomically unlock them. That would require new
> >> > infrastructure we don't have though it might not be too hard.
> >> >
> >> > What would be nice is a mlock_until() where you eventually issue a
> >> > call to tell the kernel what point in time you've reached and it
> >> > unlocks everything older than that time.
> >>
> >> I wonder if there is any reason to mlock at all...if you are going to
> >> 'do' mmap, can't you just hide under current lock architecture for
> >> actual locking and do direct memory access without mlock?
> >>
> >> merlin
> >
> > Actually after dealing with mmap and adding munmap I found crucial thing
> > why to not use mmap:
> > You need to munmap, and for me this takes much time, even if I read with
> > SHARED | PROT_READ, it's looks like Linux do flush or something else,
> > same as with MAP_FIXED, MAP_PRIVATE, etc.
>
> can you produce small program demonstrating the problem? This is not
> how things should work AIUI.
>
> I was thinking about playing with mmap implementation of clog system
> -- it's perhaps better fit. clog is rigidly defined size, and has
> very high performance requirements. Also it's much less changes than
> reimplementing heap buffering, and maybe not so much affected by
> munmap.
>
> merlin

Ah... just one thing that may be useful for why performance is lost
with huge memory: I saw that mmapped buffers are allocated at something
like 0x007, so definitely above 4GB.


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 17:53:00
Message-ID: 26A0B7FC-369E-41D9-857A-84969A2C8998@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 22, 2011, at 2:53 PM, Robert Haas wrote:
> On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
>>> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>>> Maybe the thing to focus on first is the oft-discussed "benchmark
>>>> farm" (similar to the "build farm"), with a good mix of loads, so
>>>> that the impact of changes can be better tracked for multiple
>>>> workloads on a variety of platforms and configurations. Without
>>>> something like that it is very hard to justify the added complexity
>>>> of an idea like this in terms of the performance benefit gained.
>>>
>>> A related area that could use some looking at is why performance tops
>>> out at shared_buffers ~8GB and starts to fall thereafter.
>>
>> Under what circumstances does this happen? Can a simple pgbench -S
>> with a large scaling factor elicit this behavior?
>
> To be honest, I'm mostly just reporting what I've heard Greg Smith say
> on this topic. I don't have any machine with that kind of RAM.

When we started using 192G servers we tried switching our largest OLTP database (would have been about 1.2TB at the time) from 8GB shared buffers to 28GB. Performance went down enough to notice; I don't have any solid metrics, but I'd ballpark it at 10-15%.

One thing that I've always wondered about is the logic of having backends run the clock sweep on a normal basis. OSes that use clock sweep have a dedicated process to run the clock in the background, with the intent of keeping X pages on the free list. We actually have most of the mechanisms to do that, we just don't have the added process. I believe the bgwriter was intended to handle that, but in reality I don't think it actually manages to keep much of anything on the free list. Once we have a performance testing environment I'd be interested to test a modified version that includes a dedicated background clock sweep process that strives to keep X buffers on the free list.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 20:00:22
Message-ID: AANLkTi=XZiUpLrc+a-QHDtrOLyU+Xd4ngWXBCBqQHR=5@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2011 at 1:53 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> When we started using 192G servers we tried switching our largest OLTP database (would have been about 1.2TB at the time) from 8GB shared buffers to 28GB. Performance went down enough to notice; I don't have any solid metrics, but I'd ballpark it at 10-15%.
>
> One thing that I've always wondered about is the logic of having backends run the clocksweep on a normal basis. OS's that use clock-sweep have a dedicated process to run the clock in the background, with the intent of keeping X amount of pages on the free list. We actually have most of the mechanisms to do that, we just don't have the added process. I believe bg_writer was intended to handle that, but in reality I don't think it actually manages to keep much of anything on the free list. Once we have a performance testing environment I'd be interested to test a modified version that includes a dedicated background clock sweep process that strives to keep X amount of buffers on the free list.

It looks like the only way anything can ever get put on the free list
right now is if a relation or database is dropped. That doesn't seem
too good. I wonder if the background writer shouldn't be trying to
maintain the free list. That is, perhaps BgBufferSync() should notice
when the number of free buffers drops below some threshold, and run
the clock sweep enough to get it back up to that threshold.
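
As a rough illustration of that shape (StrategyFreeListLen(),
ClockSweepNextVictim() and StrategyMoveToFreeList() are assumed helper
names here, not existing bufmgr functions), the background writer's
cycle might gain something like:

    /*
     * Hypothetical sketch only: have the background writer keep the free list
     * topped up so backends rarely need to run the clock sweep themselves.
     */
    static void
    BgRefillFreeList(int target)
    {
        int     tries = NBuffers;       /* at most one full sweep per call */

        while (StrategyFreeListLen() < target && tries-- > 0)
        {
            volatile BufferDesc *buf = ClockSweepNextVictim();

            /* only clean, unpinned, unused buffers are worth listing */
            if (buf->refcount == 0 && buf->usage_count == 0 &&
                !(buf->flags & BM_DIRTY))
                StrategyMoveToFreeList(buf);
        }
    }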

On a related note, I've been thinking about whether we could make
bgwriter_delay adaptively self-tuning. If we notice that we
overslept, we don't sleep as long the next time; if not much happens
while we sleep, we sleep longer the next time.
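
The adaptive delay could be expressed along these lines (variable and
function names are illustrative only, not actual bgwriter code):

    /*
     * Hypothetical sketch of an adaptive sleep: shorten the delay when the
     * last cycle found work (we "overslept"), lengthen it when it did not.
     */
    static long bgwriter_delay_ms = 200;    /* starting point */

    static void
    BgAdaptiveSleep(bool found_work)
    {
        if (found_work)
            bgwriter_delay_ms = Max(bgwriter_delay_ms / 2, 10);     /* react faster */
        else
            bgwriter_delay_ms = Min(bgwriter_delay_ms * 2, 10000);  /* back off */

        pg_usleep(bgwriter_delay_ms * 1000L);
    }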

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 20:30:04
Message-ID: AANLkTi=A-F4UCTwroykQqvcn0i6pa4uyvh-nGFf02ppO@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2011 at 8:00 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> It looks like the only way anything can ever get put on the free list
> right now is if a relation or database is dropped.  That doesn't seem
> too good.  I wonder if the background writer shouldn't be trying to
> maintain the free list.  That is, perhaps BgBufferSync() should notice
> when the number of free buffers drops below some threshold, and run
> the clock sweep enough to get it back up to that threshold.
>

I think this is just a terminology discrepancy. In postgres the free
list is only used for buffers that contain no useful data at all. The
only time there are buffers on the free list is at startup or if a
relation or database is dropped.

Most of the time blocks are read into buffers that already contain
other data. Candidate buffers to evict are buffers that have been used
least recently. That's what the clock sweep implements.

What the bgwriter's responsible for is looking at the buffers *ahead*
of the clock sweep and flushing them to disk. They stay in ram and
don't go on the free list, all that changes is that they're clean and
therefore can be reused without having to do any i/o.

I'm a bit skeptical that this works, because as soon as the bgwriter
saturates the I/O the OS will throttle the rate at which it can write.
When that happens, even a few dozen milliseconds will be plenty to
allow the purely user-space processes consuming the buffers to catch
up instantly.

But Greg Smith has done a lot of work tuning the bgwriter so that it
is at least useful in some circumstances. I could well see it being
useful for systems where latency matters and the i/o is not saturated.

--
greg


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 20:49:17
Message-ID: 201103232149.18156.rsmogura@softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> Wednesday 23 March 2011 21:30:04
> On Wed, Mar 23, 2011 at 8:00 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > It looks like the only way anything can ever get put on the free list
> > right now is if a relation or database is dropped. That doesn't seem
> > too good. I wonder if the background writer shouldn't be trying to
> > maintain the free list. That is, perhaps BgBufferSync() should notice
> > when the number of free buffers drops below some threshold, and run
> > the clock sweep enough to get it back up to that threshold.
>
> I think this is just a terminology discrepancy. In postgres the free
> list is only used for buffers that contain no useful data at all. The
> only time there are buffers on the free list is at startup or if a
> relation or database is dropped.
>
> Most of the time blocks are read into buffers that already contain
> other data. Candidate buffers to evict are buffers that have been used
> least recently. That's what the clock sweep implements.
>
> What the bgwriter's responsible for is looking at the buffers *ahead*
> of the clock sweep and flushing them to disk. They stay in ram and
> don't go on the free list, all that changes is that they're clean and
> therefore can be reused without having to do any i/o.
>
> I'm a bit skeptical that this works because as soon as bgwriter
> saturates the i/o the os will throttle the rate at which it can write.
> When that happens even a few dozens of milliseconds will be plenty to
> allow the purely user-space processes consuming the buffers to catch
> up instantly.
>
> But Greg Smith has done a lot of work tuning the bgwriter so that it
> is at least useful in some circumstances. I could well see it being
> useful for systems where latency matters and the i/o is not saturated.

The freelist is almost useless under normal operations, but checking
whether it is empty is only a single test, which could be optimized by
writing it as (... > -1) or !(... < 0).

Regards,
Radek


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 22:12:12
Message-ID: 5893.1300918332@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> It looks like the only way anything can ever get put on the free list
> right now is if a relation or database is dropped. That doesn't seem
> too good.

Why not? AIUI the free list is only for buffers that are totally dead,
ie contain no info that's possibly of interest to anybody. It is *not*
meant to substitute for running the clock sweep when you have to discard
a live buffer.

regards, tom lane


From: Jim Nasby <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-24 19:36:48
Message-ID: E521ADA8-EF65-4DEA-BE37-3B2FDE983BD8@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> It looks like the only way anything can ever get put on the free list
>> right now is if a relation or database is dropped. That doesn't seem
>> too good.
>
> Why not? AIUI the free list is only for buffers that are totally dead,
> ie contain no info that's possibly of interest to anybody. It is *not*
> meant to substitute for running the clock sweep when you have to discard
> a live buffer.

Turns out we've had this discussion before: http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php

Tom made the point in the first one that it might be good to proactively move buffers to the freelist so that backends would normally just have to hit the freelist and not run the sweep.

Unfortunately I haven't yet been able to do any performance testing of any of this... perhaps someone else can try and measure the amount of time spent by backends running the clock sweep with different shared buffer sizes.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-24 20:27:02
Message-ID: 201103242127.02387.rsmogura@softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim Nasby <jim(at)nasby(dot)net> Thursday 24 March 2011 20:36:48
> On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> It looks like the only way anything can ever get put on the free list
> >> right now is if a relation or database is dropped. That doesn't seem
> >> too good.
> >
> > Why not? AIUI the free list is only for buffers that are totally dead,
> > ie contain no info that's possibly of interest to anybody. It is *not*
> > meant to substitute for running the clock sweep when you have to discard
> > a live buffer.
>
> Turns out we've had this discussion before:
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php
>
> Tom made the point in the first one that it might be good to proactively
> move buffers to the freelist so that backends would normally just have to
> hit the freelist and not run the sweep.
>
> Unfortunately I haven't yet been able to do any performance testing of any
> of this... perhaps someone else can try and measure the amount of time
> spent by backends running the clock sweep with different shared buffer
> sizes. --
> Jim C. Nasby, Database Architect jim(at)nasby(dot)net
> 512.569.9461 (cell) http://jim.nasby.net

Wouldn't it be enough to take a spin lock (or use an ASM lock-prefixed
increment on Intel/AMD) around the increment of
StrategyControl->nextVictimBuffer? Everything here could be controlled
by a GetNextVictimBuffer() macro. Within the for (;;) loop the valid
buffer may be obtained modulo NBuffers, to decrease lock time. We could
also try to calculate how many buffers we skipped, decrease e.g.
trycounter by that value, add some additional restriction like no more
than NBuffers*4 passes, and report an error otherwise.

This would make the clock sweep concurrent.
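
To make the idea concrete, a sketch of such a GetNextVictimBuffer()
with only a spinlock around the counter could look like this (the
victim_lock field is assumed, not an existing StrategyControl member):

    /*
     * Hypothetical sketch: protect only the victim-counter advance with a
     * spinlock (or an atomic fetch-and-add where available) instead of the
     * BufFreelistLock, so backends can run the sweep concurrently.
     */
    static int
    GetNextVictimBuffer(void)
    {
        int     victim;

        SpinLockAcquire(&StrategyControl->victim_lock);
        victim = StrategyControl->nextVictimBuffer;
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        SpinLockRelease(&StrategyControl->victim_lock);

        return victim;
    }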

Regards,
Radek


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-24 20:59:10
Message-ID: AANLkTiktwL=AUCucv6=VqVzfx+AP8hxf8g-eYoa1NMP1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2011 at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> It looks like the only way anything can ever get put on the free list
>> right now is if a relation or database is dropped.  That doesn't seem
>> too good.
>
> Why not?  AIUI the free list is only for buffers that are totally dead,
> ie contain no info that's possibly of interest to anybody.  It is *not*
> meant to substitute for running the clock sweep when you have to discard
> a live buffer.

It seems at least plausible that buffer allocation could be
significantly faster if it need only pop the head of a list, rather
than scanning until it finds a suitable candidate. Moving as much
buffer allocation work as possible into the background seems like it
ought to be useful.

Granted, I've made no attempt to code or test this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-24 21:34:55
Message-ID: AANLkTimEBwYDmwFc-3N_vZ+_c_5c8JZ2PxTB8qQ3ZOOM@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> It seems at least plausible that buffer allocation could be
> significantly faster if it need only pop the head of a list, rather
> than scanning until it finds a suitable candidate.  Moving as much
> buffer allocation work as possible into the background seems like it
> ought to be useful.
>

Linked lists are notoriously non-concurrent, that's the whole reason
for the clock sweep algorithm to exist at all instead of just using an
LRU directly. That said, an LRU needs to be able to remove elements
from the middle and not just enqueue elements on the tail, so the
situation isn't exactly equivalent.

Just popping off the head is simple enough but the bgwriter would need
to be able to add elements to the tail of the list and the people
popping elements off the head would need to compete with it for the
lock on the list. And I think you need a single lock for the whole
list because of the cases where the list becomes a single element or
empty.
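
A sketch of what that single-lock list might look like (the structure
and field names here are illustrative, not the actual freelist.c code):

    /*
     * One spinlock protects both ends, since with zero or one element the
     * head and tail are effectively the same entry.
     */
    typedef struct
    {
        slock_t     lock;
        int         head;       /* next buffer id to hand out, -1 if empty */
        int         tail;       /* last buffer id, -1 if empty */
    } FreeListControl;

    /* bgwriter side: append a clean victim candidate */
    static void
    freelist_push_tail(FreeListControl *fl, int buf_id)
    {
        SpinLockAcquire(&fl->lock);
        if (fl->tail < 0)
            fl->head = buf_id;          /* list was empty */
        else
            BufferDescriptors[fl->tail].freeNext = buf_id;
        BufferDescriptors[buf_id].freeNext = FREENEXT_END_OF_LIST;
        fl->tail = buf_id;
        SpinLockRelease(&fl->lock);
    }

    /* backend side: take the head, or -1 if the list is empty */
    static int
    freelist_pop_head(FreeListControl *fl)
    {
        int     buf_id;

        SpinLockAcquire(&fl->lock);
        buf_id = fl->head;
        if (buf_id >= 0)
        {
            fl->head = BufferDescriptors[buf_id].freeNext;
            if (fl->head < 0)
                fl->tail = -1;          /* list became empty */
        }
        SpinLockRelease(&fl->lock);
        return buf_id;
    }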

The main impact this list would have is that it would presumably need
some real number of buffers to satisfy the pressure for victim buffers
for a real amount of time. That would represent a decrease in cache
size, effectively evicting buffers from cache as if the cache were
smaller by that amount.

Theoretical results are that a small change in cache size affects
cache hit rates substantially. I'm not sure that's borne out by
practical experience with Postgres, though. People tend to either be
doing mostly I/O or very little I/O; cache hit rate only really
matters, and is likely to be affected by small changes in cache size,
in the space in between.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-24 21:41:19
Message-ID: AANLkTim2SvoBqG63oTWxkzRncB0Jk9j+YtxrKqTET+LU@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 5:34 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> It seems at least plausible that buffer allocation could be
>> significantly faster if it need only pop the head of a list, rather
>> than scanning until it finds a suitable candidate.  Moving as much
>> buffer allocation work as possible into the background seems like it
>> ought to be useful.
>
> Linked lists are notoriously non-concurrent, that's the whole reason
> for the clock sweep algorithm to exist at all instead of just using an
> LRU directly. That said, an LRU needs to be able to remove elements
> from the middle and not just enqueue elements on the tail, so the
> situation isn't exactly equivalent.
>
> Just popping off the head is simple enough but the bgwriter would need
> to be able to add elements to the tail of the list and the people
> popping elements off the head would need to compete with it for the
> lock on the list. And I think you need a single lock for the whole
> list because of the cases where the list becomes a single element or
> empty.
>
> The main impact this list would have is that it would presumably need
> some real number of buffers to satisfy the pressure for victim buffers
> for a real amount of time. That would represent a decrease in cache
> size, effectively evicting buffers from cache as if the cache were
> smaller by that amount.
>
> Theoretical results are that a small change in cache size affects
> cache hit rates substantially. I'm not sure that's born out by
> practical experience with Postgres though. People tend to either be
> doing mostly i/o or very little i/o. Cache hit rate only really
> matters and is likely to be affected by small changes in cache size in
> the space in between

You wouldn't really have to reduce the effective cache size - there's
logic in there to just skip to the next buffer if the first one you
pull off the freelist has been reused. But the cache hit rates on
those buffers would (you'd hope) be fairly low, since they are the
ones we're about to reuse. Maybe it doesn't work out to a win,
though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-24 21:59:41
Message-ID: 201103242259.41631.rsmogura@softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> Thursday 24 March 2011 22:41:19
> On Thu, Mar 24, 2011 at 5:34 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
> >> It seems at least plausible that buffer allocation could be
> >> significantly faster if it need only pop the head of a list, rather
> >> than scanning until it finds a suitable candidate. Moving as much
> >> buffer allocation work as possible into the background seems like it
> >> ought to be useful.
> >
> > Linked lists are notoriously non-concurrent, that's the whole reason
> > for the clock sweep algorithm to exist at all instead of just using an
> > LRU directly. That said, an LRU needs to be able to remove elements
> > from the middle and not just enqueue elements on the tail, so the
> > situation isn't exactly equivalent.
> >
> > Just popping off the head is simple enough but the bgwriter would need
> > to be able to add elements to the tail of the list and the people
> > popping elements off the head would need to compete with it for the
> > lock on the list. And I think you need a single lock for the whole
> > list because of the cases where the list becomes a single element or
> > empty.
> >
> > The main impact this list would have is that it would presumably need
> > some real number of buffers to satisfy the pressure for victim buffers
> > for a real amount of time. That would represent a decrease in cache
> > size, effectively evicting buffers from cache as if the cache were
> > smaller by that amount.
> >
> > Theoretical results are that a small change in cache size affects
> > cache hit rates substantially. I'm not sure that's born out by
> > practical experience with Postgres though. People tend to either be
> > doing mostly i/o or very little i/o. Cache hit rate only really
> > matters and is likely to be affected by small changes in cache size in
> > the space in between
>
> You wouldn't really have to reduce the effective cache size - there's
> logic in there to just skip to the next buffer if the first one you
> pull off the freelist has been reused. But the cache hit rates on
> those buffers would (you'd hope) be fairly low, since they are the
> ones we're about to reuse. Maybe it doesn't work out to a win,
> though.
If I may: under abnormal circumstances (such as the current process
being "held" by the kernel), obtaining a buffer from the list may be
cheaper. This code

    while (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;

could look like this

    do
    {
        SpinLock();
        if (StrategyControl->firstFreeBuffer < 0)
        {
            /* free list is empty */
            Unspin();
            break;
        }

        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
        Unspin();

        /* ... buffer validity checks as in the original code ... */
    } while (true);

and acquiring a spin lock for the linked list is enough, and cheaper
than taking an LWLock, which is more complex than a spin lock here.

After this, similarly with a spin lock around the victim-buffer
advance:

    trycounter = NBuffers * 4;
    for (;;)
    {
        SpinLock();
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        Unspin();


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-24 23:33:33
Message-ID: AANLkTikN_uWpaADpqFRoMxFdbB31q=XOn7dUoVqw-rO6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 12:36 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> It looks like the only way anything can ever get put on the free list
>>> right now is if a relation or database is dropped.  That doesn't seem
>>> too good.
>>
>> Why not?  AIUI the free list is only for buffers that are totally dead,
>> ie contain no info that's possibly of interest to anybody.  It is *not*
>> meant to substitute for running the clock sweep when you have to discard
>> a live buffer.
>
> Turns out we've had this discussion before: http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php
>
> Tom made the point in the first one that it might be good to proactively move buffers to the freelist so that backends would normally just have to hit the freelist and not run the sweep.
>
> Unfortunately I haven't yet been able to do any performance testing of any of this... perhaps someone else can try and measure the amount of time spent by backends running the clock sweep with different shared buffer sizes.

I tried under the circumstances I thought were most likely to show a
time difference, and I was unable to detect a reliable difference in
timing between free list and clock sweep.

Cheers,

Jeff


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-25 02:51:38
Message-ID: AANLkTimL0se6Ddjoyxp2_66-ZQhHXLpAmvWGwfvRvDQL@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 11:33 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I tried under the circumstances I thought were mostly likely to show a
> time difference, and I was unable to detect a reliable difference in
> timing between free list and clock sweep.

It strikes me that it shouldn't be terribly hard to add a profiling
option to Postgres to dump out a list of precisely which blocks of
data were accessed in which order. Then it's fairly straightforward to
process that list using different algorithms to measure which
generates the fewest cache misses.

This is usually how the topic is handled in academic discussions. The
optimal cache policy is the one which flushes the cache entry which
will be used next the furthest into the future. Given a precalculated
file you can calculate the results from that optimal strategy and then
compare each strategy against that one.
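
For what it's worth, a tiny simulator for that optimal ("evict the
block used furthest in the future") policy over a recorded trace of
block numbers might look like the following sketch (not tied to any
existing tool):

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Sketch of an optimal-replacement (Belady) simulator: given a trace of
     * block numbers, count misses when always evicting the cached block whose
     * next use lies furthest in the future.  O(n * cache_size), fine for
     * modest traces.
     */
    static long
    next_use(const long *trace, long n, long from, long block)
    {
        for (long i = from; i < n; i++)
            if (trace[i] == block)
                return i;
        return n;               /* never referenced again */
    }

    static long
    optimal_misses(const long *trace, long n, int cache_size)
    {
        long       *cache = malloc(sizeof(long) * cache_size);
        int         used = 0;
        long        misses = 0;

        for (long i = 0; i < n; i++)
        {
            int     j, hit = 0, victim = 0;
            long    furthest = -1;

            for (j = 0; j < used; j++)
                if (cache[j] == trace[i]) { hit = 1; break; }
            if (hit)
                continue;
            misses++;
            if (used < cache_size)
            {
                cache[used++] = trace[i];
                continue;
            }
            /* evict the entry whose next reference is furthest away */
            for (j = 0; j < used; j++)
            {
                long nu = next_use(trace, n, i + 1, cache[j]);

                if (nu > furthest) { furthest = nu; victim = j; }
            }
            cache[victim] = trace[i];
        }
        free(cache);
        return misses;
    }

    int
    main(void)
    {
        static long trace[1000000];
        long        n = 0;

        while (n < 1000000 && scanf("%ld", &trace[n]) == 1)
            n++;
        printf("optimal misses: %ld\n", optimal_misses(trace, n, 1000));
        return 0;
    }

Feeding the same trace through a clock-sweep or LRU simulation and
comparing miss counts against this lower bound would then be
straightforward.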

--
greg


From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-25 15:07:00
Message-ID: AANLkTimNqo11N9cL6PQu45Py59BMxCLpNo+VUKrf2tDz@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> >> On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
> >> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> >>> Maybe the thing to focus on first is the oft-discussed "benchmark
> >>> farm" (similar to the "build farm"), with a good mix of loads, so
> >>> that the impact of changes can be better tracked for multiple
> >>> workloads on a variety of platforms and configurations. Without
> >>> something like that it is very hard to justify the added complexity
> >>> of an idea like this in terms of the performance benefit gained.
> >>
> >> A related area that could use some looking at is why performance tops
> >> out at shared_buffers ~8GB and starts to fall thereafter.
> >
> > Under what circumstances does this happen? Can a simple pgbench -S
> > with a large scaling factor elicit this behavior?
>
> To be honest, I'm mostly just reporting what I've heard Greg Smith say
> on this topic. I don't have any machine with that kind of RAM.
>

I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple
Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2
Compute Units each), 1690 GB of local instance storage, 64-bit platform).
That's the largest memory AWS has.

Let me know if I can help.

Regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-25 15:58:57
Message-ID: 267AE6DC-6597-4845-B84C-FDC1CDD7103C@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 25, 2011, at 10:07 AM, Gurjeet Singh wrote:
> On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
> >> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> >>> Maybe the thing to focus on first is the oft-discussed "benchmark
> >>> farm" (similar to the "build farm"), with a good mix of loads, so
> >>> that the impact of changes can be better tracked for multiple
> >>> workloads on a variety of platforms and configurations. Without
> >>> something like that it is very hard to justify the added complexity
> >>> of an idea like this in terms of the performance benefit gained.
> >>
> >> A related area that could use some looking at is why performance tops
> >> out at shared_buffers ~8GB and starts to fall thereafter.
> >
> > Under what circumstances does this happen? Can a simple pgbench -S
> > with a large scaling factor elicit this behavior?
>
> To be honest, I'm mostly just reporting what I've heard Greg Smith say
> on this topic. I don't have any machine with that kind of RAM.
>
> I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of local instance storage, 64-bit platform). That's the largest memory AWS has.

Related to that... after talking to Greg Smith at PGEast last night, he felt it would be very valuable just to profile how much time is being spent waiting/holding the freelist lock in a real environment. I'm going to see if we can do that on one of our slave databases.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-25 16:26:38
Message-ID: AANLkTi=A2sa=cTd998JkXYaCg0m7JGirEvqcnm8V2K2_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 24, 2011 at 7:51 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Thu, Mar 24, 2011 at 11:33 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> I tried under the circumstances I thought were mostly likely to show a
>> time difference, and I was unable to detect a reliable difference in
>> timing between free list and clock sweep.
>
> It strikes me that it shouldn't be terribly hard to add a profiling
> option to Postgres to dump out a list of precisely which blocks of
> data were accessed in which order. Then it's fairly straightforward to
> process that list using different algorithms to measure which
> generates the fewest cache misses.

It is pretty easy to get the list by adding a couple of "elog" calls.
To be safe you probably also need to record pins and unpins, as you
can't evict a pinned buffer no matter how otherwise eligible it might
be. For most workloads you might be able to get away with just assuming
that if it is eligible for replacement under any reasonable strategy,
then it is very unlikely to still be pinned. Also, if the list is
derived from a concurrent environment, then the order of access you
see under a particular policy might no longer be the same if a
different policy were adopted.

But whose workload would you use to do the testing? The ones I was
testing were simple enough that I just know what the access pattern
is: the root and first-level branch blocks are almost always in shared
buffers, the leaf and table blocks almost never are.

Here my concern was not how to choose which block to replace in a
conceptual way, but rather how to code that selection in way that is
fast and concurrent and low latency for the latency-sensitive
processes. Either method will evict the same blocks, with the
exception of differences introduced by race conditions that get
resolved differently.

A benefit of focusing on the implementation rather than the high level
selection strategy is that improvements in implementation are more
likely to better carry over to other workloads.

My high level conclusions were that the running of the selection is
generally not a bottleneck, and in the cases where it was, the
bottleneck was due to contention on the LWLock, regardless of what was
done under that lock. Changing who does the clock-sweep is probably
not meaningful unless it facilitates a lock-strength reduction or
other contention reduction.

I have also played with simulations of different algorithms for
managing the usage_count, and I could get improvements, but they
weren't big enough or general enough to be very exciting. It was
generally the case that if the data size was X, the improvement was
maybe 30% over the current behavior, but if the data size was <0.8X or
>1.2X, there was no difference. So not very general.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-25 18:08:31
Message-ID: A989E53F-C1BA-44BB-AEC9-8898F2C8BC04@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 25, 2011, at 11:58 AM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> Related to that... after talking to Greg Smith at PGEast last night, he felt it would be very valuable just to profile how much time is being spent waiting/holding the freelist lock in a real environment. I'm going to see if we can do that on one of our slave databases.

Yeah, that would be great. Also, some LWLOCK_STATS output or oprofile output would be definitely be useful.

...Robert


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura <rsmogura(at)softperience(dot)eu>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-26 22:01:46
Message-ID: AANLkTi=3Ar8JPzSwhQWS6u61WaHWSNJUXO5FQ8a+fGgM@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 25, 2011 at 8:07 AM, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> wrote:
> On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> > On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> > wrote:
>> >>
>> >> A related area that could use some looking at is why performance tops
>> >> out at shared_buffers ~8GB and starts to fall thereafter.
>> >
>> > Under what circumstances does this happen?  Can a simple pgbench -S
>> > with a large scaling factor elicit this behavior?
>>
>> To be honest, I'm mostly just reporting what I've heard Greg Smith say
>> on this topic.   I don't have any machine with that kind of RAM.
>
> I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple
> Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2
> Compute Units each), 1690 GB of local instance storage, 64-bit platform).
> That's the largest memory AWS has.

Does AWS have machines with battery-backed write cache? I think
people running servers with 192G probably have BBWC, so it may be hard
to do realistic tests without also having one on the test machine.

But probably a bigger problem is that (to the best of my knowledge) we
don't seem to have a non-proprietary, generally implementable
benchmark system or load-generator which is known to demonstrate the
problem.

Cheers,

Jeff


From: rsmogura <rsmogura(at)softperience(dot)eu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: <josh(at)agliodbs(dot)com>, <gsstark(at)mit(dot)edu>, <jim(at)nasby(dot)net>, <robertmhaas(at)gmail(dot)com>, <Kevin(dot)Grittner(at)wicourts(dot)gov>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-31 13:53:01
Message-ID: 420c1ace890f429adcdb3df12bd35257@mail.softperience.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 26 Mar 2011 08:33:42 -0400, Merlin Moncure wrote:
> On Fri, Mar 25, 2011 at 11:02 PM, Radosław Smogura
> <rsmogura(at)softperience(dot)eu> wrote:
>> Merlin Moncure <mmoncure(at)gmail(dot)com> Thursday 24 March 2011 15:50:36
>>> On Thu, Mar 24, 2011 at 1:25 AM, Radosław Smogura
>>>
>>> <rsmogura(at)softperience(dot)eu> wrote:
>>> > Merlin Moncure <mmoncure(at)gmail(dot)com> Wednesday 23 March 2011
>>> 21:26:16
>>> >
>>> >> On Wed, Mar 23, 2011 at 3:23 PM, Radosław Smogura
>>> >>
>>> >> <rsmogura(at)softperience(dot)eu> wrote:
>>> >> > Simple allocating whole file and pointer add (as I found on
>>> some
>>> >> > forum, too),
>>> >>
>>> >> got a link for that?
>>> >>
>>> >> > is performance killer. Query executes 2.5x slower. Adding
>>> mlock is
>>> >> > next performance killer, hehe.
>>> >>
>>> >> there is no reason to mlock in postgres.
>>> >>
>>> >> > I saw mmaped code is really sensitive. Commenting/uncommenting
>>> >> > statement that doesn't gives anything to code flow may kill
>>> >> > performance, maybe kernel swaps out pages.
>>> >>
>>> >> hm. you are sure mmap is slower??
>>> >>
>>> >> merlin
>>> >
>>> > I found da light,
>>> >
>>> > Hehe. When I switched to mmap of the "whole file", I had compiled
>>> > pg with debug, no optimization and casserts!!!
>>> >
>>> > mmap is really faster: a query that took 450ms went down to 410ms,
>>> > and when I bootstrap the mmapping the query takes 430ms (situation:
>>> > one query, one backend).
>>>
>>> This is really good news. I ran several tests and mmap is
>>> outperforming read() by a factor of 2x in some cases and
>>> underperforming in others.  I'm still not sure it will work out to
>>> win in the end.
>>>
>>> I did some more looking in terms of how deeply you can replace the
>>> shared buffer implementation.  Hooking into the current bufmgr is the
>>> simpler approach.  Critical logic (XLogFlush) is fired when buffers
>>> leave the shared buffer system.  Insertion into WAL (XLogInsert),
>>> however, is managed outside of bufmgr.
>>>
>>> shared buffers play two critical roles: they buffer pages on top of
>>> the file cache, but they also stage dirty data so you are not
>>> constantly flushing xlog.  my idea yesterday would not provide that
>>> xlog staging.  however, the server would probably perform better with
>>> buffers reserved strictly for write caching.
>>>
>>> Just thinking out loud. I'm learning as I go.
>>>
>>> merlin
>> I think the read path is done (no locals, no API cleanup); the "final"
>> solution was to not mmap the whole file only, but to mmap it with the
>> maximum size - the segment size :)
>>
>> In addition there is a crash report API with a simple generator. If
>> you compile with -ggdb -g3, nice results may be printed in case of a
>> crash.
>>
>> Added simple tests and a switch for mmap in configure.
>>
>> I don't know if I will have time to look at mmap this weekend.
>
> thanks -- I'll take a look.
>
> merlin

I think I have done this, at least at a simple level, without big
optimizations etc. At least it works for one client. You must still run
initdb from the original sources (there is a bug somewhere in initdb: a
read or write past the end of the mmapped segment). Autovacuum etc. is a
killer. I think I preserve WAL before data, and I hope I use shared
buffers. It's still really buggy.
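
Roughly, the "mmap with maximum size = segment size" idea above amounts to
something like the sketch below (illustrative only, not code from the
patch; SegmentMap, map_segment and segment_block are made-up names):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLCKSZ      8192                 /* PostgreSQL block size */
#define RELSEG_SIZE 131072               /* blocks per 1 GB segment */

typedef struct SegmentMap
{
    int     fd;
    char   *base;                        /* start of the mapped segment */
} SegmentMap;

/* Map one relation segment at its full possible size up front. */
static int
map_segment(const char *path, SegmentMap *seg)
{
    seg->fd = open(path, O_RDWR);
    if (seg->fd < 0)
        return -1;

    /*
     * Map a segment-sized range even while the file is still short; pages
     * beyond the current EOF must not be touched until the file grows,
     * but the mapping itself never has to be redone.
     */
    seg->base = mmap(NULL, (size_t) RELSEG_SIZE * BLCKSZ,
                     PROT_READ | PROT_WRITE, MAP_SHARED, seg->fd, 0);
    if (seg->base == MAP_FAILED)
    {
        close(seg->fd);
        return -1;
    }
    return 0;
}

/* A block is then reached by plain pointer arithmetic. */
static char *
segment_block(SegmentMap *seg, unsigned blockno)
{
    return seg->base + (size_t) blockno * BLCKSZ;
}

The point is that the mapping is created once per 1 GB segment and never
has to be remade as the file grows, so reaching a block is just pointer
arithmetic rather than a read() into a buffer.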

Sometimes the db crashes from other processes, and I didn't attach crash
reports for those.
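
For reference, the "WAL before data" rule mentioned above boils down to:
a data page must not reach disk before the WAL describing its changes has
been flushed. A minimal stand-alone sketch (all names here are invented
stand-ins; in PostgreSQL the flush step is XLogFlush and the write goes
through smgrwrite):

#include <stdint.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr wal_flushed_upto = 0;  /* how far WAL is on disk */

static void
flush_wal_upto(XLogRecPtr upto)
{
    /* stand-in for XLogFlush(): fsync WAL through 'upto' */
    wal_flushed_upto = upto;
}

static void
write_page_to_disk(char *page)
{
    (void) page;                         /* stand-in for smgrwrite() */
}

/* WAL-before-data: flush WAL through the page's LSN before writing it. */
static void
write_dirty_page(char *page, XLogRecPtr page_lsn)
{
    if (page_lsn > wal_flushed_upto)
        flush_wal_upto(page_lsn);
    write_page_to_disk(page);
}

With a plain mmap of the data files this is the tricky part, because the
kernel is free to write a dirty mapped page back at any moment, so the
ordering has to be enforced by other means.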

Regards,
Radek

Attachment Content-Type Size
pg_mmap_20110331_writing.diff.bz2 application/x-bzip2 132.2 KB

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-31 23:41:25
Message-ID: 4D951125.3030206@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/24/2011 03:36 PM, Jim Nasby wrote:
> On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
>
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>
>>> It looks like the only way anything can ever get put on the free list
>>> right now is if a relation or database is dropped. That doesn't seem
>>> too good.
>>>
>> Why not? AIUI the free list is only for buffers that are totally dead,
>> ie contain no info that's possibly of interest to anybody. It is *not*
>> meant to substitute for running the clock sweep when you have to discard
>> a live buffer.
>>
> Turns out we've had this discussion before: http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php
>
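
The distinction Tom draws can be sketched roughly like this (a simplified
illustration, not the actual freelist.c code; locking, pinning details and
the all-pinned error case are omitted):

#define NBUFFERS 1024                  /* illustrative pool size */

typedef struct BufDesc
{
    int refcount;                      /* pins currently held */
    int usage_count;                   /* bumped on each access */
    int free_next;                     /* next free buffer, or -1 */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int     first_free = -1;        /* buffers with no useful content */
static int     next_victim = 0;        /* the clock hand */

static BufDesc *
get_victim_buffer(void)
{
    /* Totally dead buffers (e.g. from a dropped relation) come for free. */
    if (first_free >= 0)
    {
        BufDesc *buf = &buffers[first_free];

        first_free = buf->free_next;
        return buf;
    }

    /* Otherwise evicting a live buffer always means running the sweep. */
    for (;;)
    {
        BufDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;    /* recently used: spare it this lap */
            else
                return buf;            /* cold and unpinned: evict it */
        }
    }
}

So an empty free list is the normal state rather than a problem in itself;
the interesting question is how well the sweep picks victims.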

Investigating this has been on the TODO list for four years now:

http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php

I feel that work in this area is blocked behind putting together a
decent mix of benchmarks that can be used to test whether changes here
are actually good or bad. All of the easy changes to buffer allocation
strategy, ones that you could verify by inspection and simple tests,
were made in 8.3. The stuff that's left has the potential to either
improve or reduce performance, and which will happen is very workload
dependent.

Setting up systematic benchmarks of multiple workloads to run
continuously on big hardware is a large, boring, expensive problem that
few can justify financing (except for Jim of course), and even fewer
want to volunteer time toward. This whole discussion of cache policy
tweaks is fun, but I just delete all the discussion now because it's
just going in circles without a good testing regime. The right way to
start is by saying "this is the benchmark I'm going to improve with this
change, and it has a profiled hotspot at this point".

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 2nd Level Buffer Cache
Date: 2011-04-26 19:49:31
Message-ID: 201104261949.p3QJnVA14947@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
>
> > Was it really all that bad? IIRC we replaced ARC with the current clock
> > sweep due to patent concerns. (Maybe there were performance concerns as
> > well, I don't remember).
>
> Yeah, that was why the patent was frustrating. Performance was poor and
> we were planning on replacing ARC in 8.2 anyway. Instead we had to
> backport it.

[ Replying late.]

FYI, the performance problem was that while ARC was slightly better than
clock sweep in keeping useful buffers in the cache, it was terrible when
multiple CPUs were all modifying the buffer cache, which is why we were
going to remove it anyway.

In summary, any new algorithm has to be better at keeping useful data in
the cache, and also not slow down workloads on multiple CPUs.
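
To illustrate the multi-CPU point (a made-up sketch, not the old ARC code):
an ARC/LRU-style policy has to move a buffer within a shared list under one
global lock on every hit, while the clock sweep only touches that buffer's
own counter.

#include <pthread.h>
#include <stddef.h>

typedef struct Buf
{
    struct Buf *prev;
    struct Buf *next;                  /* position in the shared recency list */
    int         usage_count;
} Buf;

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static Buf *mru_head;

/* ARC/LRU style: every buffer hit serializes on the global list lock. */
static void
buffer_hit_lru(Buf *buf)
{
    pthread_mutex_lock(&list_lock);
    if (mru_head != buf)
    {
        /* unlink from the current position ... */
        if (buf->prev)
            buf->prev->next = buf->next;
        if (buf->next)
            buf->next->prev = buf->prev;
        /* ... and relink at the most-recently-used end */
        buf->prev = NULL;
        buf->next = mru_head;
        if (mru_head)
            mru_head->prev = buf;
        mru_head = buf;
    }
    pthread_mutex_unlock(&list_lock);
}

/* Clock-sweep style: a hit just bumps this buffer's own counter. */
static void
buffer_hit_clock(Buf *buf)
{
    if (buf->usage_count < 5)          /* capped, as in the real code */
        buf->usage_count++;
}

With many backends hitting hot buffers, every call to the first function
fights over list_lock; the second touches only per-buffer state (in the
real code a per-buffer spinlock rather than a global lock).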

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +