From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Experimental ARC implementation
Date: 2003-11-03 15:33:21
Message-ID: 3FA67541.9020105@Yahoo.com
Lists: pgsql-hackers

Attached is a first trial implementation of the Adaptive Replacement
Cache (ARC). The patch is against CVS HEAD of 7.4.

The algorithm is based on what's described in these papers:

http://www.almaden.ibm.com/StorageSystems/autonomic_storage/ARC/arcfast.pdf
http://www.almaden.ibm.com/StorageSystems/autonomic_storage/ARC/oneup.pdf

kindly pointed out by Cuong Bui a few days ago.

To work with PostgreSQL, and to largely preserve our separation between
the buffer manager and the cache replacement policy, I implemented it
as a set of (hopefully) strategy-independent calls that mostly replace
freelist.c. The main reason the code differs visually from the
pseudocode found in oneup.pdf is that it needs to be concurrency safe.
In particular, the underlying list structures can be changed by other
backends during I/O, and that happens halfway through the algorithm.
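
For those who don't want to dig through the papers right away, here is
a compressed, single-backend C sketch of the decision logic they
describe - without the locking and concurrent-IO handling, which is
exactly the part that makes the real code look different. All names
(arc_access, the list helpers, PageTag, and so on) are invented for
this illustration and do not match the identifiers in the patch.

#include "postgres.h"           /* Min(), Max() */

/* Simplified ARC bookkeeping: c real buffers, a directory of 2*c
 * entries.  T1/T2 hold resident pages, B1/B2 are "ghost" lists that
 * only remember the tags of recently evicted pages.  The helpers are
 * hypothetical: llen() is the list length, in_list() tests membership,
 * move_to_mru(list, x) puts x at the MRU end of list (removing it from
 * wherever it was), move_lru(src, dst) moves the LRU entry of src to
 * the MRU end of dst, and drop_lru() discards the LRU entry.
 */
typedef struct ArcState
{
    ArcList    *T1, *T2, *B1, *B2;
    int         c;              /* number of real buffers          */
    int         p;              /* adaptive target size for T1     */
} ArcState;

static void
arc_replace(ArcState *a, bool hit_in_B2)
{
    if (llen(a->T1) >= 1 &&
        (llen(a->T1) > a->p || (hit_in_B2 && llen(a->T1) == a->p)))
        move_lru(a->T1, a->B1); /* evict T1 LRU, remember tag in B1 */
    else
        move_lru(a->T2, a->B2); /* evict T2 LRU, remember tag in B2 */
}

static void
arc_access(ArcState *a, PageTag x)
{
    int     total = llen(a->T1) + llen(a->T2) + llen(a->B1) + llen(a->B2);

    if (in_list(a->T1, x) || in_list(a->T2, x))
        move_to_mru(a->T2, x);              /* real cache hit */
    else if (in_list(a->B1, x))
    {
        /* ghost hit in B1: T1 was too small, grow its target */
        a->p = Min(a->c, a->p + Max(llen(a->B2) / llen(a->B1), 1));
        arc_replace(a, false);
        move_to_mru(a->T2, x);              /* fault page in, goes to T2 */
    }
    else if (in_list(a->B2, x))
    {
        /* ghost hit in B2: T2 was too small, shrink T1's target */
        a->p = Max(0, a->p - Max(llen(a->B1) / llen(a->B2), 1));
        arc_replace(a, true);
        move_to_mru(a->T2, x);
    }
    else                                    /* complete miss */
    {
        if (llen(a->T1) + llen(a->B1) == a->c)
        {
            if (llen(a->T1) < a->c)
            {
                drop_lru(a->B1);
                arc_replace(a, false);
            }
            else
                drop_lru(a->T1);            /* B1 empty, discard T1 LRU */
        }
        else if (total >= a->c)
        {
            if (total == 2 * a->c)
                drop_lru(a->B2);
            arc_replace(a, false);
        }
        move_to_mru(a->T1, x);              /* first reference goes to T1 */
    }
}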

The major change to the existing freelist is an additional layer of
indirection. The original buffer descriptors now only hold a cache page
and its clean/dirty information and such. The new layer is the cache
directory, whose size is different from the number of data blocks in
the buffer cache.
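
Roughly sketched, the two levels look like this (the field and type
names are invented for the example, not taken from the patch):

/* Level 1: one descriptor per real data page held in shared memory.
 * It only knows about the page itself and its state.
 */
typedef struct BufferDescSketch
{
    Page        page;           /* the 8K data block                   */
    bool        dirty;          /* must be written before reuse        */
    int         refcount;       /* pins held by active backends        */
} BufferDescSketch;

/* Level 2: one directory entry per tag the strategy keeps track of.
 * The directory has more entries than there are buffers, so it can
 * also remember "ghost" pages (B1/B2) whose data is no longer cached.
 */
typedef struct CacheDirEntrySketch
{
    BufferTag   tag;            /* which relation/block this refers to */
    int         list;           /* T1, T2, B1 or B2                    */
    int         buf_id;         /* index of the buffer descriptor,     */
                                /* or -1 for a ghost entry             */
} CacheDirEntrySketch;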

With a TPC-C like transaction profile and a buffer cache size that
allows 10-15% of the database to stay resident, I am not able to see
any dramatic change in behaviour or performance. At most, the
difference could be called a "light relief". But that transaction
profile is not a good mix compared to any real-world application
anyway, so I wouldn't put too much weight on that result. It's also
possible that I made some mistakes in the implementation, so please
keep a critical eye on that.

When playing with it, please keep a few facts in mind. A 64MB cache
(shared_buffers=8192) needs about 10 minutes of full transaction load
to populate; in benchmark terms this is called ramp-up time. The
algorithm aims at scan resistance, so a pure index access approach will
not show any significant changes. Therefore, don't tell me that a 1x
scale, 10 client by 1000 transactions pgbench run doesn't show any
performance boost; it cannot.

I will follow up shortly with an approach that integrates Tom's delay
mechanism plus my first READ_BY_VACUUM hack into one combined experiment.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

Attachment Content-Type Size
cache_strategy_ARC.74.diff text/plain 36.4 KB

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental ARC implementation
Date: 2003-11-03 21:56:33
Message-ID: 3FA6CF11.9000101@Yahoo.com
Lists: pgsql-hackers

Jan Wieck wrote:

> I will follow up shortly with an approach that integrates Tom's delay
> mechanism plus my first READ_BY_VACUUM hack into one combined experiment.

Okay,

the attached patch contains the three changes already discussed plus
one additional change, along with a few smaller adjustments:

1) ARC policy. It has no configurable options, as it is fully self-tuning.

2) vacuum_page_delay. I changed the algorithm so that it no longer does
a usleep() per page. The millisecond usleep() is now done once every
vacuum_page_groupsize pages (see the sketch after this list). This
makes particular sense when sleeps can only be tuned in 10ms intervals.

3) Pages faulted in by VACUUM are placed at the T1 head (LRU position)
in the ARC strategy. This makes vacuum even more conservative about
cache pollution than a sequential scan.

4) A new config option lazy_checkpoint_time causes automatic, timeout
controlled checkpoints (the background ones done by the postmaster -
and only those) to spread the BufferSync() out over the specified
number of seconds. This activity does not include the smgrsync() that
actually causes the kernel to force the data out to disk. But it gives
the kernel time to start doing something already, and thereby possibly
shortens the IO burst caused by smgrsync().
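
As an illustration of point 2, the per-page hook inside VACUUM could
look roughly like the following. This is a sketch only; the GUC names
are the ones mentioned above, everything else is invented.

#include <unistd.h>

/* GUCs from this patch: delay in milliseconds, group size in pages. */
extern int  vacuum_page_delay;
extern int  vacuum_page_groupsize;

static int  vacuum_pages_since_sleep = 0;

/*
 * Called once per page processed by VACUUM.  Instead of sleeping after
 * every page, sleep once per vacuum_page_groupsize pages, which works
 * much better when the kernel rounds every sleep up to 10ms.
 */
static void
vacuum_delay_point_sketch(void)
{
    if (vacuum_page_delay <= 0)
        return;

    if (++vacuum_pages_since_sleep >= vacuum_page_groupsize)
    {
        usleep(vacuum_page_delay * 1000L);      /* ms -> microseconds */
        vacuum_pages_since_sleep = 0;
    }
}

With the settings below (vacuum_page_delay = 100, vacuum_page_groupsize
= 25) that amounts to one 100ms nap per 25 pages.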

I started with

vacuum_page_delay = 100
vacuum_page_groupsize = 25
checkpoint_timeout = 600
lazy_checkpoint_time = 300

and the system runs a lot smoother than before. The only performance
drop not yet addressed now occurs after flushing all buffers and
finally syncing during the checkpoints, and I don't have any idea how
to tackle that one yet.

Comments/Feedback?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

Attachment Content-Type Size
all_performance.74.diff.gz application/x-gzip 13.4 KB

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental ARC implementation
Date: 2003-11-04 14:14:58
Message-ID: 3FA7B462.7070709@Yahoo.com
Lists: pgsql-hackers

Jan Wieck wrote:
> Jan Wieck wrote:
>
>> I will follow up shortly with an approach that integrates Tom's delay
>> mechanism plus my first READ_BY_VACUUM hack into one combined experiment.
>
> Okay,
>
> the attached patch contains the three changes already discussed plus
> one additional change.

Ooopsy

the B1/B2 queue length adjustment in that one was total nonsense. This
one behaves much better.

I added a DEBUG1 elog every 10 seconds to monitor the cache hit rates
and cache size adjustments. It's pretty neat to watch how it responds
to running an OLTP kind of workload and then issuing a VACUUM and
running big sequential reporting suckers in parallel.
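
The exact output format is whatever ends up in the patch; a
hypothetical version of such a periodic statistics message could be as
simple as this (the counter names are invented for illustration):

#include "postgres.h"           /* elog(), DEBUG1 */

/*
 * Report cache statistics, called roughly every 10 seconds from the
 * strategy code.
 */
static void
arc_report_stats_sketch(int t1_hits, int t2_hits, int b1_hits,
                        int b2_hits, int lookups, int t1_target,
                        int num_buffers)
{
    if (lookups <= 0)
        return;

    elog(DEBUG1,
         "ARC: T1 hits %d (%.1f%%), T2 hits %d (%.1f%%), "
         "B1 ghost hits %d, B2 ghost hits %d, T1 target %d of %d buffers",
         t1_hits, 100.0 * t1_hits / lookups,
         t2_hits, 100.0 * t2_hits / lookups,
         b1_hits, b2_hits, t1_target, num_buffers);
}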

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

Attachment Content-Type Size
all_performance.v2.74.diff.gz application/x-gzip 13.8 KB

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental ARC implementation
Date: 2003-11-04 20:39:51
Message-ID: 3FA80E97.1070608@Yahoo.com
Lists: pgsql-hackers

Jan Wieck wrote:

> Jan Wieck wrote:
>> Jan Wieck wrote:
>>
>>> I will follow up shortly with an approach that integrates Tom's delay
>>> mechanism plus my first READ_BY_VACUUM hack into one combined experiment.
>>
>> Okay,
>>
>> the attached patch contains the three changes already discussed plus
>> one additional change.
>
> Ooopsy

Ooops-2

but I'm getting closer.

I guess I polluted the list enough. The latest patch is now here:

http://developer.postgresql.org/~wieck/all_performance.v3.74.diff.gz

This one now correctly keeps T1len+B1len at about the number of buffers,
which is half the directory size. The earlier versions favored T1 too
much.
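
Written down as invariants (taken from the ARC papers; the function and
variable names are hypothetical, Assert() is the usual macro), that
sizing means:

#include "postgres.h"           /* Assert() */

/*
 * Sanity check of the list lengths against the invariants from the
 * ARC papers.  c is the number of real buffers; the cache directory
 * has 2*c entries.
 */
static void
arc_check_invariants_sketch(int t1_len, int t2_len,
                            int b1_len, int b2_len, int c)
{
    Assert(t1_len + t2_len <= c);           /* resident pages fit in cache */
    Assert(t1_len + b1_len <= c);           /* recency side uses at most   */
                                            /* half of the directory       */
    Assert(t1_len + t2_len + b1_len + b2_len <= 2 * c);
}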

It also contains the initial work on the discussed background buffer
writer. So far, the BufferSync() done at a checkpoint only writes out
all dirty blocks, in their LRU order and spread over a configurable
time (lazy_checkpoint_time in seconds). But that means that, at least
while the checkpoint is running, the backends should not need to flush
dirty buffers themselves, since all the candidates they get for
replacement are clean. My plan is to create another background process,
very similar to the checkpointer, and let it run forever, basically
looping over that BufferSync() with a bool telling it that it is the
bg_writer.
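
A minimal sketch of what that process could look like, assuming a
BufferSync() variant with the bool argument described above
(everything here is hypothetical; none of it is in the attached patch
yet):

#include "postgres.h"
#include <unistd.h>

/* Stand-in for the modified BufferSync() described above. */
extern void BufferSyncSketch(bool am_bg_writer);

/*
 * Hypothetical background writer main loop: like the checkpointer it
 * runs forever, but instead of a full checkpoint it just keeps
 * trickling dirty buffers out in LRU order, so backends rarely have
 * to write a dirty buffer themselves before reusing it.
 */
static void
BackgroundWriterMainSketch(void)
{
    for (;;)
    {
        /* write out (some) dirty buffers, least recently used first */
        BufferSyncSketch(true);

        /* nap a little so we don't spin when everything is clean */
        sleep(1);
    }
}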

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental ARC implementation
Date: 2003-11-07 03:48:29
Message-ID: 200311070348.hA73mTq06434@candle.pha.pa.us
Lists: pgsql-hackers

Jan Wieck wrote:
> It also contains the initial work on the discussed background buffer
> writer. So far, the BufferSync() done at a checkpoint only writes out
> all dirty blocks, in their LRU order and spread over a configurable
> time (lazy_checkpoint_time in seconds). But that means that, at least
> while the checkpoint is running, the backends should not need to flush
> dirty buffers themselves, since all the candidates they get for
> replacement are clean. My plan is to create another background
> process, very similar to the checkpointer, and let it run forever,
> basically looping over that BufferSync() with a bool telling it that
> it is the bg_writer.

Have you considered having the background writer check the pages it is
about to write to see if they can be added to the FSM, thereby reducing
the need for vacuum? Seems we would need to add a statistics parameter
so pg_autovacuum would know how many tuples the background writer added
to the freespace map, so it doesn't vacuum a table that doesn't need it.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental ARC implementation
Date: 2003-11-07 04:08:36
Message-ID: 25585.1068178116@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Have you considered having the background writer check the pages it is
> about to write to see if they can be added to the FSM, thereby reducing
> the need for vacuum?

The 7.4 rewrite of FSM depends on the assumption that all the free space
in a given relation is reported to FSM in one batch (ie, at the end of a
VACUUM pass). This solves problems in both speed (page-at-a-time update
of FSM was horrendously expensive) and space allocation policy (we now
use the number of "interesting" pages in each relation as input
information for the allocation policy). To do anything like the above,
you'd need to find different solutions to these problems.
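
To make the batching point concrete, here is a rough sketch of that
shape of interface. The names and types are invented for illustration
and are not the actual freespace.c API: VACUUM collects per-page free
space and hands everything to the FSM in one call at the end of its
pass, rather than poking the FSM once per page.

#include "postgres.h"

/*
 * Hypothetical sketch of batch free-space reporting as opposed to a
 * per-page FSM update.
 */
typedef struct PageFreeSpaceSketch
{
    BlockNumber blkno;          /* page within the relation        */
    Size        avail;          /* usable free bytes on that page  */
} PageFreeSpaceSketch;

/* one call per relation, issued at the end of the VACUUM pass */
extern void fsm_record_relation_sketch(Oid relid,
                                       PageFreeSpaceSketch *pages,
                                       int npages);

/* hypothetical: vacuum one page, return its usable free space */
extern Size vacuum_one_page_sketch(Oid relid, BlockNumber blkno);

static void
vacuum_relation_sketch(Oid relid, BlockNumber nblocks)
{
    PageFreeSpaceSketch *pages;
    BlockNumber blk;
    int         ninteresting = 0;

    pages = palloc(nblocks * sizeof(PageFreeSpaceSketch));

    for (blk = 0; blk < nblocks; blk++)
    {
        Size    avail = vacuum_one_page_sketch(relid, blk);

        if (avail > 0)
        {
            pages[ninteresting].blkno = blk;
            pages[ninteresting].avail = avail;
            ninteresting++;
        }
    }

    /*
     * A single batch update: cheap, and the FSM also learns how many
     * "interesting" pages this relation has, which feeds its space
     * allocation policy.
     */
    fsm_record_relation_sketch(relid, pages, ninteresting);
    pfree(pages);
}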

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental ARC implementation
Date: 2003-11-07 04:36:00
Message-ID: 200311070436.hA74a0Q11581@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Have you considered having the background writer check the pages it is
> > about to write to see if they can be added to the FSM, thereby reducing
> > the need for vacuum?
>
> The 7.4 rewrite of FSM depends on the assumption that all the free space
> in a given relation is reported to FSM in one batch (ie, at the end of a
> VACUUM pass). This solves problems in both speed (page-at-a-time update
> of FSM was horrendously expensive) and space allocation policy (we now
> use the number of "interesting" pages in each relation as input
> information for the allocation policy). To do anything like the above,
> you'd need to find different solutions to these problems.

Yea, shame. I never liked sequentially scanning a huge table just to
find the few free tuples.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental ARC implementation
Date: 2003-11-07 06:31:56
Message-ID: 87islwlmeb.fsf@stark.dyndns.tv
Lists: pgsql-hackers


Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

> Have you considered having the background writer check the pages it is
> about to write to see if they can be added to the FSM, thereby reducing
> the need for vacuum? Seems we would need to add a statistics parameter
> so pg_autovacuum would know how many tuples the background writer added
> to the freespace map, so it doesn't vacuum a table that doesn't need it.

This would suffer from the previously mentioned problem of having to pull in
index pages and dirty them when it's trying to flush and clean pages.

Conceivably it could just count up the dead tuples and provide that
information to something like pg_autovacuum so it knows when it's time to run
a vacuum. I don't see that as all that much of a win over the current
heuristics. At best it means a big batch update will trigger a vacuum sooner
so you don't have to manually run vacuum to avoid overflowing the fsm.

--
greg


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental ARC implementation
Date: 2003-11-07 12:03:18
Message-ID: 200311071203.hA7C3I207025@candle.pha.pa.us
Lists: pgsql-hackers

Greg Stark wrote:
>
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>
> > Have you considered having the background writer check the pages it is
> > about to write to see if they can be added to the FSM, thereby reducing
> > the need for vacuum? Seems we would need to add a statistics parameter
> > so pg_autovacuum would know how many tuples the background writer added
> > to the freespace map, so it doesn't vacuum a table that doesn't need it.
>
> This would suffer from the previously mentioned problem of having to pull in
> index pages and dirty them when it's trying to flush and clean pages.

I am confused why the index would be involved in this.

> Conceivably it could just count up the dead tuples and provide that
> information to something like pg_autovacuum so it knows when it's time to run
> a vacuum. I don't see that as all that much of a win over the current
> heuristics. At best it means a big batch update will trigger a vacuum sooner
> so you don't have to manually run vacuum to avoid overflowing the fsm.

Yea, probably. Another idea would be for tuple reuse to favor pages
already in the cache so it doesn't have to read in a page from disk.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental ARC implementation
Date: 2003-11-07 15:32:36
Message-ID: 3FABBB14.8050604@Yahoo.com
Lists: pgsql-hackers

Bruce Momjian wrote:

> Greg Stark wrote:
>>
>> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>>
>> > Have you considered having the background writer check the pages it is
>> > about to write to see if they can be added to the FSM, thereby reducing
>> > the need for vacuum? Seems we would need to add a statistics parameter
>> > so pg_autovacuum would know how many tuples the background writer added
>> > to the freespace map, so it doesn't vacuum a table that doesn't need it.
>>
>> This would suffer from the previously mentioned problem of having to pull in
>> index pages and dirty them when it's trying to flush and clean pages.
>
> I am confused why the index would be involved in this.

Looks like Greg mixed up background vacuuming with background FSM
updates; the problem he mentions is a vacuum-some-page-at-hand problem,
not related to the FSM.

>
>> Conceivably it could just count up the dead tuples and provide that
>> information to something like pg_autovacuum so it knows when it's time to run
>> a vacuum. I don't see that as all that much of a win over the current
>> heuristics. At best it means a big batch update will trigger a vacuum sooner
>> so you don't have to manually run vacuum to avoid overflowing the fsm.
>
> Yea, probably. Another idea would be for tuple reuse to favor pages
> already in the cache so it doesn't have to read in a page from disk.

It has little chance to remember how many dead tuples were added
between this flush and its corresponding read, or more precisely, since
the last time this buffer got flushed. And we certainly don't want to
modify the on-disk block structure just for that.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #