Re: Experimental patch for inter-page delay in VACUUM

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 02:40:33
Message-ID: 5464.1067568033@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Attached is an extremely crude prototype patch for making VACUUM delay
by a configurable amount between pages, in hopes of throttling its disk
bandwidth consumption. By default, there is no delay (so no change in
behavior). In some quick testing, setting vacuum_page_delay to 10
(milliseconds) seemed to greatly reduce a background VACUUM's impact
on pgbench timing on an underpowered machine. Of course, it also makes
VACUUM a lot slower, but that's probably not a serious concern for
background VACUUMs.

I am not proposing this for application to the master sources yet, but
I would be interested to get some feedback from people who see serious
performance degradation while VACUUM is running. Does it help? What do
you find to be a good setting for vacuum_page_delay?

Assuming that this is found to be useful, the following issues would
have to be dealt with before the patch would be production quality:

1. The patch depends on usleep() which is not present on all platforms,
and may have unwanted side-effects on SIGALRM processing on some
platforms. We'd need to replace that with something else, probably
a select() call.

2. I only bothered to insert delays in the processing loops of plain
VACUUM and btree index cleanup. VACUUM FULL and cleanup of non-btree
indexes aren't done yet.

3. No documentation...

The patch is against CVS tip, but should apply cleanly to any recent
7.4 beta. You could likely adapt it to 7.3 without much effort.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 3.2 KB

From: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 05:16:40
Message-ID: 3FA1F038.8050804@zeut.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>Attached is an extremely crude prototype patch for making VACUUM delay
>by a configurable amount between pages,
>
Cool!

>Assuming that this is found to be useful, the following issues would
>have to be dealt with before the patch would be production quality:
>
>2. I only bothered to insert delays in the processing loops of plain
> VACUUM and btree index cleanup. VACUUM FULL and cleanup of non-btree
> indexes aren't done yet.
>
>
I thought we didn't want the delay in vacuum full since it locks things
down, we want vacuum full to finish ASAP. As opposed to normal vacuum
which would be fired by the autovacuum daemon.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 05:53:14
Message-ID: 7473.1067579594@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Matthew T. O'Connor" <matthew(at)zeut(dot)net> writes:
> Tom Lane wrote:
>> 2. I only bothered to insert delays in the processing loops of plain
>> VACUUM and btree index cleanup. VACUUM FULL and cleanup of non-btree
>> indexes aren't done yet.
>>
> I thought we didn't want the delay in vacuum full since it locks things
> down, we want vacuum full to finish ASAP. As opposed to normal vacuum
> which would be fired by the autovacuum daemon.

My thought was that it'd be up to the user to set vacuum_page_delay
appropriately for what he is doing. It might or might not ever make
sense to use a nonzero delay in VACUUM FULL, but the facility should be
there. (Since plain and full VACUUM share the same index cleanup code,
it would take some klugery to implement a policy of "no delays for
VACUUM FULL" anyway.)

Best practice would likely be to leave the default vacuum_page_delay at
zero, and have the autovacuum daemon set a nonzero value for vacuums it
issues.

regards, tom lane


From: "Stephen" <jleelim(at)xxxxxx(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 16:11:53
Message-ID: wQvob.2098$qV5.1493@nntp-post.primus.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Great! I haven't tried it yet, but I love the thought of it already :-)
I've been waiting for something like this for the past 2 years and now it's
going to make my multi-gigabyte PostgreSQL more usable and responsive. Will
the delay be tunable per VACUUM invocation? This is needed for different
tables that require different VACUUM priorities (eg. For small tables that
are rarely used, I rather vacuum with zero delay. For big tables, I'd set a
reasonable delay in vacuum and let it run through the day & nite).

Regards,

Stephen

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in message
news:7473(dot)1067579594(at)sss(dot)pgh(dot)pa(dot)us(dot)(dot)(dot)
> "Matthew T. O'Connor" <matthew(at)zeut(dot)net> writes:
> > Tom Lane wrote:
> >> 2. I only bothered to insert delays in the processing loops of plain
> >> VACUUM and btree index cleanup. VACUUM FULL and cleanup of non-btree
> >> indexes aren't done yet.
> >>
> > I thought we didn't want the delay in vacuum full since it locks things
> > down, we want vacuum full to finish ASAP. As opposed to normal vacuum
> > which would be fired by the autovacuum daemon.
>
> My thought was that it'd be up to the user to set vacuum_page_delay
> appropriately for what he is doing. It might or might not ever make
> sense to use a nonzero delay in VACUUM FULL, but the facility should be
> there. (Since plain and full VACUUM share the same index cleanup code,
> it would take some klugery to implement a policy of "no delays for
> VACUUM FULL" anyway.)
>
> Best practice would likely be to leave the default vacuum_page_delay at
> zero, and have the autovacuum daemon set a nonzero value for vacuums it
> issues.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 16:42:20
Message-ID: 200310311642.h9VGgKW29891@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> "Matthew T. O'Connor" <matthew(at)zeut(dot)net> writes:
> > Tom Lane wrote:
> >> 2. I only bothered to insert delays in the processing loops of plain
> >> VACUUM and btree index cleanup. VACUUM FULL and cleanup of non-btree
> >> indexes aren't done yet.
> >>
> > I thought we didn't want the delay in vacuum full since it locks things
> > down, we want vacuum full to finish ASAP. As opposed to normal vacuum
> > which would be fired by the autovacuum daemon.
>
> My thought was that it'd be up to the user to set vacuum_page_delay
> appropriately for what he is doing. It might or might not ever make
> sense to use a nonzero delay in VACUUM FULL, but the facility should be
> there. (Since plain and full VACUUM share the same index cleanup code,
> it would take some klugery to implement a policy of "no delays for
> VACUUM FULL" anyway.)
>
> Best practice would likely be to leave the default vacuum_page_delay at
> zero, and have the autovacuum daemon set a nonzero value for vacuums it
> issues.

What is the advantage of delaying vacuum per page vs. just doing vacuum
less frequently?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 17:05:06
Message-ID: 4211.1067619906@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> What is the advantage of delaying vacuum per page vs. just doing vacuum
> less frequently?

The point is the amount of load VACUUM poses while it's running. If
your setup doesn't have a lot of disk bandwidth to spare, a background
VACUUM can hurt the performance of your foreground applications quite
a bit. Running it less often doesn't improve this issue at all.

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 17:22:50
Message-ID: 3FA29A6A.2000108@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

> Tom Lane wrote:
>> "Matthew T. O'Connor" <matthew(at)zeut(dot)net> writes:
>> > Tom Lane wrote:
>> >> 2. I only bothered to insert delays in the processing loops of plain
>> >> VACUUM and btree index cleanup. VACUUM FULL and cleanup of non-btree
>> >> indexes aren't done yet.
>> >>
>> > I thought we didn't want the delay in vacuum full since it locks things
>> > down, we want vacuum full to finish ASAP. As opposed to normal vacuum
>> > which would be fired by the autovacuum daemon.
>>
>> My thought was that it'd be up to the user to set vacuum_page_delay
>> appropriately for what he is doing. It might or might not ever make
>> sense to use a nonzero delay in VACUUM FULL, but the facility should be
>> there. (Since plain and full VACUUM share the same index cleanup code,
>> it would take some klugery to implement a policy of "no delays for
>> VACUUM FULL" anyway.)
>>
>> Best practice would likely be to leave the default vacuum_page_delay at
>> zero, and have the autovacuum daemon set a nonzero value for vacuums it
>> issues.
>
> What is the advantage of delaying vacuum per page vs. just doing vacuum
> less frequently?

It gives regular backends more time to "retouch" the pages they actually
need before they fall off the end of the LRU list.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 17:54:11
Message-ID: 60ad7h70os.fsf@dev6.int.libertyrms.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

pgman(at)candle(dot)pha(dot)pa(dot)us (Bruce Momjian) writes:
> Tom Lane wrote:
>> Best practice would likely be to leave the default vacuum_page_delay at
>> zero, and have the autovacuum daemon set a nonzero value for vacuums it
>> issues.
>
> What is the advantage of delaying vacuum per page vs. just doing vacuum
> less frequently?

If the vacuum is deferred, that merely means that you put off the
"slow to a crawl" until a bit later. It is a given that the system
will slow to a crawl for the duration of the vacuum; you are merely
putting it off a bit.

The advantage of the per-page delay is that performance is not being
"totally hammered" by the vacuum. If things are so busy that it's an
issue, the system is liable to "limp somewhat," but that's not as bad
as what we see now, where VACUUM and other activity are 'dueling' for
access to I/O. Per-page delay means that VACUUM mostly defers to the
other activity, limiting how badly it hurts other performance.
--
output = reverse("ofni.smrytrebil" "@" "enworbbc")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Stephen" <jleelim(at)xxxxxx(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 19:07:01
Message-ID: 6254.1067627221@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Stephen" <jleelim(at)xxxxxx(dot)com> writes:
> Great! I haven't tried it yet, but I love the thought of it already :-)
> I've been waiting for something like this for the past 2 years and now it's
> going to make my multi-gigabyte PostgreSQL more usable and responsive. Will
> the delay be tunable per VACUUM invocation?

As the patch is set up, you just do "SET vacuum_page_delay = n" and
then VACUUM.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 20:54:16
Message-ID: 19653.1067633656@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne <cbbrowne(at)libertyrms(dot)info> writes:
> The advantage of the per-page delay is that performance is not being
> "totally hammered" by the vacuum. If things are so busy that it's an
> issue, the system is liable to "limp somewhat," but that's not as bad
> as what we see now, where VACUUM and other activity are 'dueling' for
> access to I/O. Per-page delay means that VACUUM mostly defers to the
> other activity, limiting how badly it hurts other performance.

... or that's the theory, anyway. The point of putting up this patch
is for people to experiment to find out if it really helps.

regards, tom lane


From: "Stephen" <jleelim(at)xxxxxx(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 21:20:42
Message-ID: %lAob.2593$h9.1441@nntp-post.primus.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I tried the Tom Lane's patch on PostgreSQL 7.4-BETA-5 and it works
fantastically! Running a few short tests show a significant improvement in

responsiveness on my RedHat 9 Linux 2.4-20-8 (IDE 120GB 7200RPM UDMA5).

I didn't feel any noticeable delay when vacuum_page_delay is set to 5ms, 10
ms. Vacuum takes 15 to 24 times longer to complete (as expected)

but I don't mind at all. Vmstat BI/BO load is reduced by 5 times when
vacuum_page_delay = 1ms. Load average reduced significantly

also as there are less processes waiting to complete. I find a value of 1ms
to 5ms is quite good and will keep system responsive. Going from 10ms to 1ms
didn't seem to reduce the total vacuum time by much and I'm not sure why.

Any chance we can get this patched into 7.4 permanently?

I cannot say how well it would work on a heavy load, but on a light load
this patch is highly recommended for 24/7 large DB systems. The

database is mostly read-only. There are 133,000 rows and each row is about
2.5kB in size (mostly due to the bytea column holding a binary

image). The long row causes system to TOAST the table. I repeatedly ran the
following tests while system is idling:

Normal operation with no VACUUM
===============================

tsdb=# explain analyze select * from table1 where id =
'0078997ac809877c1a0d1f76af753608';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=19.030..19.036 rows=1 loops=1)
Index Cond: ((id)::text = '0078997ac809877c1a0d1f76af753608'::text)
Total runtime: 19.206 ms
(3 rows)

VACUUM at vacuum_page_delay = 0
===============================

-bash-2.05b$ vmstat 1
procs memory swap io system
cpu
r b w swpd free buff cache si so bi bo in cs us sy
id
0 1 0 176844 3960 17748 146704 0 0 1408 0 296 556 0 1
99
0 1 0 176844 3960 17748 146264 0 0 1536 0 285 546 0 2
98

tsdb=# explain analyze select * from table1 where id =
'00e5ae5f4fddab371f7847f7da65eebb';
QUERY PLAN
----------------------------------------------------------------------------
----------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=298.028..298.047 rows=1 loops=1)
Index Cond: ((id)::text = '0036edc4a92b6afd41304c6c8b76bc3c'::text)
Total runtime: 298.275 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'0046751ac3ec290b9f66ea1d66431923';
QUERY PLAN
----------------------------------------------------------------------------
--------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=454.727..454.746 rows=1 loops=1)
Index Cond: ((id)::text = '0046751ac3ec290b9f66ea1d66431923'::text)
Total runtime: 454.970 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'00a74e6885579a2d50487f5a1dceba22';
QUERY PLAN
----------------------------------------------------------------------------
--------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=344.483..344.501 rows=1 loops=1)
Index Cond: ((id)::text = '00a74e6885579a2d50487f5a1dceba22'::text)
Total runtime: 344.700 ms
(3 rows)

VACUUM at vacuum_page_delay = 1
===============================

procs memory swap io system
cpu
r b w swpd free buff cache si so bi bo in cs us sy
id
0 0 0 176840 4292 23700 137416 0 0 384 0 127 302 0 0
100
0 0 0 176840 4220 23700 137116 0 0 512 0 118 286 0 0
100
1 0 0 176840 4220 23700 136656 0 0 384 0 132 303 0 1
99

tsdb=# explain analyze select * from table1 where id =
'003d5966f8b9a06e4b0fff9fa8e93be0';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=74.575..74.584 rows=1 loops=1)
Index Cond: ((id)::text = '003d5966f8b9a06e4b0fff9fa8e93be0'::text)
Total runtime: 74.761 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'00677fe46cd0af3d98564068f34db1cf';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=31.779..31.785 rows=1 loops=1)
Index Cond: ((id)::text = '00677fe46cd0af3d98564068f34db1cf'::text)
Total runtime: 31.954 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'00b7c3e2fffdf39ff4ac50add04336b7';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=78.974..78.989 rows=1 loops=1)
Index Cond: ((id)::text = '00b7c3e2fffdf39ff4ac50add04336b7'::text)
Total runtime: 79.172 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'008d49c007f711d5f5ec48b67a8e58f0';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=30.143..30.148 rows=1 loops=1)
Index Cond: ((id)::text = '008d49c007f711d5f5ec48b67a8e58f0'::text)
Total runtime: 30.315 ms
(3 rows)

VACUUM at vacuum_page_delay = 5
===============================

-bash-2.05b$ vmstat 1
procs memory swap io system
cpu
r b w swpd free buff cache si so bi bo in cs us sy
id
0 0 0 176840 4228 22668 138212 0 0 512 0 117 276 0 0
100
0 0 0 176840 4220 22668 138212 0 0 384 0 132 296 0 1
99
0 0 0 176840 4220 22668 137764 0 0 384 0 114 276 0 0
100

tsdb=# explain analyze select * from table1 where id =
'000aa16ffe019fa327b68b7e610e5ac0';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=14.089..14.094 rows=1 loops=1)
Index Cond: ((id)::text = '000aa16ffe019fa327b68b7e610e5ac0'::text)
Total runtime: 14.252 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'00aacc4684577737498df0536be1fac8';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=16.747..16.752 rows=1 loops=1)
Index Cond: ((id)::text = '00aacc4684577737498df0536be1fac8'::text)
Total runtime: 16.910 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'00e295f5644d4cb77a5ebc4efbbaa770';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=16.684..16.690 rows=1 loops=1)
Index Cond: ((id)::text = '00e295f5644d4cb77a5ebc4efbbaa770'::text)
Total runtime: 16.886 ms
(3 rows)

VACUUM at vacuum_page_delay = 10
================================

-bash-2.05b$ vmstat 1
procs memory swap io system
cpu
r b w swpd free buff cache si so bi bo in cs us sy
id
0 0 0 176840 4336 20968 139780 0 0 384 108 121 294 0 0
100
0 0 0 176840 4336 20968 140164 0 0 384 0 130 281 0 1
99

tsdb=# explain analyze select * from table1 where id =
'007841017b9f7c80394f2bb4314ba8c1';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=19.576..19.587 rows=1 loops=1)
Index Cond: ((id)::text = '007841017b9f7c80394f2bb4314ba8c1'::text)
Total runtime: 19.854 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'0070724846c4d0d0dbb8f3e939fd1da4';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=10.616..10.624 rows=1 loops=1)
Index Cond: ((id)::text = '0070724846c4d0d0dbb8f3e939fd1da4'::text)
Total runtime: 10.795 ms
(3 rows)

tsdb=# explain analyze select * from table1 where id =
'00fc92bf0f5048d7680bd8fa2d4c6f3a';
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------------------
Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
(actual time=28.007..28.014 rows=1 loops=1)
Index Cond: ((id)::text = '00fc92bf0f5048d7680bd8fa2d4c6f3a'::text)
Total runtime: 28.183 ms
(3 rows)


From: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephen <jleelim(at)xxxxxx(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-10-31 22:19:51
Message-ID: 3FA2E007.4040801@zeut.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>"Stephen" <jleelim(at)xxxxxx(dot)com> writes:
>
>
>>Great! I haven't tried it yet, but I love the thought of it already :-)
>>I've been waiting for something like this for the past 2 years and now it's
>>going to make my multi-gigabyte PostgreSQL more usable and responsive. Will
>>the delay be tunable per VACUUM invocation?
>>
>>
>
>As the patch is set up, you just do "SET vacuum_page_delay = n" and
>then VACUUM.
>
>
probably a setting that autovacuum will tweak based on things like table
size etc.... If we can find a way to automatically tweak it that makes
sense.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Stephen" <jleelim(at)xxxxxx(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-01 19:12:49
Message-ID: 2254.1067713969@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Stephen" <jleelim(at)xxxxxx(dot)com> writes:
> also as there are less processes waiting to complete. I find a value of 1ms
> to 5ms is quite good and will keep system responsive. Going from 10ms to 1ms
> didn't seem to reduce the total vacuum time by much and I'm not sure why.

On most Unixen, the effective resolution of sleep requests is whatever
the scheduler time quantum is --- and 10ms is the standard quantum in
most cases. So any delay less than 10ms is going to be interpreted as
10ms.

I think on recent Linuxen it's possible to adjust the time quantum, but
whether this would be a net win isn't clear; presumably a shorter
quantum would result in more scheduler overhead and more process-swap
penalties.

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Stephen <jleelim(at)xxxxxx(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-01 22:42:45
Message-ID: 3FA436E5.8040506@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen wrote:

> I tried the Tom Lane's patch on PostgreSQL 7.4-BETA-5 and it works
> fantastically! Running a few short tests show a significant improvement in
>
> responsiveness on my RedHat 9 Linux 2.4-20-8 (IDE 120GB 7200RPM UDMA5).

I am currently looking at implementing ARC as a replacement strategy. I
don't have anything that works yet, so I can't really tell what the
result would be and it might turn out that we want both features.

All I can say is that the theory looks like an extremely smart and
generalized version of the crude hack I had done. And that one is able
to lower the impact of VACUUM on the foreground clients while increasing
the VACUUM speed. The 7.3.4 version of my crude hack is attached.

Jan

>
> I didn't feel any noticeable delay when vacuum_page_delay is set to 5ms, 10
> ms. Vacuum takes 15 to 24 times longer to complete (as expected)
>
> but I don't mind at all. Vmstat BI/BO load is reduced by 5 times when
> vacuum_page_delay = 1ms. Load average reduced significantly
>
> also as there are less processes waiting to complete. I find a value of 1ms
> to 5ms is quite good and will keep system responsive. Going from 10ms to 1ms
> didn't seem to reduce the total vacuum time by much and I'm not sure why.
>
> Any chance we can get this patched into 7.4 permanently?
>
> I cannot say how well it would work on a heavy load, but on a light load
> this patch is highly recommended for 24/7 large DB systems. The
>
> database is mostly read-only. There are 133,000 rows and each row is about
> 2.5kB in size (mostly due to the bytea column holding a binary
>
> image). The long row causes system to TOAST the table. I repeatedly ran the
> following tests while system is idling:
>
>
> Normal operation with no VACUUM
> ===============================
>
> tsdb=# explain analyze select * from table1 where id =
> '0078997ac809877c1a0d1f76af753608';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=19.030..19.036 rows=1 loops=1)
> Index Cond: ((id)::text = '0078997ac809877c1a0d1f76af753608'::text)
> Total runtime: 19.206 ms
> (3 rows)
>
>
>
> VACUUM at vacuum_page_delay = 0
> ===============================
>
> -bash-2.05b$ vmstat 1
> procs memory swap io system
> cpu
> r b w swpd free buff cache si so bi bo in cs us sy
> id
> 0 1 0 176844 3960 17748 146704 0 0 1408 0 296 556 0 1
> 99
> 0 1 0 176844 3960 17748 146264 0 0 1536 0 285 546 0 2
> 98
>
>
> tsdb=# explain analyze select * from table1 where id =
> '00e5ae5f4fddab371f7847f7da65eebb';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ----------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=298.028..298.047 rows=1 loops=1)
> Index Cond: ((id)::text = '0036edc4a92b6afd41304c6c8b76bc3c'::text)
> Total runtime: 298.275 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '0046751ac3ec290b9f66ea1d66431923';
> QUERY PLAN
> ----------------------------------------------------------------------------
> --------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=454.727..454.746 rows=1 loops=1)
> Index Cond: ((id)::text = '0046751ac3ec290b9f66ea1d66431923'::text)
> Total runtime: 454.970 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '00a74e6885579a2d50487f5a1dceba22';
> QUERY PLAN
> ----------------------------------------------------------------------------
> --------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=344.483..344.501 rows=1 loops=1)
> Index Cond: ((id)::text = '00a74e6885579a2d50487f5a1dceba22'::text)
> Total runtime: 344.700 ms
> (3 rows)
>
>
> VACUUM at vacuum_page_delay = 1
> ===============================
>
> procs memory swap io system
> cpu
> r b w swpd free buff cache si so bi bo in cs us sy
> id
> 0 0 0 176840 4292 23700 137416 0 0 384 0 127 302 0 0
> 100
> 0 0 0 176840 4220 23700 137116 0 0 512 0 118 286 0 0
> 100
> 1 0 0 176840 4220 23700 136656 0 0 384 0 132 303 0 1
> 99
>
>
> tsdb=# explain analyze select * from table1 where id =
> '003d5966f8b9a06e4b0fff9fa8e93be0';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=74.575..74.584 rows=1 loops=1)
> Index Cond: ((id)::text = '003d5966f8b9a06e4b0fff9fa8e93be0'::text)
> Total runtime: 74.761 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '00677fe46cd0af3d98564068f34db1cf';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=31.779..31.785 rows=1 loops=1)
> Index Cond: ((id)::text = '00677fe46cd0af3d98564068f34db1cf'::text)
> Total runtime: 31.954 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '00b7c3e2fffdf39ff4ac50add04336b7';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=78.974..78.989 rows=1 loops=1)
> Index Cond: ((id)::text = '00b7c3e2fffdf39ff4ac50add04336b7'::text)
> Total runtime: 79.172 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '008d49c007f711d5f5ec48b67a8e58f0';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=30.143..30.148 rows=1 loops=1)
> Index Cond: ((id)::text = '008d49c007f711d5f5ec48b67a8e58f0'::text)
> Total runtime: 30.315 ms
> (3 rows)
>
>
> VACUUM at vacuum_page_delay = 5
> ===============================
>
> -bash-2.05b$ vmstat 1
> procs memory swap io system
> cpu
> r b w swpd free buff cache si so bi bo in cs us sy
> id
> 0 0 0 176840 4228 22668 138212 0 0 512 0 117 276 0 0
> 100
> 0 0 0 176840 4220 22668 138212 0 0 384 0 132 296 0 1
> 99
> 0 0 0 176840 4220 22668 137764 0 0 384 0 114 276 0 0
> 100
>
>
> tsdb=# explain analyze select * from table1 where id =
> '000aa16ffe019fa327b68b7e610e5ac0';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=14.089..14.094 rows=1 loops=1)
> Index Cond: ((id)::text = '000aa16ffe019fa327b68b7e610e5ac0'::text)
> Total runtime: 14.252 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '00aacc4684577737498df0536be1fac8';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=16.747..16.752 rows=1 loops=1)
> Index Cond: ((id)::text = '00aacc4684577737498df0536be1fac8'::text)
> Total runtime: 16.910 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '00e295f5644d4cb77a5ebc4efbbaa770';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=16.684..16.690 rows=1 loops=1)
> Index Cond: ((id)::text = '00e295f5644d4cb77a5ebc4efbbaa770'::text)
> Total runtime: 16.886 ms
> (3 rows)
>
> VACUUM at vacuum_page_delay = 10
> ================================
>
> -bash-2.05b$ vmstat 1
> procs memory swap io system
> cpu
> r b w swpd free buff cache si so bi bo in cs us sy
> id
> 0 0 0 176840 4336 20968 139780 0 0 384 108 121 294 0 0
> 100
> 0 0 0 176840 4336 20968 140164 0 0 384 0 130 281 0 1
> 99
>
>
> tsdb=# explain analyze select * from table1 where id =
> '007841017b9f7c80394f2bb4314ba8c1';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=19.576..19.587 rows=1 loops=1)
> Index Cond: ((id)::text = '007841017b9f7c80394f2bb4314ba8c1'::text)
> Total runtime: 19.854 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '0070724846c4d0d0dbb8f3e939fd1da4';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=10.616..10.624 rows=1 loops=1)
> Index Cond: ((id)::text = '0070724846c4d0d0dbb8f3e939fd1da4'::text)
> Total runtime: 10.795 ms
> (3 rows)
>
> tsdb=# explain analyze select * from table1 where id =
> '00fc92bf0f5048d7680bd8fa2d4c6f3a';
> QUERY PLAN
> ----------------------------------------------------------------------------
> ------------------------------------------------------
> Index Scan using table1_pkey on table1 (cost=0.00..6.01 rows=2 width=344)
> (actual time=28.007..28.014 rows=1 loops=1)
> Index Cond: ((id)::text = '00fc92bf0f5048d7680bd8fa2d4c6f3a'::text)
> Total runtime: 28.183 ms
> (3 rows)
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

Attachment Content-Type Size
vacuum_buffer_hack-7.3.4.diff text/plain 4.3 KB

From: "Stephen" <jleelim(at)xxxxxxx(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-02 05:45:08
Message-ID: GQ0pb.5642$Ts4.2774@nntp-post.primus.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

As it turns out. With vacuum_page_delay = 0, VACUUM took 1m20s (80s) to
complete, with vacuum_page_delay = 1 and vacuum_page_delay = 10, both
VACUUMs completed in 18m3s (1080 sec). A factor of 13 times! This is for a
single 350 MB table.

Apparently, it looks like the upcoming Linux kernel 2.6 will have a smaller
quantum:

http://go.jitbot.com/linux2.6-quantum

There is also mention of user-space tweak to get a more accurate time slice
of near 1ms on Linux, but I'm not sure how this works and if it applies to
Unixes:

http://go.jitbot.com/linux-devrtc-quantum

Regards, Stephen

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in message
news:2254(dot)1067713969(at)sss(dot)pgh(dot)pa(dot)us(dot)(dot)(dot)
> "Stephen" <jleelim(at)xxxxxx(dot)com> writes:
> > also as there are less processes waiting to complete. I find a value of
1ms
> > to 5ms is quite good and will keep system responsive. Going from 10ms to
1ms
> > didn't seem to reduce the total vacuum time by much and I'm not sure
why.
>
> On most Unixen, the effective resolution of sleep requests is whatever
> the scheduler time quantum is --- and 10ms is the standard quantum in
> most cases. So any delay less than 10ms is going to be interpreted as
> 10ms.
>
> I think on recent Linuxen it's possible to adjust the time quantum, but
> whether this would be a net win isn't clear; presumably a shorter
> quantum would result in more scheduler overhead and more process-swap
> penalties.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-02 18:00:35
Message-ID: 15456.1067796035@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> I am currently looking at implementing ARC as a replacement strategy. I
> don't have anything that works yet, so I can't really tell what the
> result would be and it might turn out that we want both features.

It's likely that we would. As someone (you?) already pointed out,
VACUUM has bad side-effects both in terms of cache flushing and in
terms of sheer I/O load. Those effects require different fixes AFAICS.

One thing that bothers me here is that I don't see how adjusting our
own buffer replacement strategy is going to do much of anything when
we cannot control the kernel's buffer replacement strategy. To get any
real traction we'd have to go back to the "take over most of RAM for
shared buffers" approach, which we already know to have a bunch of
severe disadvantages.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-02 21:20:50
Message-ID: 3FA57532.3020902@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Not surprising, I should have thought. Why would you care that much?
The idea as I understand it is to improve the responsiveness of things
happening alongside vacuum ("real work"). I normally run vacuum when I
don't expect anything else much to be happening - but I don't care how
long it takes (within reason), especially if it isn't going to intefere
with other uses.

cheers

andrew

Stephen wrote:

>As it turns out. With vacuum_page_delay = 0, VACUUM took 1m20s (80s) to
>complete, with vacuum_page_delay = 1 and vacuum_page_delay = 10, both
>VACUUMs completed in 18m3s (1080 sec). A factor of 13 times! This is for a
>single 350 MB table.
>
>Apparently, it looks like the upcoming Linux kernel 2.6 will have a smaller
>quantum:
>
>http://go.jitbot.com/linux2.6-quantum
>
>There is also mention of user-space tweak to get a more accurate time slice
>of near 1ms on Linux, but I'm not sure how this works and if it applies to
>Unixes:
>
>http://go.jitbot.com/linux-devrtc-quantum
>
>Regards, Stephen
>
>
>
>"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in message
>news:2254(dot)1067713969(at)sss(dot)pgh(dot)pa(dot)us(dot)(dot)(dot)
>
>
>>"Stephen" <jleelim(at)xxxxxx(dot)com> writes:
>>
>>
>>>also as there are less processes waiting to complete. I find a value of
>>>
>>>
>1ms
>
>
>>>to 5ms is quite good and will keep system responsive. Going from 10ms to
>>>
>>>
>1ms
>
>
>>>didn't seem to reduce the total vacuum time by much and I'm not sure
>>>
>>>
>why.
>
>
>>On most Unixen, the effective resolution of sleep requests is whatever
>>the scheduler time quantum is --- and 10ms is the standard quantum in
>>most cases. So any delay less than 10ms is going to be interpreted as
>>10ms.
>>
>>I think on recent Linuxen it's possible to adjust the time quantum, but
>>whether this would be a net win isn't clear; presumably a shorter
>>quantum would result in more scheduler overhead and more process-swap
>>penalties.
>>
>>regards, tom lane
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 6: Have you searched our list archives?
>>
>> http://archives.postgresql.org
>>
>>
>>
>
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 7: don't forget to increase your free space map settings
>
>
>


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-02 22:51:24
Message-ID: 1067813484.3357.20.camel@fuji.krosing.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane kirjutas P, 02.11.2003 kell 20:00:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > I am currently looking at implementing ARC as a replacement strategy. I
> > don't have anything that works yet, so I can't really tell what the
> > result would be and it might turn out that we want both features.
>
> It's likely that we would. As someone (you?) already pointed out,
> VACUUM has bad side-effects both in terms of cache flushing and in
> terms of sheer I/O load. Those effects require different fixes AFAICS.
>
> One thing that bothers me here is that I don't see how adjusting our
> own buffer replacement strategy is going to do much of anything when
> we cannot control the kernel's buffer replacement strategy.

At least for OpenSource/Free OS'es it would probably be possible to
persuade kernel developers to give the needed control to userspace apps.

So the "take over all RAM" is not the only option ;)

> To get any
> real traction we'd have to go back to the "take over most of RAM for
> shared buffers" approach, which we already know to have a bunch of
> severe disadvantages.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 00:15:07
Message-ID: m3n0becnp0.fsf@wolfe.cbbrowne.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Centuries ago, Nostradamus foresaw when "Stephen" <jleelim(at)xxxxxxx(dot)com> would write:
> As it turns out. With vacuum_page_delay = 0, VACUUM took 1m20s (80s)
> to complete, with vacuum_page_delay = 1 and vacuum_page_delay = 10,
> both VACUUMs completed in 18m3s (1080 sec). A factor of 13 times!
> This is for a single 350 MB table.

While it is unfortunate that the minimum quanta seems to commonly be
10ms, it doesn't strike me as an enormous difficulty from a practical
perspective.

Well, actually, the case where it _would_ be troublesome would be
where there was a combination of huge tables needing vacuuming and
smaller ones that are _heavily_ updated (e.g. - account balances),
where pg_autovacuum might take so long on some big tables that it
wouldn't get to the smaller ones often enough.

But even in that case, I'm not sure the loss of control is necessarily
a vital problem. It certainly means that the cost of vacuuming has a
strictly limited "degrading" effect on performance.

It might be mitigated by the VACUUM CACHE notion I have suggested,
where a Real Quick Vacuum would go through just the pages that are
cached in memory, which would likely be quite effective at dealing
with heavily-updated balance tables...
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www3.sympatico.ca/cbbrowne/sap.html
Rules of the Evil Overlord #212. "I will not send out battalions
composed wholly of robots or skeletons against heroes who have qualms
about killing living beings. <http://www.eviloverlord.com/>


From: Gaetano Mendola <mendola(at)bigfoot(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 01:08:18
Message-ID: 3FA5AA82.60800@bigfoot.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Attached is an extremely crude prototype patch for making VACUUM delay
> by a configurable amount between pages, in hopes of throttling its disk
> bandwidth consumption. By default, there is no delay (so no change in
> behavior). In some quick testing, setting vacuum_page_delay to 10
> (milliseconds) seemed to greatly reduce a background VACUUM's impact
> on pgbench timing on an underpowered machine. Of course, it also makes
> VACUUM a lot slower, but that's probably not a serious concern for
> background VACUUMs.

[SNIP]

> The patch is against CVS tip, but should apply cleanly to any recent
> 7.4 beta. You could likely adapt it to 7.3 without much effort.

Will we have this on 7.4 ?
I tried it and improve a lot the responsiness of my queries just putting
the delay equal to 10 ms.

Regards
Gaetano Mendola


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 08:24:48
Message-ID: 1067847887.2580.1.camel@fuji.krosing.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne kirjutas E, 03.11.2003 kell 02:15:
> Well, actually, the case where it _would_ be troublesome would be
> where there was a combination of huge tables needing vacuuming and
> smaller ones that are _heavily_ updated (e.g. - account balances),
> where pg_autovacuum might take so long on some big tables that it
> wouldn't get to the smaller ones often enough.

Can't one just run a _separate_ VACUUM on those smaller tables ?

------------
Hannu


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 13:22:45
Message-ID: m3znfdbn8a.fsf@wolfe.cbbrowne.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The world rejoiced as hannu(at)tm(dot)ee (Hannu Krosing) wrote:
> Christopher Browne kirjutas E, 03.11.2003 kell 02:15:
>> Well, actually, the case where it _would_ be troublesome would be
>> where there was a combination of huge tables needing vacuuming and
>> smaller ones that are _heavily_ updated (e.g. - account balances),
>> where pg_autovacuum might take so long on some big tables that it
>> wouldn't get to the smaller ones often enough.
>
> Can't one just run a _separate_ VACUUM on those smaller tables ?

Yes, but that defeats the purpose of having a daemon that tries to
manage this all for you.
--
(reverse (concatenate 'string "gro.gultn" "@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/unix.html
"...once can imagine the government's problem. This is all pretty
magical stuff to them. If I were trying to terminate the operations
of a witch coven, I'd probably seize everything in sight. How would I
tell the ordinary household brooms from the getaway vehicles?"
-- John Perry Barlow


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 14:35:57
Message-ID: 3FA667CD.8080000@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne wrote:

> The world rejoiced as hannu(at)tm(dot)ee (Hannu Krosing) wrote:
>> Christopher Browne kirjutas E, 03.11.2003 kell 02:15:
>>> Well, actually, the case where it _would_ be troublesome would be
>>> where there was a combination of huge tables needing vacuuming and
>>> smaller ones that are _heavily_ updated (e.g. - account balances),
>>> where pg_autovacuum might take so long on some big tables that it
>>> wouldn't get to the smaller ones often enough.
>>
>> Can't one just run a _separate_ VACUUM on those smaller tables ?
>
> Yes, but that defeats the purpose of having a daemon that tries to
> manage this all for you.

It only shows where the daemon has potential for improvement. If it
knows approximately the table sizes, it can manage a separate "passing"
lane for the fast and high frequent commuters.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Andrew Sullivan <andrew(at)libertyrms(dot)info>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 14:48:45
Message-ID: 20031103144845.GG12457@libertyrms.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
> real traction we'd have to go back to the "take over most of RAM for
> shared buffers" approach, which we already know to have a bunch of
> severe disadvantages.

I know there are severe disadvantages in the current implementation,
but are there in-principle severe disadvantages? Or are you speaking
more generally, like "maintainability of code", "someone has to look
after all that buffering optimisation", "potential for about 10
trillion bugs", &c.?

A

--
----
Andrew Sullivan 204-4141 Yonge Street
Afilias Canada Toronto, Ontario Canada
<andrew(at)libertyrms(dot)info> M2P 2A8
+1 416 646 3304 x110


From: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 14:57:54
Message-ID: 3FA66CF2.6040409@zeut.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne wrote:

>The world rejoiced as hannu(at)tm(dot)ee (Hannu Krosing) wrote:
>
>
>>Christopher Browne kirjutas E, 03.11.2003 kell 02:15:
>>
>>
>>>Well, actually, the case where it _would_ be troublesome would be
>>>where there was a combination of huge tables needing vacuuming and
>>>smaller ones that are _heavily_ updated (e.g. - account balances),
>>>where pg_autovacuum might take so long on some big tables that it
>>>wouldn't get to the smaller ones often enough.
>>>
>>>
>>Can't one just run a _separate_ VACUUM on those smaller tables ?
>>
>>
>
>Yes, but that defeats the purpose of having a daemon that tries to
>manage this all for you.
>
>
But if this delayed vacuum was available for pg_autovacuum to use, it
might be useful for pg_autovacuum to run multiple simultaneous vacuums.
It seems to me that the delayed vacuum is so slow, that we could
probably run several (a few) of them without saturating the I/O.

Or... It seems to me that we have been observing something on the order
of 10x-20x slowdown for vacuuming a table. I think this is WAY
overcompensating for the original problems, and would cause it's own
problem as mentioned above. Since the granularity of delay seems to be
the problem can we do more work between delays? Instead of sleeping
after every page (I assume this is what it's doing) perhaps we should
sleep every 10 pages, or perhaps fix the sleep value at 10ms and make
the amount of work done between sleeps a configurable option. Seems
that would allow small tables to be done without delay etc....


From: "Stephen" <jleelim(at)xxxxxxx(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 15:04:52
Message-ID: o7upb.8716$qL7.8160@nntp-post.primus.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I don't mind the long delay as long as we have a choice as we clearly do in
this case to set vacuum_page_delay=WHATEVER. Of course, if VACUUM can be
improved with better code placement for the delays or buffer replacement
policies then I'm all for it. Right now, I'm pretty satisfied with the
responsiveness on large DBs using vacuum_page_delay=10ms delay.

Any ideas if this patch will be included into 7.4 before final release?

Stephen

"Andrew Dunstan" <andrew(at)dunslane(dot)net> wrote in message
news:3FA57532(dot)3020902(at)dunslane(dot)net(dot)(dot)(dot)
>
> Not surprising, I should have thought. Why would you care that much?
> The idea as I understand it is to improve the responsiveness of things
> happening alongside vacuum ("real work"). I normally run vacuum when I
> don't expect anything else much to be happening - but I don't care how
> long it takes (within reason), especially if it isn't going to intefere
> with other uses.
>
> cheers
>
> andrew
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Sullivan <andrew(at)libertyrms(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 15:47:04
Message-ID: 21456.1067874424@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Sullivan <andrew(at)libertyrms(dot)info> writes:
> On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
>> real traction we'd have to go back to the "take over most of RAM for
>> shared buffers" approach, which we already know to have a bunch of
>> severe disadvantages.

> I know there are severe disadvantages in the current implementation,
> but are there in-principle severe disadvantages?

Yes. For one, since we cannot change the size of shared memory
on-the-fly (at least not portably), there is no opportunity to trade off
memory usage dynamically between processes and disk buffers. For
another, on many systems shared memory is subject to being swapped out.
Swapping out dirty buffers is a performance killer, because they must be
swapped back in again before they can be written to where they should
have gone. The only way to avoid this is to keep the number of shared
buffers small enough that they all remain fairly "hot" (recently used)
and so the kernel won't be tempted to swap out any part of the region.

regards, tom lane


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-03 16:27:15
Message-ID: 1067876834.2414.12.camel@fuji.krosing.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne kirjutas E, 03.11.2003 kell 15:22:
> >
> > Can't one just run a _separate_ VACUUM on those smaller tables ?
>
> Yes, but that defeats the purpose of having a daemon that tries to
> manage this all for you.

If a dumb deamon can't do its work well, we need smarter daemons ;)

---------------
Hannu


From: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
To: pgsql-hackers(at)postgresql(dot)org
Subject: RC1 on AIX - working thus far
Date: 2003-11-03 16:32:49
Message-ID: 60vfq12z0u.fsf_-_@dev6.int.libertyrms.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

... much omitted ...
alter_table ... ok
sequence ... ok
polymorphism ... ok
stats ... ok
============== shutting down postmaster ==============

======================
All 93 tests passed.
======================

rm regress.o
gmake[2]: Leaving directory `/opt/OXRS/PkgSrc/pgsql74-rc1/src/test/regress'
gmake[1]: Leaving directory `/opt/OXRS/PkgSrc/pgsql74-rc1/src/test'
bash-2.05a$ uname -a
AIX ibm-db 1 5 000CD13A4C00
--
"cbbrowne","@","libertyrms.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RC1 on AIX - working thus far
Date: 2003-11-03 19:25:23
Message-ID: Pine.LNX.4.44.0311032024500.12750-100000@peter.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne writes:

> bash-2.05a$ uname -a
> AIX ibm-db 1 5 000CD13A4C00

We already have a report for AIX. Were you trying to indicate that this
is a different variant thereof?

--
Peter Eisentraut peter_e(at)gmx(dot)net


From: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RC1 on AIX - working thus far
Date: 2003-11-03 20:13:25
Message-ID: 60k76h2ot6.fsf@dev6.int.libertyrms.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Christopher Browne writes:
>
>> bash-2.05a$ uname -a
>> AIX ibm-db 1 5 000CD13A4C00
>
> We already have a report for AIX. Were you trying to indicate that this
> is a different variant thereof?

I'm afraid I hadn't seen another AIX report; this may replicate other reports...

I don't think I have seen an RC1 report on Solaris 8 yet, though I may
be wrong; if there isn't one, here's one...

conversion ... ok
truncate ... ok
alter_table ... ok
sequence ... ok
polymorphism ... ok
stats ... ok
============== shutting down postmaster ==============

======================
All 93 tests passed.
======================

make[2]: Leaving directory `/disk3/OXRS/postgresql-7.4RC1/src/test/regress'
make[1]: Leaving directory `/disk3/OXRS/postgresql-7.4RC1/src/test'
postgres(at)ringo /disk3/OXRS/postgresql-7.4RC1 > uname -a
SunOS ringo 5.8 Generic_108528-17 sun4u sparc SUNW,Ultra-4
postgres(at)ringo /disk3/OXRS/postgresql-7.4RC1 >
--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RC1 on AIX - working thus far
Date: 2003-11-03 20:36:57
Message-ID: Pine.LNX.4.44.0311032135490.12750-100000@peter.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne writes:

> I'm afraid I hadn't seen another AIX report; this may replicate other reports...

See http://developer.postgresql.org/docs/postgres/supported-platforms.html
for a list of platforms that have been verified with 7.4.
(Linux/Playstation, Linux/hppa, and UnixWare will be added shortly.)

--
Peter Eisentraut peter_e(at)gmx(dot)net


From: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 04:07:55
Message-ID: 3FA7261B.1060005@bytecraft.com.my
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne wrote:
> Centuries ago, Nostradamus foresaw when "Stephen" <jleelim(at)xxxxxxx(dot)com> would write:
>
>>As it turns out. With vacuum_page_delay = 0, VACUUM took 1m20s (80s)
>>to complete, with vacuum_page_delay = 1 and vacuum_page_delay = 10,
>>both VACUUMs completed in 18m3s (1080 sec). A factor of 13 times!
>>This is for a single 350 MB table.
>
>
> While it is unfortunate that the minimum quanta seems to commonly be
> 10ms, it doesn't strike me as an enormous difficulty from a practical
> perspective.

If we can't lower the minimum quanta, we could always vacuum 2 pages
before sleeping 10ms, effectively sleeping 5ms.

Say,
vacuum_page_per_delay = 2
vacuum_time_per_delay = 10

What would be interesting would be pg_autovacuum changing these values
per table, depending on current I/O load.

Hmmm. Looks like there's a lot of interesting things pg_autovacuum can do:
1. When on low I/O load, running multiple vacuums on different, smaller
tables on full speed, careful to note that these vacuums will increase
the I/O load as well.
2. When on high I/O load, vacuum big, busy tables slowly.


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 04:28:25
Message-ID: 3FA72AE9.9090903@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Ang Chin Han wrote:
> Christopher Browne wrote:
>> Centuries ago, Nostradamus foresaw when "Stephen" <jleelim(at)xxxxxxx(dot)com> would write:
>>
>>>As it turns out. With vacuum_page_delay = 0, VACUUM took 1m20s (80s)
>>>to complete, with vacuum_page_delay = 1 and vacuum_page_delay = 10,
>>>both VACUUMs completed in 18m3s (1080 sec). A factor of 13 times!
>>>This is for a single 350 MB table.
>>
>>
>> While it is unfortunate that the minimum quanta seems to commonly be
>> 10ms, it doesn't strike me as an enormous difficulty from a practical
>> perspective.
>
> If we can't lower the minimum quanta, we could always vacuum 2 pages
> before sleeping 10ms, effectively sleeping 5ms.
>
> Say,
> vacuum_page_per_delay = 2
> vacuum_time_per_delay = 10

That's exactly what I did ... look at the combined experiment posted
under subject "Experimental ARC implementation". The two parameters are
named vacuum_page_groupsize and vacuum_page_delay.

>
> What would be interesting would be pg_autovacuum changing these values
> per table, depending on current I/O load.
>
> Hmmm. Looks like there's a lot of interesting things pg_autovacuum can do:
> 1. When on low I/O load, running multiple vacuums on different, smaller
> tables on full speed, careful to note that these vacuums will increase
> the I/O load as well.
> 2. When on high I/O load, vacuum big, busy tables slowly.
>

From what I see here the two parameters above together with the ARC
scan resistance and with the changed strategy where to place pages
faulted in by vacuum, I think one can pretty good handle that now. It's
certainly much better than before.

What still needs to be addressed is the IO storm cause by checkpoints. I
see it much relaxed when stretching out the BufferSync() over most of
the time until the next one should occur. But the kernel sync at it's
end still pushes the system hard against the wall.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 15:31:39
Message-ID: 22099.1067959899@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> What still needs to be addressed is the IO storm cause by checkpoints. I
> see it much relaxed when stretching out the BufferSync() over most of
> the time until the next one should occur. But the kernel sync at it's
> end still pushes the system hard against the wall.

I have never been happy with the fact that we use sync(2) at all. Quite
aside from the "I/O storm" issue, sync() is really an unsafe way to do a
checkpoint, because there is no way to be certain when it is done. And
on top of that, it does too much, because it forces syncing of files
unrelated to Postgres.

I would like to see us go over to fsync, or some other technique that
gives more certainty about when the write has occurred. There might be
some scope that way to allow stretching out the I/O, too.

The main problem with this is knowing which files need to be fsync'd.
The only idea I have come up with is to move all buffer write operations
into a background writer process, which could easily keep track of
every file it's written into since the last checkpoint. This could cause
problems though if a backend wants to acquire a free buffer and there's
none to be had --- do we want it to wait for the background process to
do something? We could possibly say that backends may write dirty
buffers for themselves, but only if they fsync them immediately. As
long as this path is seldom taken, the extra fsyncs shouldn't be a big
performance problem.

Actually, once you build it this way, you could make all writes
synchronous (open the files O_SYNC) so that there is never any need for
explicit fsync at checkpoint time. The background writer process would
be the one incurring the wait in most cases, and that's just fine. In
this way you could directly control the rate at which writes are issued,
and there's no I/O storm at all. (fsync could still cause an I/O storm
if there's lots of pending writes in a single file.)

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 15:45:22
Message-ID: 3FA7C992.5090106@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> What still needs to be addressed is the IO storm cause by checkpoints. I
>> see it much relaxed when stretching out the BufferSync() over most of
>> the time until the next one should occur. But the kernel sync at it's
>> end still pushes the system hard against the wall.
>
> I have never been happy with the fact that we use sync(2) at all. Quite
> aside from the "I/O storm" issue, sync() is really an unsafe way to do a
> checkpoint, because there is no way to be certain when it is done. And
> on top of that, it does too much, because it forces syncing of files
> unrelated to Postgres.

Sure does it do too much. But together with the other layer of
indirection, the virtual file descriptor pool, what is the exact
guaranteed behaviour of

write(); close(); open(); fsync();

cross platform?

> Actually, once you build it this way, you could make all writes
> synchronous (open the files O_SYNC) so that there is never any need for
> explicit fsync at checkpoint time. The background writer process would
> be the one incurring the wait in most cases, and that's just fine. In
> this way you could directly control the rate at which writes are issued,
> and there's no I/O storm at all. (fsync could still cause an I/O storm
> if there's lots of pending writes in a single file.)

Yes, but then the configuration leans more towards "take over the RAM"
again, and we better have a much improved cache strategy before that.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Postgresql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 15:51:02
Message-ID: 3FA7CAE6.1040402@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

>Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>
>
>>What still needs to be addressed is the IO storm cause by checkpoints. I
>>see it much relaxed when stretching out the BufferSync() over most of
>>the time until the next one should occur. But the kernel sync at it's
>>end still pushes the system hard against the wall.
>>
>>
>
>I have never been happy with the fact that we use sync(2) at all. Quite
>aside from the "I/O storm" issue, sync() is really an unsafe way to do a
>checkpoint, because there is no way to be certain when it is done. And
>on top of that, it does too much, because it forces syncing of files
>unrelated to Postgres.
>
>I would like to see us go over to fsync, or some other technique that
>gives more certainty about when the write has occurred. There might be
>some scope that way to allow stretching out the I/O, too.
>
>The main problem with this is knowing which files need to be fsync'd.
>The only idea I have come up with is to move all buffer write operations
>into a background writer process, which could easily keep track of
>every file it's written into since the last checkpoint. This could cause
>problems though if a backend wants to acquire a free buffer and there's
>none to be had --- do we want it to wait for the background process to
>do something? We could possibly say that backends may write dirty
>buffers for themselves, but only if they fsync them immediately. As
>long as this path is seldom taken, the extra fsyncs shouldn't be a big
>performance problem.
>
>Actually, once you build it this way, you could make all writes
>synchronous (open the files O_SYNC) so that there is never any need for
>explicit fsync at checkpoint time. The background writer process would
>be the one incurring the wait in most cases, and that's just fine. In
>this way you could directly control the rate at which writes are issued,
>and there's no I/O storm at all. (fsync could still cause an I/O storm
>if there's lots of pending writes in a single file.)
>
>
>
Or maybe fdatasync() would be slightly more efficient - do we care about
flushing metadata that much?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 15:58:46
Message-ID: 22408.1067961526@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> Tom Lane wrote:
>> I have never been happy with the fact that we use sync(2) at all.

> Sure does it do too much. But together with the other layer of
> indirection, the virtual file descriptor pool, what is the exact
> guaranteed behaviour of
> write(); close(); open(); fsync();
> cross platform?

That isn't guaranteed, which is why we have to use sync() at the
moment. To go over to fsync or O_SYNC we'd need more control over which
file descriptors are used to issue writes. Which is why I was thinking
about moving the writes to a centralized writer process.

>> Actually, once you build it this way, you could make all writes
>> synchronous (open the files O_SYNC) so that there is never any need for
>> explicit fsync at checkpoint time.

> Yes, but then the configuration leans more towards "take over the RAM"

Why? The idea is to try to issue writes at a fairly steady rate, which
strikes me as much better than the current behavior. I don't see why it
would force you to have large numbers of buffers available. You'd want
a few thousand, no doubt, but that's not a large number.

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 16:28:10
Message-ID: 3FA7D39A.8060502@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> Tom Lane wrote:
>>> I have never been happy with the fact that we use sync(2) at all.
>
>> Sure does it do too much. But together with the other layer of
>> indirection, the virtual file descriptor pool, what is the exact
>> guaranteed behaviour of
>> write(); close(); open(); fsync();
>> cross platform?
>
> That isn't guaranteed, which is why we have to use sync() at the
> moment. To go over to fsync or O_SYNC we'd need more control over which
> file descriptors are used to issue writes. Which is why I was thinking
> about moving the writes to a centralized writer process.
>
>>> Actually, once you build it this way, you could make all writes
>>> synchronous (open the files O_SYNC) so that there is never any need for
>>> explicit fsync at checkpoint time.
>
>> Yes, but then the configuration leans more towards "take over the RAM"
>
> Why? The idea is to try to issue writes at a fairly steady rate, which
> strikes me as much better than the current behavior. I don't see why it
> would force you to have large numbers of buffers available. You'd want
> a few thousand, no doubt, but that's not a large number.

That is part of the idea. The whole idea is to issue "physical" writes
at a fairly steady rate without increasing the number of them
substantial or interfering with the drives opinion about their order too
much. I think O_SYNC for random access can be in conflict with write
reordering.

How I can see the background writer operating is that he's keeping the
buffers in the order of the LRU chain(s) clean, because those are the
buffers that most likely get replaced soon. In my experimental ARC code
it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and
n2 dirty buffers (n1+n2 configurable), then fsync all files that have
been involved in that, nap depending on where he got down the queues (to
increase the write rate when running low on clean buffers), and do it
all over again.

That way, everyone else doing a write must issue an fsync too because
it's not guaranteed that the fsync of one process flushes the writes of
another. But as you said, if that is a relatively seldom operation for a
regular backend, it won't hurt.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: "Matthew T(dot) O'Connor" <matthew(at)zeut(dot)net>
To: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 16:30:40
Message-ID: 3FA7D430.201@zeut.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Ang Chin Han wrote:

> Christopher Browne wrote:
>
>> Centuries ago, Nostradamus foresaw when "Stephen"
>> <jleelim(at)xxxxxxx(dot)com> would write:
>>
>>> As it turns out. With vacuum_page_delay = 0, VACUUM took 1m20s (80s)
>>> to complete, with vacuum_page_delay = 1 and vacuum_page_delay = 10,
>>> both VACUUMs completed in 18m3s (1080 sec). A factor of 13 times!
>>> This is for a single 350 MB table.
>>
>> While it is unfortunate that the minimum quanta seems to commonly be
>> 10ms, it doesn't strike me as an enormous difficulty from a practical
>> perspective.
>
> If we can't lower the minimum quanta, we could always vacuum 2 pages
> before sleeping 10ms, effectively sleeping 5ms.

Right, I think this is what Jan has done already.

> What would be interesting would be pg_autovacuum changing these values
> per table, depending on current I/O load.
>
> Hmmm. Looks like there's a lot of interesting things pg_autovacuum can
> do:
> 1. When on low I/O load, running multiple vacuums on different,
> smaller tables on full speed, careful to note that these vacuums will
> increase the I/O load as well.
> 2. When on high I/O load, vacuum big, busy tables slowly.

I'm not sure how practacle any of this is. How will pg_autovacuum
surmise the current I/O load of the system? Keeping in mind that
postgres is not the only cause of I/O. Also, the optimum delay for a
long running vacuum might change dramatically while it's running. If
there is a way to judge the current I/O load, it might be better for
vacuum to auto-tune itself while it's running, perhaps based on some
hints given to it by pg_autovacuum or manually by a user. For example,
a delay hint of 0 should always be zero no matter what. A delay hint of
1 will scale up slower than a delay hint of 2 which would scale up
slower than 5 etc....

Of course this is all wild conjecture if there isn't an easy way to
surmise the system I/O load. Thoughts?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 16:49:03
Message-ID: 22712.1067964543@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> That is part of the idea. The whole idea is to issue "physical" writes
> at a fairly steady rate without increasing the number of them
> substantial or interfering with the drives opinion about their order too
> much. I think O_SYNC for random access can be in conflict with write
> reordering.

Good point. But if we issue lots of writes without fsync then we still
have the problem of a write storm when the fsync finally occurs, while
if we fsync too often then we constrain the write order too much. There
will need to be some tuning here.

> How I can see the background writer operating is that he's keeping the
> buffers in the order of the LRU chain(s) clean, because those are the
> buffers that most likely get replaced soon. In my experimental ARC code
> it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and
> n2 dirty buffers (n1+n2 configurable), then fsync all files that have
> been involved in that, nap depending on where he got down the queues (to
> increase the write rate when running low on clean buffers), and do it
> all over again.

You probably need one more knob here: how often to issue the fsyncs.
I'm not convinced "once per outer loop" is a sufficient answer.
Otherwise this is sounding pretty good.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Postgresql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 17:11:16
Message-ID: 22927.1067965876@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Actually, once you build it this way, you could make all writes
>> synchronous (open the files O_SYNC) so that there is never any need for
>> explicit fsync at checkpoint time.
>>
> Or maybe fdatasync() would be slightly more efficient - do we care about
> flushing metadata that much?

We don't, but it would just obscure the discussion to spell out "fsync,
or fdatasync where available" ...

regards, tom lane


From: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
To: pgsql-hackers(at)postgresql(dot)org
Subject: RC1 on AIX - Some Anti-results
Date: 2003-11-04 17:32:29
Message-ID: 6065i011le.fsf_-_@dev6.int.libertyrms.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

peter_e(at)gmx(dot)net (Peter Eisentraut) writes:
> Christopher Browne writes:
>
>> bash-2.05a$ uname -a
>> AIX ibm-db 1 5 000CD13A4C00
>
> We already have a report for AIX. Were you trying to indicate that this
> is a different variant thereof?

Actually, after some more work, there's an anomaly when compiling RC1
with VisualAge C. (I'm not sure if it bit earlier 7.4 releases; this
is the first time I have tried 7.4 with VAC/VACPP. There were no
issues when I compiled 7.3.4 with VAC/VACPP.)

'CC=/usr/vac/bin/xlc'

bash-2.05a$ more src/test/regress/regression.diffs
*** ./expected/geometry.out Fri Oct 31 22:07:07 2003
--- ./results/geometry.out Tue Nov 4 13:09:02 2003
***************
*** 117,123 ****
| (5.1,34.5) | [(1,2),(3,4)] | (3,4)
| (-5,-12) | [(1,2),(3,4)] | (1,2)
| (10,10) | [(1,2),(3,4)] | (3,4)
! | (0,0) | [(0,0),(6,6)] | (-0,0)
| (-10,0) | [(0,0),(6,6)] | (0,0)
| (-3,4) | [(0,0),(6,6)] | (0.5,0.5)
| (5.1,34.5) | [(0,0),(6,6)] | (6,6)
--- 117,123 ----
| (5.1,34.5) | [(1,2),(3,4)] | (3,4)
| (-5,-12) | [(1,2),(3,4)] | (1,2)
| (10,10) | [(1,2),(3,4)] | (3,4)
! | (0,0) | [(0,0),(6,6)] | (0,0)
| (-10,0) | [(0,0),(6,6)] | (0,0)
| (-3,4) | [(0,0),(6,6)] | (0.5,0.5)
| (5.1,34.5) | [(0,0),(6,6)] | (6,6)

======================================================================

So long as we're not expecting integrability of anything, I'm game for
-0 to be treated as mostly-equivalent to 0.

With VisualAge C++ *'CC=/usr/vacpp/bin/xlc'), I see the very same
result as with VAC.

This _seems_ a cosmetic difference, or am I way wrong?
--
select 'cbbrowne' || '@' || 'libertyrms.info';
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 17:52:13
Message-ID: 8765i058du.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:

> > vacuum_page_per_delay = 2
> > vacuum_time_per_delay = 10
>
> That's exactly what I did ... look at the combined experiment posted under
> subject "Experimental ARC implementation". The two parameters are named
> vacuum_page_groupsize and vacuum_page_delay.

FWIW this seems like a good idea for other reasons too, the hard drive and the
kernel are going to read multiple sequential blocks anyways whether you sleep
on them or not. Better to read enough blocks to take advantage of the
readahead without saturating the drive, then sleep to let those buffers age
out. If you read one block then sleep the buffers of readahead may get aged
out and have to be fetched again, which would actually increase the amount of
i/o bandwidth used.

I would expect much higher vacuum_page_per_delay's would probably not have a
noticable effect and be much faster. Something like

vacuum_page_per_delay = 128
vacuum_time_per_delay = 100

Or more likely, something in-between.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)libertyrms(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: RC1 on AIX - Some Anti-results
Date: 2003-11-04 18:24:12
Message-ID: 23493.1067970252@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne <cbbrowne(at)libertyrms(dot)info> writes:
> This _seems_ a cosmetic difference, or am I way wrong?

I think you can ignore it. It's odd that your setup seems to support
minus zero (else there'd be more diffs) but doesn't get the right answer
for this single computation. Still, it's basically a roundoff issue,
and as such a legitimate platform-specific behavior.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 18:28:58
Message-ID: 87znfc3s45.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I would like to see us go over to fsync, or some other technique that
> gives more certainty about when the write has occurred. There might be
> some scope that way to allow stretching out the I/O, too.
>
> The main problem with this is knowing which files need to be fsync'd.

Why could the postmaster not just fsync *every* file? Does any OS make it a
slow operation to fsync a file that has no pending writes? Would we even care,
it would mean the checkpoint would take longer but not actually issue any
extra i/o.

I'm assuming fsync syncs writes issued by other processes on the same file,
which isn't necessarily true though. Otherwise every process would have to
fsync every file descriptor it has open.

> The only idea I have come up with is to move all buffer write operations
> into a background writer process, which could easily keep track of
> every file it's written into since the last checkpoint.

I fear this approach. It seems to limit a lot of design flexibility later. But
I can't come up with any concrete way it limits things so perhaps that
instinct is just fud.

It also can become a point of contention. At least on Oracle you often need
multiple such processes to keep up with the i/o bandwidth.

> Actually, once you build it this way, you could make all writes synchronous
> (open the files O_SYNC) so that there is never any need for explicit fsync
> at checkpoint time.

Or using aio write ahead as much as you want and then just make checkpoint
block until all the writes are completed. You don't actually need to rush them
at all, just know when they're done. That would completely eliminate the i/o
storm without changing the actual pattern of writes at all.

--
greg


From: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 20:41:30
Message-ID: Pine.LNX.4.33.0311041340410.9104-100000@css120.ihs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 4 Nov 2003, Tom Lane wrote:

> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > What still needs to be addressed is the IO storm cause by checkpoints. I
> > see it much relaxed when stretching out the BufferSync() over most of
> > the time until the next one should occur. But the kernel sync at it's
> > end still pushes the system hard against the wall.
>
> I have never been happy with the fact that we use sync(2) at all. Quite
> aside from the "I/O storm" issue, sync() is really an unsafe way to do a
> checkpoint, because there is no way to be certain when it is done. And
> on top of that, it does too much, because it forces syncing of files
> unrelated to Postgres.
>
> I would like to see us go over to fsync, or some other technique that
> gives more certainty about when the write has occurred. There might be
> some scope that way to allow stretching out the I/O, too.
>
> The main problem with this is knowing which files need to be fsync'd.

Wasn't this a problem that the win32 port had to solve by keeping a list
of all files that need fsyncing since Windows doesn't do sync() in the
classical sense? If so, then could we use that code to keep track of the
files that need fsyncing?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 21:10:53
Message-ID: 24941.1067980253@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> The main problem with this is knowing which files need to be fsync'd.

> Why could the postmaster not just fsync *every* file?

You want to find, open, and fsync() every file in the database cluster
for every checkpoint? Sounds like a non-starter to me. In typical
situations I'd expect there to be lots of files that have no writes
during any given checkpoint interval (system catalogs for instance).

> I'm assuming fsync syncs writes issued by other processes on the same file,
> which isn't necessarily true though.

It was already pointed out that we can't rely on that assumption.

> Or using aio write ahead as much as you want and then just make checkpoint
> block until all the writes are completed. You don't actually need to rush them
> at all, just know when they're done.

If the objective is to avoid an i/o storm, I don't think this does it.
The system could easily delay most of the writes until the next syncer()
pass.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Postgresql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 21:17:39
Message-ID: 3FA81773.9080204@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

scott.marlowe wrote:

>On Tue, 4 Nov 2003, Tom Lane wrote:
>
>
>>The main problem with this is knowing which files need to be fsync'd.
>>
>>
>
>Wasn't this a problem that the win32 port had to solve by keeping a list
>of all files that need fsyncing since Windows doesn't do sync() in the
>classical sense? If so, then could we use that code to keep track of the
>files that need fsyncing?
>
>
>

according to the win32 page at
http://momjian.postgresql.org/main/writings/pgsql/win32.html this is
still to be done. I seem to recall Bruce saying that SRA had found the
solution to this, something along these lines, but I am not sure the
code is there yet (don't have access to that branch on this machine)

cheers

andrew


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-04 21:53:32
Message-ID: 3FA81FDC.9030508@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>
>> How I can see the background writer operating is that he's keeping the
>> buffers in the order of the LRU chain(s) clean, because those are the
>> buffers that most likely get replaced soon. In my experimental ARC code
>> it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and
>> n2 dirty buffers (n1+n2 configurable), then fsync all files that have
>> been involved in that, nap depending on where he got down the queues (to
>> increase the write rate when running low on clean buffers), and do it
>> all over again.
>
> You probably need one more knob here: how often to issue the fsyncs.
> I'm not convinced "once per outer loop" is a sufficient answer.
> Otherwise this is sounding pretty good.

This is definitely heading into the right direction.

I currently have a crude and ugly hacked system, that does checkpoints
every minute but streches them out over the whole time. It writes out
the dirty buffers in T1+T2 LRU order intermixed, streches out the flush
over the whole checkpoint interval and does sync()+usleep() every 32
blocks (if it has time to do this).

This is clearly the wrong way to implement it, but ...

The same system has ARC and delayed vacuum. With normal, unmodified
checkpoints every 300 seconds, the transaction responsetime for
new_order still peaks at over 30 seconds (5 is already too much) so the
system basically come to a freeze during a checkpoint.

Now with this high-frequent sync()ing and checkpointing by the minute,
the entire system load levels out really nice. Basically it's constantly
checkpointing. So maybe the thing we're looking for is to make the
checkpoint process the background buffer writer process and let it
checkpoint 'round the clock. Of course, with a bit more selectivity on
what to fsync and not doing system wide sync() every 10-500 milliseconds :-)

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-05 01:26:02
Message-ID: 87islz4ndh.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >> The main problem with this is knowing which files need to be fsync'd.
>
> > Why could the postmaster not just fsync *every* file?
>
> You want to find, open, and fsync() every file in the database cluster
> for every checkpoint? Sounds like a non-starter to me. In typical
> situations I'd expect there to be lots of files that have no writes
> during any given checkpoint interval (system catalogs for instance).

Except a) this is outside any critical path, and b) only done every few
minutes and c) the fsync calls on files with no dirty buffers ought to be
cheap, at least as far as i/o.

So even a few hundred extra open/fsync/close syscalls per minute wouldn't
really cause any extra i/o and wouldn't be happening frequently enough to use
any noticeable cpu.

> > I'm assuming fsync syncs writes issued by other processes on the same file,
> > which isn't necessarily true though.
>
> It was already pointed out that we can't rely on that assumption.

So the NetBSD and Sun developers I checked with both asserted fsync does in
fact guarantee this. And SUSv2 seems to back them up:

The fsync() function can be used by an application to indicate that all
data for the open file description named by fildes is to be transferred to
the storage device associated with the file described by fildes in an
implementation-dependent manner. The fsync() function does not return
until the system has completed that action or until an error is detected.

http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html

--
greg


From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-05 06:00:22
Message-ID: 3FA891F6.7080406@colorfullife.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:

>>>I'm assuming fsync syncs writes issued by other processes on the same file,
>>>which isn't necessarily true though.
>>>
>>>
>>It was already pointed out that we can't rely on that assumption.
>>
>>
>
>So the NetBSD and Sun developers I checked with both asserted fsync does in
>fact guarantee this. And SUSv2 seems to back them up:
>
>
At least Linux had one problem: fsync() syncs the inode to disk, but not
the directory entry: if you rename a file, open it, write to it, fsync,
and the computer crashes, then it's not guaranteed that the file rename
is on the disk.
I think only the old ext2 is affected, not the journaling filesystems.

--
Manfred


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Manfred Spraul <manfred(at)colorfullife(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-05 06:06:09
Message-ID: 877k2f4aem.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Manfred Spraul <manfred(at)colorfullife(dot)com> writes:

> Greg Stark wrote:
>
> >>>I'm assuming fsync syncs writes issued by other processes on the same file,
> >>>which isn't necessarily true though.
> >>>
> >>It was already pointed out that we can't rely on that assumption.
> >>
> >
> >So the NetBSD and Sun developers I checked with both asserted fsync does in
> >fact guarantee this. And SUSv2 seems to back them up:
> >

> At least Linux had one problem: fsync() syncs the inode to disk, but not the
> directory entry: if you rename a file, open it, write to it, fsync, and the
> computer crashes, then it's not guaranteed that the file rename is on the disk.
> I think only the old ext2 is affected, not the journaling filesystems.

That's true. But why would postgres ever have to worry about files being
renamed being synced? Tables and indexes don't get their files renamed
typically. WAL logs?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-05 15:09:01
Message-ID: 1969.1068044941@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> You want to find, open, and fsync() every file in the database cluster
>> for every checkpoint? Sounds like a non-starter to me.

> Except a) this is outside any critical path, and b) only done every few
> minutes and c) the fsync calls on files with no dirty buffers ought to be
> cheap, at least as far as i/o.

The directory search and opening of the files is in itself nontrivial
overhead ... particularly on systems where open(2) isn't speedy, such
as Solaris. I also disbelieve your assumption that fsync'ing a file
that doesn't need it will be free. That depends entirely on what sort
of indexes the OS keeps on its buffer cache. There are Unixen where
fsync requires a scan through the entire buffer cache because there is
no data structure that permits finding associated buffers any more
efficiently than that. (IIRC, the HPUX system I'm typing this on is
like that.) On those sorts of systems, we'd be way better off to use
O_SYNC or O_DSYNC on all our writes than to invoke multiple fsyncs.
Check the archives --- this was all gone into in great detail when we
were testing alternative methods for fsyncing the WAL files.

> So the NetBSD and Sun developers I checked with both asserted fsync does in
> fact guarantee this. And SUSv2 seems to back them up:

> The fsync() function can be used by an application to indicate that all
> data for the open file description named by fildes is to be transferred to
> the storage device associated with the file described by fildes in an
> implementation-dependent manner.

The question here is what is meant by "data for the open file
description". If it said "all data for the file referenced by the open
FD" then I would agree that the spec says what you claim. As is, I
think it would be entirely within the spec for the OS to dump only
buffers that had been dirtied through that particular FD. Notice that
the last part of the sentence is careful to respect the distinction
between the FD and the file; why isn't the first part?

regards, tom lane


From: "Stephen" <jleelim(at)xxxxxxx(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-05 17:24:56
Message-ID: Qmaqb.13507$GN3.10724@nntp-post.primus.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The delay patch worked so well, I couldn't resist asking if a similar patch
could be added for COPY command (pg_dump). It's just an extension of the
same idea. On a large DB, backups can take very long while consuming a lot
of IO slowing down other select and write operations. We operate on a backup
window during low traffic period at night. It'll be nice to be able to run
pg_dump *anytime* and no longer need to worry about the backup window.
Backups will take longer to run, but like in the case of the VACUUM, it's a
win for many people to be able to let it run in the background through the
whole day. The delay should be optional and defaults to zero so those who
wish to backup immediately can still do it. The way I see it, routine
backups and vacuums should be ubiquitous once properly configured.

Regards,

Stephen

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in message
news:15456(dot)1067796035(at)sss(dot)pgh(dot)pa(dot)us(dot)(dot)(dot)
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > I am currently looking at implementing ARC as a replacement strategy. I
> > don't have anything that works yet, so I can't really tell what the
> > result would be and it might turn out that we want both features.
>
> It's likely that we would. As someone (you?) already pointed out,
> VACUUM has bad side-effects both in terms of cache flushing and in
> terms of sheer I/O load. Those effects require different fixes AFAICS.
>
> One thing that bothers me here is that I don't see how adjusting our
> own buffer replacement strategy is going to do much of anything when
> we cannot control the kernel's buffer replacement strategy. To get any
> real traction we'd have to go back to the "take over most of RAM for
> shared buffers" approach, which we already know to have a bunch of
> severe disadvantages.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 04:07:12
Message-ID: 200311100407.hAA47Cp22890@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >> You want to find, open, and fsync() every file in the database cluster
> >> for every checkpoint? Sounds like a non-starter to me.
>
> > Except a) this is outside any critical path, and b) only done every few
> > minutes and c) the fsync calls on files with no dirty buffers ought to be
> > cheap, at least as far as i/o.
>
> The directory search and opening of the files is in itself nontrivial
> overhead ... particularly on systems where open(2) isn't speedy, such
> as Solaris. I also disbelieve your assumption that fsync'ing a file
> that doesn't need it will be free. That depends entirely on what sort
> of indexes the OS keeps on its buffer cache. There are Unixen where
> fsync requires a scan through the entire buffer cache because there is
> no data structure that permits finding associated buffers any more
> efficiently than that. (IIRC, the HPUX system I'm typing this on is
> like that.) On those sorts of systems, we'd be way better off to use
> O_SYNC or O_DSYNC on all our writes than to invoke multiple fsyncs.
> Check the archives --- this was all gone into in great detail when we
> were testing alternative methods for fsyncing the WAL files.

Not sure on this one --- let's look at our options
O_SYNC
fsync
sync

Now, O_SYNC is going to force every write to the disk. If we have a
transaction that has to write lots of buffers (has to write them to
reuse the shared buffer), it will have to wait for every buffer to hit
disk before the write returns --- this seems terrible to me and gives
the drive no way to group adjacent writes. Even on HPUX, which has poor
fsync dirty buffer detection, if the fsync is outside the main
processing loop (checkpoint process), isn't fsync better than O_SYNC?
Now, if we are sure that writes will happen only in the checkpoint
process, O_SYNC would be OK, I guess, but will we ever be sure of that?
I can't imagine a checkpoint process keeping up with lots of active
backends, especially if the writes use O_SYNC. The problem is that
instead of having backend write everything to kernel buffers, we are all
of a sudden forcing all writes of dirty buffers to disk. sync() starts
to look very attractive compared to that option.

fsync is better in that we can force it after a number of writes, and
can delay it, so we can write a buffer and reuse it, then later issue
the fsync. That is a win, though it doesn't allow the drive to group
adjacent writes in different files. Sync of course allows grouping of
all writes by the drive, but writes all non-PostgreSQL dirty buffers
too. Ideally, we would have an fsync() where we could pass it a list of
our files and it would do all of them optimally.

From what I have heard so far, sync() still seems like the most
efficient method. I know it only schedules write, but with a sleep
after it, it seems like maybe the best bet.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 04:14:20
Message-ID: 200311100414.hAA4EKu23543@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > What still needs to be addressed is the IO storm cause by checkpoints. I
> > see it much relaxed when stretching out the BufferSync() over most of
> > the time until the next one should occur. But the kernel sync at it's
> > end still pushes the system hard against the wall.
>
> I have never been happy with the fact that we use sync(2) at all. Quite
> aside from the "I/O storm" issue, sync() is really an unsafe way to do a
> checkpoint, because there is no way to be certain when it is done. And
> on top of that, it does too much, because it forces syncing of files
> unrelated to Postgres.
>
> I would like to see us go over to fsync, or some other technique that
> gives more certainty about when the write has occurred. There might be
> some scope that way to allow stretching out the I/O, too.
>
> The main problem with this is knowing which files need to be fsync'd.
> The only idea I have come up with is to move all buffer write operations
> into a background writer process, which could easily keep track of
> every file it's written into since the last checkpoint. This could cause
> problems though if a backend wants to acquire a free buffer and there's
> none to be had --- do we want it to wait for the background process to
> do something? We could possibly say that backends may write dirty
> buffers for themselves, but only if they fsync them immediately. As
> long as this path is seldom taken, the extra fsyncs shouldn't be a big
> performance problem.
>
> Actually, once you build it this way, you could make all writes
> synchronous (open the files O_SYNC) so that there is never any need for
> explicit fsync at checkpoint time. The background writer process would
> be the one incurring the wait in most cases, and that's just fine. In
> this way you could directly control the rate at which writes are issued,
> and there's no I/O storm at all. (fsync could still cause an I/O storm
> if there's lots of pending writes in a single file.)

This outlines the same issue --- a very active backend might dirty 5k
buffers --- if those 5k buffers have to be written using O_SYNC, it will
take much longer than doing 5k buffer writes and doing an fsync() or
sync() at the end.

Having another process do the writing does allow some paralellism, but
people don't seem to care of buffers having to be read in from the
kernel buffer cache, so what big benefit do we get by having someone
else write into the kernel buffer cache, except allowing a central place
to fsync, and is it worth it considering that it might be impossible to
configure a system where the writer process can keep up with all the
backends?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 04:18:33
Message-ID: 200311100418.hAA4IXf23844@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

scott.marlowe wrote:
> On Tue, 4 Nov 2003, Tom Lane wrote:
>
> > Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > > What still needs to be addressed is the IO storm cause by checkpoints. I
> > > see it much relaxed when stretching out the BufferSync() over most of
> > > the time until the next one should occur. But the kernel sync at it's
> > > end still pushes the system hard against the wall.
> >
> > I have never been happy with the fact that we use sync(2) at all. Quite
> > aside from the "I/O storm" issue, sync() is really an unsafe way to do a
> > checkpoint, because there is no way to be certain when it is done. And
> > on top of that, it does too much, because it forces syncing of files
> > unrelated to Postgres.
> >
> > I would like to see us go over to fsync, or some other technique that
> > gives more certainty about when the write has occurred. There might be
> > some scope that way to allow stretching out the I/O, too.
> >
> > The main problem with this is knowing which files need to be fsync'd.
>
> Wasn't this a problem that the win32 port had to solve by keeping a list
> of all files that need fsyncing since Windows doesn't do sync() in the
> classical sense? If so, then could we use that code to keep track of the
> files that need fsyncing?

Yes, I have that code from SRA. They used threading, so they recorded
all the open files in local memory and opened/fsync/closed them for
checkpoints. We have to store the file names in a shared area, perhaps
an area of shared memory with an overflow to a disk file.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 04:31:39
Message-ID: 200311100431.hAA4Vdf29615@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I would be interested to know if you have the background write process
writing old dirty buffers to kernel buffers continually if the sync()
load is diminished. What this does is to push more dirty buffers into
the kernel cache in hopes the OS will write those buffers on its own
before the checkpoint does its write/sync work. This might allow us to
reduce sync() load while preventing the need for O_SYNC/fsync().

Perhaps sync() is bad partly because the checkpoint runs through all the
dirty shared buffers and writes them all to the kernel and then issues
sync() almost guaranteeing a flood of writes to the disk. This method
would find fewer dirty buffers in the shared buffer cache, and therefore
fewer kernel writes needed by sync().

---------------------------------------------------------------------------

Jan Wieck wrote:
> Tom Lane wrote:
>
> > Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> >
> >> How I can see the background writer operating is that he's keeping the
> >> buffers in the order of the LRU chain(s) clean, because those are the
> >> buffers that most likely get replaced soon. In my experimental ARC code
> >> it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and
> >> n2 dirty buffers (n1+n2 configurable), then fsync all files that have
> >> been involved in that, nap depending on where he got down the queues (to
> >> increase the write rate when running low on clean buffers), and do it
> >> all over again.
> >
> > You probably need one more knob here: how often to issue the fsyncs.
> > I'm not convinced "once per outer loop" is a sufficient answer.
> > Otherwise this is sounding pretty good.
>
> This is definitely heading into the right direction.
>
> I currently have a crude and ugly hacked system, that does checkpoints
> every minute but streches them out over the whole time. It writes out
> the dirty buffers in T1+T2 LRU order intermixed, streches out the flush
> over the whole checkpoint interval and does sync()+usleep() every 32
> blocks (if it has time to do this).
>
> This is clearly the wrong way to implement it, but ...
>
> The same system has ARC and delayed vacuum. With normal, unmodified
> checkpoints every 300 seconds, the transaction responsetime for
> new_order still peaks at over 30 seconds (5 is already too much) so the
> system basically come to a freeze during a checkpoint.
>
> Now with this high-frequent sync()ing and checkpointing by the minute,
> the entire system load levels out really nice. Basically it's constantly
> checkpointing. So maybe the thing we're looking for is to make the
> checkpoint process the background buffer writer process and let it
> checkpoint 'round the clock. Of course, with a bit more selectivity on
> what to fsync and not doing system wide sync() every 10-500 milliseconds :-)
>
>
> Jan
>
> --
> #======================================================================#
> # It's easier to get forgiveness for being wrong than for being right. #
> # Let's break this rule - forgive me. #
> #================================================== JanWieck(at)Yahoo(dot)com #
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 04:34:16
Message-ID: 200311100434.hAA4YGk00131@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > That is part of the idea. The whole idea is to issue "physical" writes
> > at a fairly steady rate without increasing the number of them
> > substantial or interfering with the drives opinion about their order too
> > much. I think O_SYNC for random access can be in conflict with write
> > reordering.
>
> Good point. But if we issue lots of writes without fsync then we still
> have the problem of a write storm when the fsync finally occurs, while
> if we fsync too often then we constrain the write order too much. There
> will need to be some tuning here.

I know the BSD's have trickle sync --- if we write the dirty buffers to
kernel buffers many seconds before our checkpoint, the kernel might
right them to disk for use and sync() will not need to do it.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 04:48:49
Message-ID: 200311100448.hAA4mnT01777@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Andrew Sullivan <andrew(at)libertyrms(dot)info> writes:
> > On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
> >> real traction we'd have to go back to the "take over most of RAM for
> >> shared buffers" approach, which we already know to have a bunch of
> >> severe disadvantages.
>
> > I know there are severe disadvantages in the current implementation,
> > but are there in-principle severe disadvantages?
>
> Yes. For one, since we cannot change the size of shared memory
> on-the-fly (at least not portably), there is no opportunity to trade off
> memory usage dynamically between processes and disk buffers. For
> another, on many systems shared memory is subject to being swapped out.
> Swapping out dirty buffers is a performance killer, because they must be
> swapped back in again before they can be written to where they should
> have gone. The only way to avoid this is to keep the number of shared
> buffers small enough that they all remain fairly "hot" (recently used)
> and so the kernel won't be tempted to swap out any part of the region.

Agreed, we can't resize shared memory, but I don't think most OS's swap
out shared memory, and even if they do, they usually have a kernel
configuration parameter to lock it into kernel memory. All the old
unixes locked the shared memory into kernel address space and in fact
this is why many of them required a kernel recompile to increase shared
memory. I hope the ones that have pagable shared memory have a way to
prevent it --- at least FreeBSD does, not sure about Linux.

Now, the disadvantages of large kernel cache, small PostgreSQL buffer
cache is that data has to be transfered to/from the kernel buffers, and
second, we can't control the kernel's cache replacement strategy, and
will probably not be able to in the near future, while we do control our
own buffer cache replacement strategy.

Looking at the advantages/disadvantages, a large shared buffer cache
looks pretty good to me.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Joe Conway <mail(at)joeconway(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jan Wieck <JanWieck(at)yahoo(dot)com>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 04:54:25
Message-ID: 3FAF1A01.3060301@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> Having another process do the writing does allow some paralellism, but
> people don't seem to care of buffers having to be read in from the
> kernel buffer cache, so what big benefit do we get by having someone
> else write into the kernel buffer cache, except allowing a central place
> to fsync, and is it worth it considering that it might be impossible to
> configure a system where the writer process can keep up with all the
> backends?

This might be far fetched, but I wonder if having a writer process opens
up the possibility of running PostgreSQL in a cluster? I'm thinking of
two servers, mounted to the same data volume, and some kind of
coordination between the writer processes. Anyone know if this is
similar to how Oracle handles RAC?

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 05:00:06
Message-ID: 3FAF1B56.4050204@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> Agreed, we can't resize shared memory, but I don't think most OS's swap
> out shared memory, and even if they do, they usually have a kernel
> configuration parameter to lock it into kernel memory. All the old
> unixes locked the shared memory into kernel address space and in fact
> this is why many of them required a kernel recompile to increase shared
> memory. I hope the ones that have pagable shared memory have a way to
> prevent it --- at least FreeBSD does, not sure about Linux.

I'm pretty sure at least Linux, Solaris, and HPUX all work this way --
otherwise Oracle would have the same problem with their SGA, which is
kept in shared memory.

Joe


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 13:46:36
Message-ID: 3FAF96BC.9050506@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> I would be interested to know if you have the background write process
> writing old dirty buffers to kernel buffers continually if the sync()
> load is diminished. What this does is to push more dirty buffers into
> the kernel cache in hopes the OS will write those buffers on its own
> before the checkpoint does its write/sync work. This might allow us to
> reduce sync() load while preventing the need for O_SYNC/fsync().

I tried that first. Linux 2.4 does not, as long as you don't tell it by
reducing the dirty data block aging time with update(8). So you have to
force it to utilize the write bandwidth in the meantime. For that you
have to call sync() or fsync() on something.

Maybe O_SYNC is not as bad an option as it seems. In my patch, the
checkpointer flushes the buffers in LRU order, meaning it flushes the
least recently used ones first. This has the side effect that buffers
returned for replacement (on a cache miss, when the backend needs to
read the block) are most likely to be flushed/clean. So it reduces the
write load of backends and thus the probability that a backend is ever
blocked waiting on an O_SYNC'd write().

I will add some counters and gather some statistics how often the
backend in comparision to the checkpointer calls write().

>
> Perhaps sync() is bad partly because the checkpoint runs through all the
> dirty shared buffers and writes them all to the kernel and then issues
> sync() almost guaranteeing a flood of writes to the disk. This method
> would find fewer dirty buffers in the shared buffer cache, and therefore
> fewer kernel writes needed by sync().

I don't understand this? How would what method reduce the number of page
buffers the backends modify?

Jan

>
> ---------------------------------------------------------------------------
>
> Jan Wieck wrote:
>> Tom Lane wrote:
>>
>> > Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> >
>> >> How I can see the background writer operating is that he's keeping the
>> >> buffers in the order of the LRU chain(s) clean, because those are the
>> >> buffers that most likely get replaced soon. In my experimental ARC code
>> >> it would traverse the T1 and T2 queues from LRU to MRU, write out n1 and
>> >> n2 dirty buffers (n1+n2 configurable), then fsync all files that have
>> >> been involved in that, nap depending on where he got down the queues (to
>> >> increase the write rate when running low on clean buffers), and do it
>> >> all over again.
>> >
>> > You probably need one more knob here: how often to issue the fsyncs.
>> > I'm not convinced "once per outer loop" is a sufficient answer.
>> > Otherwise this is sounding pretty good.
>>
>> This is definitely heading into the right direction.
>>
>> I currently have a crude and ugly hacked system, that does checkpoints
>> every minute but streches them out over the whole time. It writes out
>> the dirty buffers in T1+T2 LRU order intermixed, streches out the flush
>> over the whole checkpoint interval and does sync()+usleep() every 32
>> blocks (if it has time to do this).
>>
>> This is clearly the wrong way to implement it, but ...
>>
>> The same system has ARC and delayed vacuum. With normal, unmodified
>> checkpoints every 300 seconds, the transaction responsetime for
>> new_order still peaks at over 30 seconds (5 is already too much) so the
>> system basically come to a freeze during a checkpoint.
>>
>> Now with this high-frequent sync()ing and checkpointing by the minute,
>> the entire system load levels out really nice. Basically it's constantly
>> checkpointing. So maybe the thing we're looking for is to make the
>> checkpoint process the background buffer writer process and let it
>> checkpoint 'round the clock. Of course, with a bit more selectivity on
>> what to fsync and not doing system wide sync() every 10-500 milliseconds :-)
>>
>>
>> Jan
>>
>> --
>> #======================================================================#
>> # It's easier to get forgiveness for being wrong than for being right. #
>> # Let's break this rule - forgive me. #
>> #================================================== JanWieck(at)Yahoo(dot)com #
>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 5: Have you checked our extensive FAQ?
>>
>> http://www.postgresql.org/docs/faqs/FAQ.html
>>
>

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 14:18:31
Message-ID: 3FAF9E37.5070400@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

> Now, O_SYNC is going to force every write to the disk. If we have a
> transaction that has to write lots of buffers (has to write them to
> reuse the shared buffer)

So make the background writer/checkpointer keeping the LRU head clean. I
explained that 3 times now.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 14:23:37
Message-ID: 200311101423.hAAENbv10754@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> Bruce Momjian wrote:
> > I would be interested to know if you have the background write process
> > writing old dirty buffers to kernel buffers continually if the sync()
> > load is diminished. What this does is to push more dirty buffers into
> > the kernel cache in hopes the OS will write those buffers on its own
> > before the checkpoint does its write/sync work. This might allow us to
> > reduce sync() load while preventing the need for O_SYNC/fsync().
>
> I tried that first. Linux 2.4 does not, as long as you don't tell it by
> reducing the dirty data block aging time with update(8). So you have to
> force it to utilize the write bandwidth in the meantime. For that you
> have to call sync() or fsync() on something.
>
> Maybe O_SYNC is not as bad an option as it seems. In my patch, the
> checkpointer flushes the buffers in LRU order, meaning it flushes the
> least recently used ones first. This has the side effect that buffers
> returned for replacement (on a cache miss, when the backend needs to
> read the block) are most likely to be flushed/clean. So it reduces the
> write load of backends and thus the probability that a backend is ever
> blocked waiting on an O_SYNC'd write().
>
> I will add some counters and gather some statistics how often the
> backend in comparision to the checkpointer calls write().

OK, new idea. How about if you write() the buffers, mark them as clean
and unlock them, then issue fsync(). The advantage here is that we can
allow the buffer to be reused while we wait for the fsync to complete.
Obviously, O_SYNC is not going to allow that. Another idea --- if
fsync() is slow because it can't find the dirty buffers, use write() to
write the buffers, copy the buffer to local memory, mark it as clean,
then open the file with O_SYNC and write it again. Of course, I am just
throwing out ideas here. The big thing I am concerned about is that
reusing buffers not take too long.

> > Perhaps sync() is bad partly because the checkpoint runs through all the
> > dirty shared buffers and writes them all to the kernel and then issues
> > sync() almost guaranteeing a flood of writes to the disk. This method
> > would find fewer dirty buffers in the shared buffer cache, and therefore
> > fewer kernel writes needed by sync().
>
> I don't understand this? How would what method reduce the number of page
> buffers the backends modify?

What I was saying is that if we only write() just before a checkpoint,
we never give the kernel a chance to write the buffers on its own. I
figured if we wrote them earlier, the kernel might write them for us and
sync wouldn't need to do it.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 14:25:49
Message-ID: 200311101425.hAAEPnE10962@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> Bruce Momjian wrote:
>
> > Now, O_SYNC is going to force every write to the disk. If we have a
> > transaction that has to write lots of buffers (has to write them to
> > reuse the shared buffer)
>
> So make the background writer/checkpointer keeping the LRU head clean. I
> explained that 3 times now.

If the background cleaner has to not just write() but write/fsync or
write/O_SYNC, it isn't going to be able to clean them fast enough. It
creates a bottleneck where we didn't have one before.

We are trying to eliminate an I/O storm during checkpoint, but the
solutions seem to be making the non-checkpoint times slower.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 14:25:53
Message-ID: 3FAF9FF1.1010005@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

What bothers me a little is that you keep telling us that you have all
that great code from SRA. Do you have any idea when they intend to share
this with us and contribute the stuff? I mean at least some pieces
maybe? You personally got all the code from NuSphere AKA PeerDirect even
weeks before it got released. Did any PostgreSQL developer other than
you ever look at the SRA code?

Jan

Bruce Momjian wrote:

> scott.marlowe wrote:
>> On Tue, 4 Nov 2003, Tom Lane wrote:
>>
>> > Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> > > What still needs to be addressed is the IO storm cause by checkpoints. I
>> > > see it much relaxed when stretching out the BufferSync() over most of
>> > > the time until the next one should occur. But the kernel sync at it's
>> > > end still pushes the system hard against the wall.
>> >
>> > I have never been happy with the fact that we use sync(2) at all. Quite
>> > aside from the "I/O storm" issue, sync() is really an unsafe way to do a
>> > checkpoint, because there is no way to be certain when it is done. And
>> > on top of that, it does too much, because it forces syncing of files
>> > unrelated to Postgres.
>> >
>> > I would like to see us go over to fsync, or some other technique that
>> > gives more certainty about when the write has occurred. There might be
>> > some scope that way to allow stretching out the I/O, too.
>> >
>> > The main problem with this is knowing which files need to be fsync'd.
>>
>> Wasn't this a problem that the win32 port had to solve by keeping a list
>> of all files that need fsyncing since Windows doesn't do sync() in the
>> classical sense? If so, then could we use that code to keep track of the
>> files that need fsyncing?
>
> Yes, I have that code from SRA. They used threading, so they recorded
> all the open files in local memory and opened/fsync/closed them for
> checkpoints. We have to store the file names in a shared area, perhaps
> an area of shared memory with an overflow to a disk file.
>

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 14:26:38
Message-ID: 18455.1068474398@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Now, if we are sure that writes will happen only in the checkpoint
> process, O_SYNC would be OK, I guess, but will we ever be sure of that?

This is a performance issue, not a correctness issue. It's okay for
backends to wait for writes as long as it happens very infrequently.
The question is whether we can design a background dirty-buffer writer
that works well enough to make it uncommon for backends to have to
write dirty buffers for themselves. If we can, then doing all the
writes O_SYNC would not be a problem.

(One possibility that could help improve the odds is to allow a certain
amount of slop in the LRU buffer reuse policy --- that is, if you see
the buffer at the tail of the LRU list is dirty, allow one of the next
few buffers to be taken instead, if it's clean. Or just keep separate
lists for dirty and clean buffers.)

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 14:35:18
Message-ID: 3FAFA226.4070006@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

> Tom Lane wrote:
>> Andrew Sullivan <andrew(at)libertyrms(dot)info> writes:
>> > On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
>> >> real traction we'd have to go back to the "take over most of RAM for
>> >> shared buffers" approach, which we already know to have a bunch of
>> >> severe disadvantages.
>>
>> > I know there are severe disadvantages in the current implementation,
>> > but are there in-principle severe disadvantages?
>>
>> Yes. For one, since we cannot change the size of shared memory
>> on-the-fly (at least not portably), there is no opportunity to trade off
>> memory usage dynamically between processes and disk buffers. For
>> another, on many systems shared memory is subject to being swapped out.
>> Swapping out dirty buffers is a performance killer, because they must be
>> swapped back in again before they can be written to where they should
>> have gone. The only way to avoid this is to keep the number of shared
>> buffers small enough that they all remain fairly "hot" (recently used)
>> and so the kernel won't be tempted to swap out any part of the region.
>
> Agreed, we can't resize shared memory, but I don't think most OS's swap
> out shared memory, and even if they do, they usually have a kernel

We can't resize shared memory because we allocate the whole thing in one
big hump - which causes the shmmax problem BTW. If we allocate that in
chunks of multiple blocks, we only have to give it a total maximum size
to get the hash tables and other stuff right from the beginning. But the
vast majority of memory, the buffers themself, can be made adjustable at
runtime.

Jan

> configuration parameter to lock it into kernel memory. All the old
> unixes locked the shared memory into kernel address space and in fact
> this is why many of them required a kernel recompile to increase shared
> memory. I hope the ones that have pagable shared memory have a way to
> prevent it --- at least FreeBSD does, not sure about Linux.
>
> Now, the disadvantages of large kernel cache, small PostgreSQL buffer
> cache is that data has to be transfered to/from the kernel buffers, and
> second, we can't control the kernel's cache replacement strategy, and
> will probably not be able to in the near future, while we do control our
> own buffer cache replacement strategy.
>
> Looking at the advantages/disadvantages, a large shared buffer cache
> looks pretty good to me.
>

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 15:05:04
Message-ID: 3FAFA920.1050207@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

> Jan Wieck wrote:
>> Bruce Momjian wrote:
>>
>> > Now, O_SYNC is going to force every write to the disk. If we have a
>> > transaction that has to write lots of buffers (has to write them to
>> > reuse the shared buffer)
>>
>> So make the background writer/checkpointer keeping the LRU head clean. I
>> explained that 3 times now.
>
> If the background cleaner has to not just write() but write/fsync or
> write/O_SYNC, it isn't going to be able to clean them fast enough. It
> creates a bottleneck where we didn't have one before.
>
> We are trying to eliminate an I/O storm during checkpoint, but the
> solutions seem to be making the non-checkpoint times slower.
>

It looks as if you're assuming that I am making the backends unable to
write on their own, so that they have to wait on the checkpointer. I
never said that.

If the checkpointer keeps the LRU heads clean, that lifts off write load
from the backends. Sure, they will be able to dirty pages faster.
Theoretically, because in practice if you have a reasonably good cache
hitrate, they will just find already dirty buffers where they just add
some more dust.

If after all the checkpointer (doing write()+whateversync) is not able
to keep up with the speed of buffers getting dirtied, the backends will
have to do some write()'s again, because they will eat up the clean
buffers at the LRU head and pass the checkpointer.

Also please notice another little change in behaviour. The old code just
went through the buffer cache sequentially, possibly flushing buffers
that got dirtied after the checkpoint started, which is way ahead of
time (they need to be flushed for the next checkpoint, not now). That
means, that if the same buffer gets dirtied again after that, we wasted
a full disk write on it. My new code creates a list of dirty blocks at
the beginning of the checkpoint, and flushes only those that are still
dirty at the time it gets to them.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 15:18:38
Message-ID: 3FAFAC4E.2060907@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

> Jan Wieck wrote:
>> Bruce Momjian wrote:
>> > I would be interested to know if you have the background write process
>> > writing old dirty buffers to kernel buffers continually if the sync()
>> > load is diminished. What this does is to push more dirty buffers into
>> > the kernel cache in hopes the OS will write those buffers on its own
>> > before the checkpoint does its write/sync work. This might allow us to
>> > reduce sync() load while preventing the need for O_SYNC/fsync().
>>
>> I tried that first. Linux 2.4 does not, as long as you don't tell it by
>> reducing the dirty data block aging time with update(8). So you have to
>> force it to utilize the write bandwidth in the meantime. For that you
>> have to call sync() or fsync() on something.
>>
>> Maybe O_SYNC is not as bad an option as it seems. In my patch, the
>> checkpointer flushes the buffers in LRU order, meaning it flushes the
>> least recently used ones first. This has the side effect that buffers
>> returned for replacement (on a cache miss, when the backend needs to
>> read the block) are most likely to be flushed/clean. So it reduces the
>> write load of backends and thus the probability that a backend is ever
>> blocked waiting on an O_SYNC'd write().
>>
>> I will add some counters and gather some statistics how often the
>> backend in comparision to the checkpointer calls write().
>
> OK, new idea. How about if you write() the buffers, mark them as clean
> and unlock them, then issue fsync(). The advantage here is that we can

Not really new, I think in my first mail I wrote that I simplified this
new mdfsyncrecent() function by calling sync() instead ... other than
that the code I posted worked exactly that way.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 15:31:10
Message-ID: 3FAFAF3E.8090108@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> Now, if we are sure that writes will happen only in the checkpoint
>> process, O_SYNC would be OK, I guess, but will we ever be sure of that?
>
> This is a performance issue, not a correctness issue. It's okay for
> backends to wait for writes as long as it happens very infrequently.
> The question is whether we can design a background dirty-buffer writer
> that works well enough to make it uncommon for backends to have to
> write dirty buffers for themselves. If we can, then doing all the
> writes O_SYNC would not be a problem.
>
> (One possibility that could help improve the odds is to allow a certain
> amount of slop in the LRU buffer reuse policy --- that is, if you see
> the buffer at the tail of the LRU list is dirty, allow one of the next
> few buffers to be taken instead, if it's clean. Or just keep separate
> lists for dirty and clean buffers.)

If the checkpointer is writing in LRU order (which is the order buffers
normally get replaced), this happening would mean that the backends have
replaced all clean buffers at the LRU head and this can only happen if
the currently running checkpointer is working way too slow. If it is
more than 30 seconds away from its target finish time, it would be a
good idea to restart by building a (guaranteed long now) new todo list
and write faster (but starting again at the LRU head). If it's too late
for that, stop napping, finish this checkpoint NOW and start a new one
immediately.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Neil Conway <neilc(at)samurai(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 16:40:45
Message-ID: 871xsg18j6.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Now, the disadvantages of large kernel cache, small PostgreSQL buffer
> cache is that data has to be transfered to/from the kernel buffers, and
> second, we can't control the kernel's cache replacement strategy, and
> will probably not be able to in the near future, while we do control our
> own buffer cache replacement strategy.

The intent of the posix_fadvise() work is to at least provide a
few hints about our I/O patterns to the kernel's buffer
cache. Although only Linux supports it (right now), that should
hopefully improve the status quo for a fairly significant portion of
our user base.

I'd be curious to see a comparison of the cost of transferring data
from the kernel's buffers to the PG bufmgr.

-Neil


From: Larry Rosenman <ler(at)lerctr(dot)org>
To: Neil Conway <neilc(at)samurai(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 17:57:35
Message-ID: 77700000.1068487055@lerlaptop-red.iadfw.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

--On Monday, November 10, 2003 11:40:45 -0500 Neil Conway
<neilc(at)samurai(dot)com> wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>> Now, the disadvantages of large kernel cache, small PostgreSQL buffer
>> cache is that data has to be transfered to/from the kernel buffers, and
>> second, we can't control the kernel's cache replacement strategy, and
>> will probably not be able to in the near future, while we do control our
>> own buffer cache replacement strategy.
>
> The intent of the posix_fadvise() work is to at least provide a
> few hints about our I/O patterns to the kernel's buffer
> cache. Although only Linux supports it (right now), that should
> hopefully improve the status quo for a fairly significant portion of
> our user base.
>
> I'd be curious to see a comparison of the cost of transferring data
> from the kernel's buffers to the PG bufmgr.
You might also look at Veritas' advisory stuff. If you want exact doc
pointers, I can provide them, but they are in the Filesystem section
of http://www.lerctr.org:8458/

LER

>
> -Neil
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly
>

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler(at)lerctr(dot)org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


From: Neil Conway <neilc(at)samurai(dot)com>
To: Larry Rosenman <ler(at)lerctr(dot)org>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 18:40:24
Message-ID: 87vfpsysmf.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Larry Rosenman <ler(at)lerctr(dot)org> writes:
> You might also look at Veritas' advisory stuff.

Thanks for the suggestion -- it looks like we can make use of
this. For the curious, the cache advisory API is documented here:

http://www.lerctr.org:8458/en/man/html.7/vxfsio.7.html
http://www.lerctr.org:8458/en/ODM_FSadmin/fssag-9.html#MARKER-9-1

Note that unlike for posix_fadvise(), the docs for this functionality
explicitly state:

Some advisories are currently maintained on a per-file, not a
per-file-descriptor, basis. This means that only one set of
advisories can be in effect for all accesses to the file. If two
conflicting applications set different advisories, both use the
last advisories that were set.

-Neil


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 18:45:23
Message-ID: 200311101845.hAAIjN021643@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> What bothers me a little is that you keep telling us that you have all
> that great code from SRA. Do you have any idea when they intend to share
> this with us and contribute the stuff? I mean at least some pieces
> maybe? You personally got all the code from NuSphere AKA PeerDirect even
> weeks before it got released. Did any PostgreSQL developer other than
> you ever look at the SRA code?

I can get the open/fsync/write/close patch from SRA released, I think.
Let me ask them now.

Tom has seen the Win32 tarball (with SRA's approval) because he wanted
to research if threading was something we should pursue. I haven't
heard a report back from him yet. If you would like to see the tarball,
I can ask them.

Agreed, I got the PeerDirect/Nusphere code very early and it was a help.
I am sure I can get some of it released. I haven't pursued the sync
Win32 patch because it is based on a threaded backend model, so it is
different from how it need to be done in a process model (all shared
file descriptors). However, I will need to get approval in the end
anyway for Win32 because I need that Win32-specific part anyway.

I just looked at the sync() call in the code and it just did _flushall:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/html/_crt__flushall.asp

I can share this because I know it was discussed when someone (SRA?)
realized _commit() didn't force all buffers to disk. In fact, _commit
is fsync().

I think the only question was whether _flushall() fsync file descriptors
that have been closed. Perhaps SRA keeps the file descriptors open
until after the checkpoint, or does it fsync closed files with dirty
buffers. Tatsuo?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andrew Sullivan <andrew(at)libertyrms(dot)info>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 18:51:54
Message-ID: 20031110185154.GW15754@libertyrms.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 09, 2003 at 08:54:25PM -0800, Joe Conway wrote:
> two servers, mounted to the same data volume, and some kind of
> coordination between the writer processes. Anyone know if this is
> similar to how Oracle handles RAC?

It is similar, yes, but there's some mighty powerful magic in that
"some kind of co-ordination". What do you do when one of the
particpants crashes, for instance?

A

--
----
Andrew Sullivan 204-4141 Yonge Street
Afilias Canada Toronto, Ontario Canada
<andrew(at)libertyrms(dot)info> M2P 2A8
+1 416 646 3304 x110


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 18:51:59
Message-ID: 200311101851.hAAIpxN22423@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Now, if we are sure that writes will happen only in the checkpoint
> > process, O_SYNC would be OK, I guess, but will we ever be sure of that?
>
> This is a performance issue, not a correctness issue. It's okay for
> backends to wait for writes as long as it happens very infrequently.
> The question is whether we can design a background dirty-buffer writer
> that works well enough to make it uncommon for backends to have to
> write dirty buffers for themselves. If we can, then doing all the
> writes O_SYNC would not be a problem.

Agreed. My concern is that right now we do write() in each backend.
Those writes are probably pretty fast, probably as fast as a read() when
the buffer is already in the kernel cache. The current discussion
involves centralizing most of the writes (centralization can be slower),
and having the writes forced to disk. That seems like it could be a
double-killer.

> (One possibility that could help improve the odds is to allow a certain
> amount of slop in the LRU buffer reuse policy --- that is, if you see
> the buffer at the tail of the LRU list is dirty, allow one of the next
> few buffers to be taken instead, if it's clean. Or just keep separate
> lists for dirty and clean buffers.)

Yes, I think you almost will have to split the LRU list into
dirty/clean, and that might make dirty buffers stay around longer.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 18:52:34
Message-ID: 200311101852.hAAIqYe22653@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> Bruce Momjian wrote:
>
> > Tom Lane wrote:
> >> Andrew Sullivan <andrew(at)libertyrms(dot)info> writes:
> >> > On Sun, Nov 02, 2003 at 01:00:35PM -0500, Tom Lane wrote:
> >> >> real traction we'd have to go back to the "take over most of RAM for
> >> >> shared buffers" approach, which we already know to have a bunch of
> >> >> severe disadvantages.
> >>
> >> > I know there are severe disadvantages in the current implementation,
> >> > but are there in-principle severe disadvantages?
> >>
> >> Yes. For one, since we cannot change the size of shared memory
> >> on-the-fly (at least not portably), there is no opportunity to trade off
> >> memory usage dynamically between processes and disk buffers. For
> >> another, on many systems shared memory is subject to being swapped out.
> >> Swapping out dirty buffers is a performance killer, because they must be
> >> swapped back in again before they can be written to where they should
> >> have gone. The only way to avoid this is to keep the number of shared
> >> buffers small enough that they all remain fairly "hot" (recently used)
> >> and so the kernel won't be tempted to swap out any part of the region.
> >
> > Agreed, we can't resize shared memory, but I don't think most OS's swap
> > out shared memory, and even if they do, they usually have a kernel
>
> We can't resize shared memory because we allocate the whole thing in one
> big hump - which causes the shmmax problem BTW. If we allocate that in
> chunks of multiple blocks, we only have to give it a total maximum size
> to get the hash tables and other stuff right from the beginning. But the
> vast majority of memory, the buffers themself, can be made adjustable at
> runtime.

That is an interesting idea.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 18:57:01
Message-ID: 200311101857.hAAIv1223046@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> > If the background cleaner has to not just write() but write/fsync or
> > write/O_SYNC, it isn't going to be able to clean them fast enough. It
> > creates a bottleneck where we didn't have one before.
> >
> > We are trying to eliminate an I/O storm during checkpoint, but the
> > solutions seem to be making the non-checkpoint times slower.
> >
>
> It looks as if you're assuming that I am making the backends unable to
> write on their own, so that they have to wait on the checkpointer. I
> never said that.
>
> If the checkpointer keeps the LRU heads clean, that lifts off write load
> from the backends. Sure, they will be able to dirty pages faster.
> Theoretically, because in practice if you have a reasonably good cache
> hitrate, they will just find already dirty buffers where they just add
> some more dust.
>
> If after all the checkpointer (doing write()+whateversync) is not able
> to keep up with the speed of buffers getting dirtied, the backends will
> have to do some write()'s again, because they will eat up the clean
> buffers at the LRU head and pass the checkpointer.

Yes, there are a couple of issues here --- first, have a background
writer to write dirty pages. This is good, no question. The bigger
question is removing sync() and using fsync() or O_SYNC for every write
--- if we do that, the backends doing private write will have to fsync
their writes too, meaning if the checkpointer can't keep up, we now have
backends doing slow writes too.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 18:57:56
Message-ID: 200311101857.hAAIvun23111@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> Bruce Momjian wrote:
>
> > Jan Wieck wrote:
> >> Bruce Momjian wrote:
> >>
> >> > Now, O_SYNC is going to force every write to the disk. If we have a
> >> > transaction that has to write lots of buffers (has to write them to
> >> > reuse the shared buffer)
> >>
> >> So make the background writer/checkpointer keeping the LRU head clean. I
> >> explained that 3 times now.
> >
> > If the background cleaner has to not just write() but write/fsync or
> > write/O_SYNC, it isn't going to be able to clean them fast enough. It
> > creates a bottleneck where we didn't have one before.
> >
> > We are trying to eliminate an I/O storm during checkpoint, but the
> > solutions seem to be making the non-checkpoint times slower.
> >
>
> It looks as if you're assuming that I am making the backends unable to
> write on their own, so that they have to wait on the checkpointer. I
> never said that.

Maybe I missed it but are those backend now doing write or write/fsync?
If the former, that is fine. If the later, it does seem slower than it
used to be.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 19:00:36
Message-ID: 200311101900.hAAJ0aV23508@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> Bruce Momjian wrote:
>
> > Jan Wieck wrote:
> >> Bruce Momjian wrote:
> >> > I would be interested to know if you have the background write process
> >> > writing old dirty buffers to kernel buffers continually if the sync()
> >> > load is diminished. What this does is to push more dirty buffers into
> >> > the kernel cache in hopes the OS will write those buffers on its own
> >> > before the checkpoint does its write/sync work. This might allow us to
> >> > reduce sync() load while preventing the need for O_SYNC/fsync().
> >>
> >> I tried that first. Linux 2.4 does not, as long as you don't tell it by
> >> reducing the dirty data block aging time with update(8). So you have to
> >> force it to utilize the write bandwidth in the meantime. For that you
> >> have to call sync() or fsync() on something.
> >>
> >> Maybe O_SYNC is not as bad an option as it seems. In my patch, the
> >> checkpointer flushes the buffers in LRU order, meaning it flushes the
> >> least recently used ones first. This has the side effect that buffers
> >> returned for replacement (on a cache miss, when the backend needs to
> >> read the block) are most likely to be flushed/clean. So it reduces the
> >> write load of backends and thus the probability that a backend is ever
> >> blocked waiting on an O_SYNC'd write().
> >>
> >> I will add some counters and gather some statistics how often the
> >> backend in comparision to the checkpointer calls write().
> >
> > OK, new idea. How about if you write() the buffers, mark them as clean
> > and unlock them, then issue fsync(). The advantage here is that we can
>
> Not really new, I think in my first mail I wrote that I simplified this
> new mdfsyncrecent() function by calling sync() instead ... other than
> that the code I posted worked exactly that way.

I am confused --- I was suggesting we call fsync after we write a few
blocks for a given table, and that was going to happen between
checkpoints. Is the sync() happening then or only at checkpoint time.

Sorry I am lost but there seems to be an email delay in my receiving the
replies.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Larry Rosenman <ler(at)lerctr(dot)org>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 19:05:22
Message-ID: 155300000.1068491122@lerlaptop-red.iadfw.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

--On Monday, November 10, 2003 13:40:24 -0500 Neil Conway
<neilc(at)samurai(dot)com> wrote:

> Larry Rosenman <ler(at)lerctr(dot)org> writes:
>> You might also look at Veritas' advisory stuff.
>
> Thanks for the suggestion -- it looks like we can make use of
> this. For the curious, the cache advisory API is documented here:
>
> http://www.lerctr.org:8458/en/man/html.7/vxfsio.7.html
> http://www.lerctr.org:8458/en/ODM_FSadmin/fssag-9.html#MARKER-9-1
>
> Note that unlike for posix_fadvise(), the docs for this functionality
> explicitly state:
>
> Some advisories are currently maintained on a per-file, not a
> per-file-descriptor, basis. This means that only one set of
> advisories can be in effect for all accesses to the file. If two
> conflicting applications set different advisories, both use the
> last advisories that were set.
BTW, if ANY developer wants to play with this, I can make an account for
them. I have ODM installed on lerami.lerctr.org (www.lerctr.org is a
CNAME).

LER

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: ler(at)lerctr(dot)org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


From: Neil Conway <neilc(at)samurai(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 19:10:15
Message-ID: 87llqoyr8o.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Another idea --- if fsync() is slow because it can't find the dirty
> buffers, use write() to write the buffers, copy the buffer to local
> memory, mark it as clean, then open the file with O_SYNC and write
> it again.

Yuck.

Do we have any idea how many kernels are out there that implement
fsync() as poorly as HPUX apparently does? I'm just wondering if we're
contemplating spending a whole lot of effort to work around a bug that
is only present on an (old?) version of HPUX. Do typical BSD derived
kernels exhibit this behavior? What about Linux? Solaris?

-Neil


From: Neil Conway <neilc(at)samurai(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Sullivan <andrew(at)libertyrms(dot)info>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 19:20:07
Message-ID: 87d6c0yqs8.fsf@mailbox.samurai.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> We can't resize shared memory because we allocate the whole thing in
> one big hump - which causes the shmmax problem BTW. If we allocate
> that in chunks of multiple blocks, we only have to give it a total
> maximum size to get the hash tables and other stuff right from the
> beginning. But the vast majority of memory, the buffers themself, can
> be made adjustable at runtime.

Yeah, writing a palloc()-style wrapper over shm has been suggested
before (by myself among others). You could do the shm allocation in
fixed-size blocks (say, 1 MB each), and then do our own memory
management to allocate and release smaller chunks of shm when
requested. I'm not sure what it really buys us, though: sure, we can
expand the shared buffer area to some degree, but

(a) how do we know what the right size of the shared buffer
area /should/ be? It is difficult enough to avoid running
the machine out of physical memory, let alone figure out
how much memory is being used by the kernel for the buffer
cache and how much we should use ourselves. I think the
DBA needs to configure this anyway.

(b) the amount of shm we can ultimately use is finite, so we
will still need to use a lot of caution when placing
dynamically-sized data structures in shm. A shm_alloc()
might help this somewhat, but I don't see how it would
remove the fundamental problem.

-Neil


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 19:20:40
Message-ID: 200311101920.hAAJKeY25939@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Neil Conway wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Another idea --- if fsync() is slow because it can't find the dirty
> > buffers, use write() to write the buffers, copy the buffer to local
> > memory, mark it as clean, then open the file with O_SYNC and write
> > it again.
>
> Yuck.
>
> Do we have any idea how many kernels are out there that implement
> fsync() as poorly as HPUX apparently does? I'm just wondering if we're
> contemplating spending a whole lot of effort to work around a bug that
> is only present on an (old?) version of HPUX. Do typical BSD derived
> kernels exhibit this behavior? What about Linux? Solaris?

Not sure, but it almost doesn't even matter --- any solution which has
fsync/O_SYNC/sync() in a critical path, even the path of replacing dirty
buffers --- is going to be too slow, I am afraid. Doesn't matter how
fast fsync() is, it is going to be slow.

I think Tom's only issue with HPUX is that even if fsync is out of the
critical path (background writer) it is going to consume lots of CPU
time finding those dirty buffers --- not sure how slow that would be.
If it is really slow on HPUX, we could disable the fsync's for the
background writer and just how the OS writes those buffers aggressively.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 19:24:48
Message-ID: 3FAFE600.9010105@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

> Jan Wieck wrote:
>> Bruce Momjian wrote:
>>
>> > Jan Wieck wrote:
>> >> Bruce Momjian wrote:
>> >>
>> >> > Now, O_SYNC is going to force every write to the disk. If we have a
>> >> > transaction that has to write lots of buffers (has to write them to
>> >> > reuse the shared buffer)
>> >>
>> >> So make the background writer/checkpointer keeping the LRU head clean. I
>> >> explained that 3 times now.
>> >
>> > If the background cleaner has to not just write() but write/fsync or
>> > write/O_SYNC, it isn't going to be able to clean them fast enough. It
>> > creates a bottleneck where we didn't have one before.
>> >
>> > We are trying to eliminate an I/O storm during checkpoint, but the
>> > solutions seem to be making the non-checkpoint times slower.
>> >
>>
>> It looks as if you're assuming that I am making the backends unable to
>> write on their own, so that they have to wait on the checkpointer. I
>> never said that.
>
> Maybe I missed it but are those backend now doing write or write/fsync?
> If the former, that is fine. If the later, it does seem slower than it
> used to be.

In my all_performance.v4.diff they do write and the checkpointer does
write+sync.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 19:34:24
Message-ID: 200311101934.hAAJYOb27581@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> >> > If the background cleaner has to not just write() but write/fsync or
> >> > write/O_SYNC, it isn't going to be able to clean them fast enough. It
> >> > creates a bottleneck where we didn't have one before.
> >> >
> >> > We are trying to eliminate an I/O storm during checkpoint, but the
> >> > solutions seem to be making the non-checkpoint times slower.
> >> >
> >>
> >> It looks as if you're assuming that I am making the backends unable to
> >> write on their own, so that they have to wait on the checkpointer. I
> >> never said that.
> >
> > Maybe I missed it but are those backend now doing write or write/fsync?
> > If the former, that is fine. If the later, it does seem slower than it
> > used to be.
>
> In my all_performance.v4.diff they do write and the checkpointer does
> write+sync.

Again, sorry to be confusing --- I might be good to try write/fsync from
the background writer if backends can do writes on their own too without
fsync. The additional fsync from the background writer should reduce
disk writing during sync(). (The fsync should happen with the buffer
unlocked.)

You stated you didn't see improvement when the background writer did
non-checkpoint writes unless you modified update(4). Adding fsync might
correct that.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ang Chin Han <angch(at)bytecraft(dot)com(dot)my>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 20:36:36
Message-ID: 200311102036.hAAKaaX06203@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Neil Conway wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Another idea --- if fsync() is slow because it can't find the dirty
> > buffers, use write() to write the buffers, copy the buffer to local
> > memory, mark it as clean, then open the file with O_SYNC and write
> > it again.
>
> Yuck.

This idea if mine will not even work unless others are prevented from
writing that data block while I am fsync'ing from local memory --- what
if someone modified and wrote that block before my block did its fsync
write? I would overwrite their new data. It was just a crazy idea.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 21:16:38
Message-ID: 3FB00036.7090206@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

> Jan Wieck wrote:
>> >> > If the background cleaner has to not just write() but write/fsync or
>> >> > write/O_SYNC, it isn't going to be able to clean them fast enough. It
>> >> > creates a bottleneck where we didn't have one before.
>> >> >
>> >> > We are trying to eliminate an I/O storm during checkpoint, but the
>> >> > solutions seem to be making the non-checkpoint times slower.
>> >> >
>> >>
>> >> It looks as if you're assuming that I am making the backends unable to
>> >> write on their own, so that they have to wait on the checkpointer. I
>> >> never said that.
>> >
>> > Maybe I missed it but are those backend now doing write or write/fsync?
>> > If the former, that is fine. If the later, it does seem slower than it
>> > used to be.
>>
>> In my all_performance.v4.diff they do write and the checkpointer does
>> write+sync.
>
> Again, sorry to be confusing --- I might be good to try write/fsync from
> the background writer if backends can do writes on their own too without
> fsync. The additional fsync from the background writer should reduce
> disk writing during sync(). (The fsync should happen with the buffer
> unlocked.)

No, you're not. But thank you for suggesting what I implemented.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-10 22:22:55
Message-ID: 200311102222.hAAMMtP21920@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> Bruce Momjian wrote:
>
> > Jan Wieck wrote:
> >> >> > If the background cleaner has to not just write() but write/fsync or
> >> >> > write/O_SYNC, it isn't going to be able to clean them fast enough. It
> >> >> > creates a bottleneck where we didn't have one before.
> >> >> >
> >> >> > We are trying to eliminate an I/O storm during checkpoint, but the
> >> >> > solutions seem to be making the non-checkpoint times slower.
> >> >> >
> >> >>
> >> >> It looks as if you're assuming that I am making the backends unable to
> >> >> write on their own, so that they have to wait on the checkpointer. I
> >> >> never said that.
> >> >
> >> > Maybe I missed it but are those backend now doing write or write/fsync?
> >> > If the former, that is fine. If the later, it does seem slower than it
> >> > used to be.
> >>
> >> In my all_performance.v4.diff they do write and the checkpointer does
> >> write+sync.
> >
> > Again, sorry to be confusing --- I might be good to try write/fsync from
> > the background writer if backends can do writes on their own too without
> > fsync. The additional fsync from the background writer should reduce
> > disk writing during sync(). (The fsync should happen with the buffer
> > unlocked.)
>
> No, you're not. But thank you for suggesting what I implemented.

OK, I did IM with Jan and I understand now --- he is using write/sync
for testing, but plans to allow several ways to force writes to disk
occasionally, probably defaulting to fsync on most platforms. Backend
will still use write only, and a checkpoint will continue using sync().

The qustion still open is whether we can push most/all writes into the
background writer so we can use fsync/open instead of sync. My point
has been that this might be hard to do with the same performance we have
now.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Andrew Sullivan <andrew(at)libertyrms(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-11 00:16:12
Message-ID: 3FB02A4C.20608@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Sullivan wrote:

> On Sun, Nov 09, 2003 at 08:54:25PM -0800, Joe Conway wrote:
>> two servers, mounted to the same data volume, and some kind of
>> coordination between the writer processes. Anyone know if this is
>> similar to how Oracle handles RAC?
>
> It is similar, yes, but there's some mighty powerful magic in that
> "some kind of co-ordination". What do you do when one of the
> particpants crashes, for instance?

What about "sympathetic crash"?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: pgman(at)candle(dot)pha(dot)pa(dot)us
Cc: JanWieck(at)Yahoo(dot)com, scott(dot)marlowe(at)ihs(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, angch(at)bytecraft(dot)com(dot)my, cbbrowne(at)acm(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-11 01:33:44
Message-ID: 20031111.103344.74750641.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Jan Wieck wrote:
> > What bothers me a little is that you keep telling us that you have all
> > that great code from SRA. Do you have any idea when they intend to share
> > this with us and contribute the stuff? I mean at least some pieces
> > maybe? You personally got all the code from NuSphere AKA PeerDirect even
> > weeks before it got released. Did any PostgreSQL developer other than
> > you ever look at the SRA code?
>
> I can get the open/fsync/write/close patch from SRA released, I think.
> Let me ask them now.

I will ask my boss then come back with the result.

> Tom has seen the Win32 tarball (with SRA's approval) because he wanted
> to research if threading was something we should pursue. I haven't
> heard a report back from him yet. If you would like to see the tarball,
> I can ask them.
>
> Agreed, I got the PeerDirect/Nusphere code very early and it was a help.
> I am sure I can get some of it released. I haven't pursued the sync
> Win32 patch because it is based on a threaded backend model, so it is
> different from how it need to be done in a process model (all shared
> file descriptors). However, I will need to get approval in the end
> anyway for Win32 because I need that Win32-specific part anyway.
>
> I just looked at the sync() call in the code and it just did _flushall:
>
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore98/html/_crt__flushall.asp
>
> I can share this because I know it was discussed when someone (SRA?)
> realized _commit() didn't force all buffers to disk. In fact, _commit
> is fsync().
>
> I think the only question was whether _flushall() fsync file descriptors
> that have been closed. Perhaps SRA keeps the file descriptors open
> until after the checkpoint, or does it fsync closed files with dirty
> buffers. Tatsuo?

In the SRA's code, the checkpoint thread opens each file (if it's not
already open of course) which has been written then fsync() it.
--
Tatsuo Ishii


From: Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-11 06:19:54
Message-ID: 200311111149.54610.shridhar_daithankar@myrealbox.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday 11 November 2003 00:50, Neil Conway wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > We can't resize shared memory because we allocate the whole thing in
> > one big hump - which causes the shmmax problem BTW. If we allocate
> > that in chunks of multiple blocks, we only have to give it a total
> > maximum size to get the hash tables and other stuff right from the
> > beginning. But the vast majority of memory, the buffers themself, can
> > be made adjustable at runtime.
>
> Yeah, writing a palloc()-style wrapper over shm has been suggested
> before (by myself among others). You could do the shm allocation in
> fixed-size blocks (say, 1 MB each), and then do our own memory
> management to allocate and release smaller chunks of shm when
> requested. I'm not sure what it really buys us, though: sure, we can
> expand the shared buffer area to some degree, but

Thinking of it, it can be put as follows. Postgresql needs shared memory
between all the backends.

If the parent postmaster mmaps anonymous memory segments and shares them with
children, postgresql wouldn't be dependent upon any kernel resourse aka
shared memory anymore.

Furthermore parent posmaster can allocate different anonymous mappings for
different databases. In addition to postgresql buffer manager overhaul, this
would make things lot better.

note that I am not suggesting mmap to maintain files on disk. So I guess that
should be OK.

I tried searching for mmap on hackers. The threads seem to be very old. One in
1998. with so many proposals of rewriting core stuff, does this have any
chance?

Just a thought.

Shridhar


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-11 13:25:07
Message-ID: 3FB0E333.9040803@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shridhar Daithankar wrote:
> On Tuesday 11 November 2003 00:50, Neil Conway wrote:
>> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> > We can't resize shared memory because we allocate the whole thing in
>> > one big hump - which causes the shmmax problem BTW. If we allocate
>> > that in chunks of multiple blocks, we only have to give it a total
>> > maximum size to get the hash tables and other stuff right from the
>> > beginning. But the vast majority of memory, the buffers themself, can
>> > be made adjustable at runtime.
>>
>> Yeah, writing a palloc()-style wrapper over shm has been suggested
>> before (by myself among others). You could do the shm allocation in
>> fixed-size blocks (say, 1 MB each), and then do our own memory
>> management to allocate and release smaller chunks of shm when
>> requested. I'm not sure what it really buys us, though: sure, we can
>> expand the shared buffer area to some degree, but
>
> Thinking of it, it can be put as follows. Postgresql needs shared memory
> between all the backends.
>
> If the parent postmaster mmaps anonymous memory segments and shares them with
> children, postgresql wouldn't be dependent upon any kernel resourse aka
> shared memory anymore.

And how does a newly mmap'ed segment propagate into a running backend?

Jan

>
> Furthermore parent posmaster can allocate different anonymous mappings for
> different databases. In addition to postgresql buffer manager overhaul, this
> would make things lot better.
>
> note that I am not suggesting mmap to maintain files on disk. So I guess that
> should be OK.
>
> I tried searching for mmap on hackers. The threads seem to be very old. One in
> 1998. with so many proposals of rewriting core stuff, does this have any
> chance?
>
> Just a thought.
>
> Shridhar
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-11 13:34:00
Message-ID: 200311111904.00772.shridhar_daithankar@myrealbox.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tuesday 11 November 2003 18:55, Jan Wieck wrote:
> Shridhar Daithankar wrote:
> > On Tuesday 11 November 2003 00:50, Neil Conway wrote:
> >> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> >> > We can't resize shared memory because we allocate the whole thing in
> >> > one big hump - which causes the shmmax problem BTW. If we allocate
> >> > that in chunks of multiple blocks, we only have to give it a total
> >> > maximum size to get the hash tables and other stuff right from the
> >> > beginning. But the vast majority of memory, the buffers themself, can
> >> > be made adjustable at runtime.
> >>
> >> Yeah, writing a palloc()-style wrapper over shm has been suggested
> >> before (by myself among others). You could do the shm allocation in
> >> fixed-size blocks (say, 1 MB each), and then do our own memory
> >> management to allocate and release smaller chunks of shm when
> >> requested. I'm not sure what it really buys us, though: sure, we can
> >> expand the shared buffer area to some degree, but
> >
> > Thinking of it, it can be put as follows. Postgresql needs shared memory
> > between all the backends.
> >
> > If the parent postmaster mmaps anonymous memory segments and shares them
> > with children, postgresql wouldn't be dependent upon any kernel resourse
> > aka shared memory anymore.
>
> And how does a newly mmap'ed segment propagate into a running backend?

It wouldn't. Just like we allocate fixed amount of shared memory at startup
now, we would do same for mmaped segments. Allocate maximum configured on
startup. But it won't be into kernel space as much shared memory segment
would be.

Anyway we wouldn't be mmaping one segment per page. That might be just too
much mmapping. We could just mmap entire configured are and go ahead.

I like the possibility of isolating shared buffers per database in this
approach. I don't know how much useful it would be in practice..

Shridhar


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-11 16:17:59
Message-ID: 3FB10BB7.3090203@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shridhar Daithankar wrote:

> On Tuesday 11 November 2003 18:55, Jan Wieck wrote:

>> And how does a newly mmap'ed segment propagate into a running backend?
>
> It wouldn't. Just like we allocate fixed amount of shared memory at startup
> now, we would do same for mmaped segments. Allocate maximum configured on
> startup. But it won't be into kernel space as much shared memory segment
> would be.

I don't understand that, can you explain this like you would to a child?

I want to configure my postmaster for a maximum of 256MB shared memory
(or 32768 pages), but I want to start it using 128MB (16384 pages) only.
Now while some backends that inherited the 128MB are running, I want to
increase the shared memory to 256MB, run some job and shrink it back to
128MB. How do the backends that inherited 128MB access a buffer in the
other 128MB if they happen to get a cache hit? How does that all work
with anon mmap segments?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-11 17:20:47
Message-ID: 87ekwekej4.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com> writes:

> If the parent postmaster mmaps anonymous memory segments and shares them with
> children, postgresql wouldn't be dependent upon any kernel resourse aka
> shared memory anymore.

Anonymous memory mappings aren't shared, at least not unless you're talking
about creating posix threads. That's just not how you create shared mappings
using mmap.

There is a way to create shared mappings using mmap, but it's exactly what you
say you don't want to do -- you use file mappings.

Using mmap postgres could allocate as much shared memory as it needs whenever
it needs it. You create a file the size of the mapping you want, you mmap it
with MAP_SHARED, then you arrange to have any other backends that want access
to it to mmap it as well.

I'm not sure why you say you don't want to map files. If you're afraid it will
cause lots of i/o as the system tries to flush these writes, well, in theory
that's up to the kernel to avoid. On systems where the kernel does poorly at
this there are tools like MAP_LOCK/mlock/shmfs that might trick it into doing
a better job.

Actually I've been wondering how hard it would be to avoid this whole
double-buffering issue and having postgres mmap the buffers it wants from the
data files. That would avoid the double-buffering entirely including the extra
copy and memory use. But it would be a major change to a lot of core stuff.
And it be tricky to ensure WAL buffers are written before data blocks.

--
greg


From: Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Experimental patch for inter-page delay in VACUUM
Date: 2003-11-12 05:17:45
Message-ID: 3FB1C279.3010109@myrealbox.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:

> Shridhar Daithankar <shridhar_daithankar(at)myrealbox(dot)com> writes:
>
>
>>If the parent postmaster mmaps anonymous memory segments and shares them with
>>children, postgresql wouldn't be dependent upon any kernel resourse aka
>>shared memory anymore.
>
>
> Anonymous memory mappings aren't shared, at least not unless you're talking
> about creating posix threads. That's just not how you create shared mappings
> using mmap.
>
> There is a way to create shared mappings using mmap, but it's exactly what you
> say you don't want to do -- you use file mappings.
>
> Using mmap postgres could allocate as much shared memory as it needs whenever
> it needs it. You create a file the size of the mapping you want, you mmap it
> with MAP_SHARED, then you arrange to have any other backends that want access
> to it to mmap it as well.

Yes. It occurred to me in the morning. For sure, a good night sleep helps..
>
> I'm not sure why you say you don't want to map files. If you're afraid it will
> cause lots of i/o as the system tries to flush these writes, well, in theory
> that's up to the kernel to avoid. On systems where the kernel does poorly at
> this there are tools like MAP_LOCK/mlock/shmfs that might trick it into doing
> a better job.

I didn't have any file in my first post because I saw it as unnecessary. However
my guess is IO caused by such file would not be much. How muh shared bufffers
postgresql would be using anyways? 100MB? 200MB?

On the bright side, system will automatically sync the shared buffers
periodically. It is like taking snapshot of shaerd buffers. Could be good at
debugging.

If the IO caused by such a shared memory image is really an issue for somebody,
they can just map the file on a ramdrive.

Actaully I would say that would be a good default approach. Use mmaped file over
RAM drive as shared buffers. Just wondering if it can be done programmatically.

> Actually I've been wondering how hard it would be to avoid this whole
> double-buffering issue and having postgres mmap the buffers it wants from the
> data files. That would avoid the double-buffering entirely including the extra
> copy and memory use. But it would be a major change to a lot of core stuff.
> And it be tricky to ensure WAL buffers are written before data blocks.

Yes. I understand mmap is not adequete for WAL and other transaction syncing
requirement.

Bye
Shridhar