From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Seq scans status update
Date: 2007-05-17 16:32:05
Message-ID: 464C8385.7030409@enterprisedb.com
Lists: pgsql-patches

Attached is a new version of Simon's "scan-resistant buffer manager"
patch. It's not ready for committing yet because of a small issue I
found this morning (* see bottom), but here's a status update.

To recap, the basic idea is to use a small ring of buffers for large
scans like VACUUM, COPY and seq-scans. Changes to the original patch:

- differently sized rings are used for VACUUM, seq scans, and COPY.
VACUUM and seq scans use a ring of 32 buffers, and COPY uses a ring of
4096 buffers in the default configuration. See the README changes in the
patch for the rationale.

- for queries with large seqscans, the buffer ring is only used for
reads issued by the seq scan itself, not for any other reads in the
query. A typical scenario where this matters is a large seq scan with a
nested loop join to a smaller table: you don't want to use the buffer
ring for the index lookups inside the nested loop.

- for seqscans, drop buffers from the ring that would need a WAL flush
to reuse. That makes bulk updates behave roughly as they do without the
patch, instead of having to do a WAL flush every 32 pages.

I've spent a lot of time thinking of solutions to the last point. The
obvious solution would be to not use the buffer ring for updating scans.
The difficulty with that is that we don't know if a scan is read-only in
heapam.c, where the hint to use the buffer ring is set.
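
To illustrate how the hint is applied, here is a condensed sketch of the
heapam.c changes from the attached patch (AP_BULKREAD, SetAccessPattern
and rs_accesspattern are the names used there). The ring is only in
effect for the page read issued by the seq scan itself:

/* At scan start: a table smaller than shared_buffers is treated as
 * normal random access; only bigger scans use the bulk-read ring. */
if (scan->rs_nblocks > NBuffers)
    scan->rs_accesspattern = AP_BULKREAD;
else
    scan->rs_accesspattern = AP_NORMAL;

/* Around each page fetch: switch to the ring for this read only, then
 * switch back so that other reads in the same query (e.g. index
 * lookups in a nested loop) use the normal replacement strategy. */
SetAccessPattern(scan->rs_accesspattern);
scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf, scan->rs_rd, page);
scan->rs_cblock = page;
SetAccessPattern(AP_NORMAL);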

I've completed a set of performance tests on a test server. The server
has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Results for a 10 GB table:

head-copy-bigtable | 00:10:09.07016
head-copy-bigtable | 00:10:20.507357
head-copy-bigtable | 00:10:21.857677
head-copy_nowal-bigtable | 00:05:18.232956
head-copy_nowal-bigtable | 00:03:24.109047
head-copy_nowal-bigtable | 00:05:31.019643
head-select-bigtable | 00:03:47.102731
head-select-bigtable | 00:01:08.314719
head-select-bigtable | 00:01:08.238509
head-select-bigtable | 00:01:08.208563
head-select-bigtable | 00:01:08.28347
head-select-bigtable | 00:01:08.308671
head-vacuum_clean-bigtable | 00:01:04.227832
head-vacuum_clean-bigtable | 00:01:04.232258
head-vacuum_clean-bigtable | 00:01:04.294621
head-vacuum_clean-bigtable | 00:01:04.280677
head-vacuum_hintbits-bigtable | 00:04:01.123924
head-vacuum_hintbits-bigtable | 00:03:58.253175
head-vacuum_hintbits-bigtable | 00:04:26.318159
head-vacuum_hintbits-bigtable | 00:04:37.512965
patched-copy-bigtable | 00:09:52.776754
patched-copy-bigtable | 00:10:18.185826
patched-copy-bigtable | 00:10:16.975482
patched-copy_nowal-bigtable | 00:03:14.882366
patched-copy_nowal-bigtable | 00:04:01.04648
patched-copy_nowal-bigtable | 00:03:56.062272
patched-select-bigtable | 00:03:47.704154
patched-select-bigtable | 00:01:08.460326
patched-select-bigtable | 00:01:10.441544
patched-select-bigtable | 00:01:11.916221
patched-select-bigtable | 00:01:13.848038
patched-select-bigtable | 00:01:10.956133
patched-vacuum_clean-bigtable | 00:01:10.315439
patched-vacuum_clean-bigtable | 00:01:12.210537
patched-vacuum_clean-bigtable | 00:01:15.202114
patched-vacuum_clean-bigtable | 00:01:10.712235
patched-vacuum_hintbits-bigtable | 00:03:42.279201
patched-vacuum_hintbits-bigtable | 00:04:02.057778
patched-vacuum_hintbits-bigtable | 00:04:26.805822
patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect, as expected. The
select times did go up by a couple of seconds, which I didn't expect,
though. One theory is that unused shared_buffers are swapped out during
the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and
try again at some point.

Results for a 2 GB table:

copy-medsize-unpatched | 00:02:18.23246
copy-medsize-unpatched | 00:02:22.347194
copy-medsize-unpatched | 00:02:23.875874
copy_nowal-medsize-unpatched | 00:01:27.606334
copy_nowal-medsize-unpatched | 00:01:17.491243
copy_nowal-medsize-unpatched | 00:01:31.902719
select-medsize-unpatched | 00:00:03.786031
select-medsize-unpatched | 00:00:02.678069
select-medsize-unpatched | 00:00:02.666103
select-medsize-unpatched | 00:00:02.673494
select-medsize-unpatched | 00:00:02.669645
select-medsize-unpatched | 00:00:02.666278
vacuum_clean-medsize-unpatched | 00:00:01.091356
vacuum_clean-medsize-unpatched | 00:00:01.923138
vacuum_clean-medsize-unpatched | 00:00:01.917213
vacuum_clean-medsize-unpatched | 00:00:01.917333
vacuum_hintbits-medsize-unpatched | 00:00:01.683718
vacuum_hintbits-medsize-unpatched | 00:00:01.864003
vacuum_hintbits-medsize-unpatched | 00:00:03.186596
vacuum_hintbits-medsize-unpatched | 00:00:02.16494
copy-medsize-patched | 00:02:35.113501
copy-medsize-patched | 00:02:25.269866
copy-medsize-patched | 00:02:31.881089
copy_nowal-medsize-patched | 00:01:00.254633
copy_nowal-medsize-patched | 00:01:04.630687
copy_nowal-medsize-patched | 00:01:03.729128
select-medsize-patched | 00:00:03.201837
select-medsize-patched | 00:00:01.332975
select-medsize-patched | 00:00:01.33014
select-medsize-patched | 00:00:01.332392
select-medsize-patched | 00:00:01.333498
select-medsize-patched | 00:00:01.332692
vacuum_clean-medsize-patched | 00:00:01.140189
vacuum_clean-medsize-patched | 00:00:01.062762
vacuum_clean-medsize-patched | 00:00:01.062402
vacuum_clean-medsize-patched | 00:00:01.07113
vacuum_hintbits-medsize-patched | 00:00:17.865446
vacuum_hintbits-medsize-patched | 00:00:15.162064
vacuum_hintbits-medsize-patched | 00:00:01.704651
vacuum_hintbits-medsize-patched | 00:00:02.671651

This looks good to me, except for some glitch in the last
vacuum_hintbits tests. Selects and vacuums benefit significantly, as
does non-WAL-logged copy.

Not shown here, but I ran tests earlier with vacuum on a table that
actually had dead tuples to remove. In that test the patched version
really shone, reducing the runtime to roughly 1/6th. That was the
original motivation for this patch: not having to do a WAL flush on
every page in the second phase of vacuum.

Test script attached. To use it:

1. Edit testscript.sh. Change BIGTABLESIZE.
2. Start postmaster
3. Run script, giving test-label as argument. For example:
"./testscript.sh bigtable-patched"

Attached is also the patch I used for the tests.

I would appreciate it if people would download the patch and the script
and repeat the tests on different hardware. I'm particularly interested
in testing on a box with good I/O hardware where selects on unpatched
PostgreSQL are bottlenecked by CPU.

Barring any surprises, I'm going to fix the remaining issue and submit
a final patch, probably over the weekend.

(*) The issue with this patch is that if the buffer cache is completely
filled with dirty buffers that need a WAL flush to evict, the buffer
ring code will get into an infinite loop trying to find one that doesn't
need a WAL flush. Should be simple to fix.
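
One possible shape for that fix, as a sketch only: let StrategyRejectBuffer
refuse to reject any buffer that did not come from the ring, so a victim
freshly taken from the clock sweep is used even if it costs a WAL flush,
and the retry loop in BufferAlloc always terminates. The
current_was_in_ring flag below is a hypothetical name, not something in
the attached patch; it would be set by GetBufferFromRing and cleared when
the victim comes from the freelist or the clock sweep.

/* Sketch of a possible fix, not the final patch */
static bool current_was_in_ring = false;

bool
StrategyRejectBuffer(volatile BufferDesc *buf)
{
    /*
     * Never reject a buffer that wasn't handed out from the ring;
     * otherwise, if every buffer in the cache needs a WAL flush, we
     * would keep rejecting clock-sweep victims forever.
     */
    if (!current_was_in_ring || BufferRing[RingCurSlot] != buf->buf_id)
        return false;

    /* Drop the buffer from the ring and ask for another victim. */
    BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
    return true;
}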

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
testscript.sh application/x-shellscript 2.6 KB
sr-fixed-7.patch text/x-diff 29.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-17 17:03:23
Message-ID: 2852.1179421403@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I've completed a set of performance tests on a test server. The server
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Perhaps I'm misreading it, but these tests seem to show no improvement
worth spending any effort on --- some of the tests improve a bit but
many get worse, and that's for tests allegedly designed to highlight the
improvement; there's been no attempt to measure the distributed cost of
the additional logic in scenarios where it's not helpful. To say
nothing of the likelihood that it will be flat-out counterproductive
in yet other scenarios.

regression=# select id,sum(rt),count(*) from times group by id;
     id     |       sum       | count
------------+-----------------+-------
 10G        | 01:15:53.497114 |    20
 10Gpatched | 01:12:51.749906 |    20
 2G         | 00:11:54.343741 |    20
 2Gpatched  | 00:11:32.482733 |    20
(4 rows)

Should we not just reject the patch and move on to something more
useful?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-17 17:28:13
Message-ID: 464C90AD.5040803@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> I've completed a set of performance tests on a test server. The server
>> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
>
> Perhaps I'm misreading it, but these tests seem to show no improvement
> worth spending any effort on --- some of the tests improve a bit but
> many get worse, and that's for tests allegedly designed to highlight the
> improvement; there's been no attempt to measure the distributed cost of
> the additional logic in scenarios where it's not helpful. To say
> nothing of the likelihood that it will be flat-out counterproductive
> in yet other scenarios.

Yeah, possibly. I've been thinking hard about scenarios where this
would be counterproductive; bulk delete is one that the original patch
hurt, but I think I have all the interesting scenarios covered now.

> Should we not just reject the patch and move on to something more
> useful?

If Luke is right, the L2 cache effect that's visible in these in-memory
tests:

select-medsize-patched | 00:00:01.332975
select-medsize-patched | 00:00:01.33014
select-medsize-patched | 00:00:01.332392
select-medsize-patched | 00:00:01.333498
select-medsize-patched | 00:00:01.332692
select-medsize-unpatched | 00:00:02.678069
select-medsize-unpatched | 00:00:02.666103
select-medsize-unpatched | 00:00:02.673494
select-medsize-unpatched | 00:00:02.669645
select-medsize-unpatched | 00:00:02.666278

is also visible on larger scans that don't fit in cache, given bigger
I/O hardware, and this patch would increase the maximum I/O throughput
that we can handle on such hardware. I don't have such hardware
available; I hope someone else will try that.

In addition to that, this should reduce the cache-spoiling effect of
big scans and COPY. That's harder to quantify; I've been planning to
run a TPC-C test with a large SELECT COUNT(*) running in the background
to measure that, but the server and I have been busy with those other
tests.

In any case, I do want this for VACUUMs to fix the "WAL flush for
every dirty page" problem. Maybe we should indeed drop the other
aspects of the patch and move on; I'm getting tired of this as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-17 18:00:35
Message-ID: 4329.1179424835@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> In any case, I do want this for VACUUMs to fix the "WAL flush for every
> dirty page" problem. Maybe we should indeed drop the other aspects of
> the patch and move on, I'm getting tired of this as well.

Can we devise a small patch that fixes that issue without creating
uncertain impacts on other behavior?

The thing that has made me uncomfortable with this set of patches right
along is the presumption that it's a good idea to tweak cache behavior
to improve a small set of cases. We are using cache behavior (now
clock-sweep, formerly LRU) that is well tested and proven across a large
body of experience; my gut feeling is that deviating from that behavior
is likely to be a net loss when the big picture is considered. I
certainly have not seen any test results that try to prove that there's
no such generalized disadvantage to these patches.

One could argue that the real bug here is that VACUUM deviates from the
standard behavior by forcing immediate recycling of pages it releases,
and that getting rid of that wart is the correct answer rather than
piling more warts atop it --- especially warts that will change the
behavior for anything besides VACUUM.

regards, tom lane


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-18 05:38:01
Message-ID: C27289C9.30916%llonergan@greenplum.com
Lists: pgsql-patches

Hi Heikki,

On 5/17/07 10:28 AM, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:

> is also visible on larger scans that don't fit in cache with bigger I/O
> hardware, and this patch would increase the max. I/O throughput that we
> can handle on such hardware. I don't have such hardware available, I
> hope someone else will try that.

Yes, this is absolutely the case, in addition to the benefits of not
polluting the bufcache with seq scans (as discussed in detail previously).
We've adopted this (see CK's patch) with excellent benefits.

We can try your version on a machine with fast I/O and get back to you with
a comparison of this and CK's version.

- Luke


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: CK Tan <cktan(at)greenplum(dot)com>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, John Eshleman <jeshleman(at)greenplum(dot)com>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-18 08:33:00
Message-ID: 464D64BC.8050401@enterprisedb.com
Lists: pgsql-patches

CK Tan wrote:
> If it is convenient for you, could you run my patch against the same
> hardware and data to get some numbers on select for comparison? Although
> we don't address updates, copy, or inserts, we are definitely getting
> at least 20% improvement in scans here without poisoning the bufpool for
> large tables.

Sure, here you are:

copy-cktan-big | 00:10:08.843126
copy-cktan-big | 00:10:17.767606
copy-cktan-big | 00:09:37.797059
copy_nowal-cktan-big | 00:05:12.305025
copy_nowal-cktan-big | 00:05:10.518417
copy_nowal-cktan-big | 00:05:03.472939
select-cktan-big | 00:03:27.655982
select-cktan-big | 00:01:55.496408
select-cktan-big | 00:01:31.693856
select-cktan-big | 00:01:12.705268
select-cktan-big | 00:01:12.478247
select-cktan-big | 00:01:10.866484
vacuum_clean-cktan-big | 00:03:05.340875
vacuum_clean-cktan-big | 00:01:12.428317
vacuum_clean-cktan-big | 00:01:13.179957
vacuum_clean-cktan-big | 00:01:10.438888
vacuum_hintbits-cktan-big | 00:03:58.78208
vacuum_hintbits-cktan-big | 00:04:02.515778
vacuum_hintbits-cktan-big | 00:04:19.083402
vacuum_hintbits-cktan-big | 00:04:11.170831
copy-cktan-med | 00:02:19.413484
copy-cktan-med | 00:02:22.270497
copy-cktan-med | 00:02:22.297946
copy_nowal-cktan-med | 00:01:31.192601
copy_nowal-cktan-med | 00:01:17.736356
copy_nowal-cktan-med | 00:01:32.272778
select-cktan-med | 00:00:03.774974
select-cktan-med | 00:00:01.279276
select-cktan-med | 00:00:01.297703
select-cktan-med | 00:00:01.304129
select-cktan-med | 00:00:01.297048
select-cktan-med | 00:00:01.306073
vacuum_clean-cktan-med | 00:00:01.820733
vacuum_clean-cktan-med | 00:00:01.755684
vacuum_clean-cktan-med | 00:00:01.755659
vacuum_clean-cktan-med | 00:00:01.750972
vacuum_hintbits-cktan-med | 00:00:01.58744
vacuum_hintbits-cktan-med | 00:00:06.216038
vacuum_hintbits-cktan-med | 00:00:02.201789
vacuum_hintbits-cktan-med | 00:00:36.494427

Select performance looks the same as with Simon's/my patch.

That 20% gain must be hardware-dependent. My interpretation is that
the CPU savings from more efficient cache usage only matter when the
I/O bandwidth is high enough that the CPU becomes the bottleneck.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-18 09:07:04
Message-ID: 464D6CB8.9030200@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> In any case, I do want this for VACUUMs to fix the "WAL flush for every
>> dirty page" problem. Maybe we should indeed drop the other aspects of
>> the patch and move on, I'm getting tired of this as well.
>
> Can we devise a small patch that fixes that issue without creating
> uncertain impacts on other behavior?

Yeah certainly, that's my backup plan.

> The thing that has made me uncomfortable with this set of patches right
> along is the presumption that it's a good idea to tweak cache behavior
> to improve a small set of cases. We are using cache behavior (now
> clock-sweep, formerly LRU) that is well tested and proven across a large
> body of experience; my gut feeling is that deviating from that behavior
> is likely to be a net loss when the big picture is considered. I
> certainly have not seen any test results that try to prove that there's
> no such generalized disadvantage to these patches.

That was my initial reaction as well. However, people claimed that by
using a small number of buffers you can achieve higher raw throughput.
Avoiding the cache spoiling sounds plausible as well: if you're going
to do a seq scan of a table larger than shared_buffers, you know those
pages are not going to be needed in the near future; you're going to
replace them yourself with pages from the same scan.

> One could argue that the real bug here is that VACUUM deviates from the
> standard behavior by forcing immediate recycling of pages it releases,
> and that getting rid of that wart is the correct answer rather than
> piling more warts atop it --- especially warts that will change the
> behavior for anything besides VACUUM.

Maybe a better approach would be switching to an algorithm that's more
resistant to sequential scans by nature. That's definitely not going to
happen for 8.3, though.

It's also possible that whatever algorithm we choose doesn't make much
difference in practice, because in typical configurations the OS cache
is bigger and more significant anyway. It's also possible that there
are counterproductive interactions between our cache management and the
OS's.

In any case, I'd like to see more test results before we make a
decision. I'm running tests with DBT-2 and a seq scan running in the
background to see if the cache-spoiling effect shows up. I'm also trying
to get hold of some bigger hardware to run on. Running these tests takes
some calendar time, but the hard work has already been done. I'm going
to start reviewing Jeff's synchronized scans patch now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-20 13:50:09
Message-ID: 46505211.4020800@enterprisedb.com
Lists: pgsql-patches

Heikki Linnakangas wrote:
> In any case, I'd like to see more test results before we make a
> decision. I'm running tests with DBT-2 and a seq scan running in the
> background to see if the cache-spoiling effect shows up. I'm also trying
> to get hold of some bigger hardware to run on. Running these tests takes
> some calendar time, but the hard work has already been done. I'm going
> to start reviewing Jeff's synchronized scans patch now.

Here are the results of the DBT-2 tests:

http://community.enterprisedb.com/seqscan/imola/

In each of these tests, at the end of rampup a script is started that
issues a "SELECT COUNT(*) FROM stock" in a loop, with a 2 minute delay
between the end of the previous query and the start of the next one.
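
For reference, the background load is just such a loop; a minimal libpq
sketch of it could look like the following (the connection string and
the error handling are assumptions, not part of the actual test driver):

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Connect to the DBT-2 database; the dbname is an assumption. */
    PGconn *conn = PQconnectdb("dbname=dbt2");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (;;)
    {
        /* The big seq scan that competes with the DBT-2 workload */
        PGresult *res = PQexec(conn, "SELECT COUNT(*) FROM stock");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("stock count: %s\n", PQgetvalue(res, 0, 0));
        PQclear(res);

        sleep(120);     /* 2 minute delay before the next query */
    }

    PQfinish(conn);     /* not reached */
    return 0;
}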

The patch makes the seq scans go significantly faster. In the 1 hour
test period, the patched tests perform roughly 30-100% more selects
than the unpatched tests.

With 100 and 105 warehouses, it also significantly reduces the impact of
the seq scan on other queries; response times are lower with the patch.
With 120 warehouses the reduction in impact is not as clear, but when
you plot the response times it's still there (the plots on the
"response times charts" page are useless because they're overwhelmed by
the checkpoint spike).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Seq scans status update
Date: 2007-05-21 07:07:26
Message-ID: 4651452E.4060805@enterprisedb.com
Lists: pgsql-patches

I forgot to attach the program used to generate test data. Here it is.

Heikki Linnakangas wrote:
> Attached is a new version of Simon's "scan-resistant buffer manager"
> patch. It's not ready for committing yet because of a small issue I
> found this morning (* see bottom), but here's a status update.
>
> To recap, the basic idea is to use a small ring of buffers for large
> scans like VACUUM, COPY and seq-scans. Changes to the original patch:
>
> - differently sized rings are used for VACUUM, seq scans, and COPY.
> VACUUM and seq scans use a ring of 32 buffers, and COPY uses a ring of
> 4096 buffers in the default configuration. See the README changes in
> the patch for the rationale.
>
> - for queries with large seqscans, the buffer ring is only used for
> reads issued by the seq scan, not for any other reads in the query.
> Typical scenario where this matters is doing a large seq scan with a
> nested loop join to a smaller table. You don't want to use the buffer
> ring for index lookups inside the nested loop.
>
> - for seqscans, drop buffers from the ring that would need a WAL flush
> to reuse. That makes bulk updates to behave roughly like they do without
> the patch, instead of having to do a WAL flush every 32 pages.
>
> I've spent a lot of time thinking of solutions to the last point. The
> obvious solution would be to not use the buffer ring for updating scans.
> The difficulty with that is that we don't know if a scan is read-only in
> heapam.c, where the hint to use the buffer ring is set.
>
> I've completed a set of performance tests on a test server. The server
> has 4 GB of RAM, of which 1 GB is used for shared_buffers.
>
> Results for a 10 GB table:
>
> head-copy-bigtable | 00:10:09.07016
> head-copy-bigtable | 00:10:20.507357
> head-copy-bigtable | 00:10:21.857677
> head-copy_nowal-bigtable | 00:05:18.232956
> head-copy_nowal-bigtable | 00:03:24.109047
> head-copy_nowal-bigtable | 00:05:31.019643
> head-select-bigtable | 00:03:47.102731
> head-select-bigtable | 00:01:08.314719
> head-select-bigtable | 00:01:08.238509
> head-select-bigtable | 00:01:08.208563
> head-select-bigtable | 00:01:08.28347
> head-select-bigtable | 00:01:08.308671
> head-vacuum_clean-bigtable | 00:01:04.227832
> head-vacuum_clean-bigtable | 00:01:04.232258
> head-vacuum_clean-bigtable | 00:01:04.294621
> head-vacuum_clean-bigtable | 00:01:04.280677
> head-vacuum_hintbits-bigtable | 00:04:01.123924
> head-vacuum_hintbits-bigtable | 00:03:58.253175
> head-vacuum_hintbits-bigtable | 00:04:26.318159
> head-vacuum_hintbits-bigtable | 00:04:37.512965
> patched-copy-bigtable | 00:09:52.776754
> patched-copy-bigtable | 00:10:18.185826
> patched-copy-bigtable | 00:10:16.975482
> patched-copy_nowal-bigtable | 00:03:14.882366
> patched-copy_nowal-bigtable | 00:04:01.04648
> patched-copy_nowal-bigtable | 00:03:56.062272
> patched-select-bigtable | 00:03:47.704154
> patched-select-bigtable | 00:01:08.460326
> patched-select-bigtable | 00:01:10.441544
> patched-select-bigtable | 00:01:11.916221
> patched-select-bigtable | 00:01:13.848038
> patched-select-bigtable | 00:01:10.956133
> patched-vacuum_clean-bigtable | 00:01:10.315439
> patched-vacuum_clean-bigtable | 00:01:12.210537
> patched-vacuum_clean-bigtable | 00:01:15.202114
> patched-vacuum_clean-bigtable | 00:01:10.712235
> patched-vacuum_hintbits-bigtable | 00:03:42.279201
> patched-vacuum_hintbits-bigtable | 00:04:02.057778
> patched-vacuum_hintbits-bigtable | 00:04:26.805822
> patched-vacuum_hintbits-bigtable | 00:04:28.911184
>
> In other words, the patch has no significant effect, as expected. The
> select times did go up by a couple of seconds, which I didn't expect,
> though. One theory is that unused shared_buffers are swapped out during
> the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and
> try again at some point.
>
> Results for a 2 GB table:
>
> copy-medsize-unpatched | 00:02:18.23246
> copy-medsize-unpatched | 00:02:22.347194
> copy-medsize-unpatched | 00:02:23.875874
> copy_nowal-medsize-unpatched | 00:01:27.606334
> copy_nowal-medsize-unpatched | 00:01:17.491243
> copy_nowal-medsize-unpatched | 00:01:31.902719
> select-medsize-unpatched | 00:00:03.786031
> select-medsize-unpatched | 00:00:02.678069
> select-medsize-unpatched | 00:00:02.666103
> select-medsize-unpatched | 00:00:02.673494
> select-medsize-unpatched | 00:00:02.669645
> select-medsize-unpatched | 00:00:02.666278
> vacuum_clean-medsize-unpatched | 00:00:01.091356
> vacuum_clean-medsize-unpatched | 00:00:01.923138
> vacuum_clean-medsize-unpatched | 00:00:01.917213
> vacuum_clean-medsize-unpatched | 00:00:01.917333
> vacuum_hintbits-medsize-unpatched | 00:00:01.683718
> vacuum_hintbits-medsize-unpatched | 00:00:01.864003
> vacuum_hintbits-medsize-unpatched | 00:00:03.186596
> vacuum_hintbits-medsize-unpatched | 00:00:02.16494
> copy-medsize-patched | 00:02:35.113501
> copy-medsize-patched | 00:02:25.269866
> copy-medsize-patched | 00:02:31.881089
> copy_nowal-medsize-patched | 00:01:00.254633
> copy_nowal-medsize-patched | 00:01:04.630687
> copy_nowal-medsize-patched | 00:01:03.729128
> select-medsize-patched | 00:00:03.201837
> select-medsize-patched | 00:00:01.332975
> select-medsize-patched | 00:00:01.33014
> select-medsize-patched | 00:00:01.332392
> select-medsize-patched | 00:00:01.333498
> select-medsize-patched | 00:00:01.332692
> vacuum_clean-medsize-patched | 00:00:01.140189
> vacuum_clean-medsize-patched | 00:00:01.062762
> vacuum_clean-medsize-patched | 00:00:01.062402
> vacuum_clean-medsize-patched | 00:00:01.07113
> vacuum_hintbits-medsize-patched | 00:00:17.865446
> vacuum_hintbits-medsize-patched | 00:00:15.162064
> vacuum_hintbits-medsize-patched | 00:00:01.704651
> vacuum_hintbits-medsize-patched | 00:00:02.671651
>
> This looks good to me, except for some glitch at the last
> vacuum_hintbits tests. Selects and vacuums benefit significantly, as
> does non-WAL-logged copy.
>
> Not shown here, but I run tests earlier with vacuum on a table that
> actually had dead tuples to be removed on it. In that test the patched
> version really shined, reducing the runtime to ~ 1/6th. That was the
> original motivation of this patch: not having to do a WAL flush on every
> page in the 2nd phase of vacuum.
>
> Test script attached. To use it:
>
> 1. Edit testscript.sh. Change BIGTABLESIZE.
> 2. Start postmaster
> 3. Run script, giving test-label as argument. For example:
> "./testscript.sh bigtable-patched"
>
> Attached is also the patch I used for the tests.
>
> I would appreciate it if people would download the patch and the script
> and repeat the tests on different hardware. I'm particularly interested
> in testing on a box with good I/O hardware where selects on unpatched
> PostgreSQL are bottlenecked by CPU.
>
> Barring any surprises I'm going to fix the remaining issue and submit a
> final patch, probably in the weekend.
>
> (*) The issue with this patch is that if the buffer cache is completely
> filled with dirty buffers that need a WAL flush to evict, the buffer
> ring code will get into an infinite loop trying to find one that doesn't
> need a WAL flush. Should be simple to fix.
>
>
> ------------------------------------------------------------------------
>
> Index: src/backend/access/heap/heapam.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
> retrieving revision 1.232
> diff -c -r1.232 heapam.c
> *** src/backend/access/heap/heapam.c 8 Apr 2007 01:26:27 -0000 1.232
> --- src/backend/access/heap/heapam.c 16 May 2007 11:35:14 -0000
> ***************
> *** 83,88 ****
> --- 83,96 ----
> */
> scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>
> + /* A scan on a table smaller than shared_buffers is treated like random
> + * access, but bigger scans should use the bulk read replacement policy.
> + */
> + if (scan->rs_nblocks > NBuffers)
> + scan->rs_accesspattern = AP_BULKREAD;
> + else
> + scan->rs_accesspattern = AP_NORMAL;
> +
> scan->rs_inited = false;
> scan->rs_ctup.t_data = NULL;
> ItemPointerSetInvalid(&scan->rs_ctup.t_self);
> ***************
> *** 123,133 ****
> --- 131,146 ----
>
> Assert(page < scan->rs_nblocks);
>
> + /* Read the page with the right strategy */
> + SetAccessPattern(scan->rs_accesspattern);
> +
> scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
> scan->rs_rd,
> page);
> scan->rs_cblock = page;
>
> + SetAccessPattern(AP_NORMAL);
> +
> if (!scan->rs_pageatatime)
> return;
>
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.268
> diff -c -r1.268 xlog.c
> *** src/backend/access/transam/xlog.c 30 Apr 2007 21:01:52 -0000 1.268
> --- src/backend/access/transam/xlog.c 15 May 2007 16:23:30 -0000
> ***************
> *** 1668,1673 ****
> --- 1668,1700 ----
> }
>
> /*
> + * Returns true if 'record' hasn't been flushed to disk yet.
> + */
> + bool
> + XLogNeedsFlush(XLogRecPtr record)
> + {
> + /* Quick exit if already known flushed */
> + if (XLByteLE(record, LogwrtResult.Flush))
> + return false;
> +
> + /* read LogwrtResult and update local state */
> + {
> + /* use volatile pointer to prevent code rearrangement */
> + volatile XLogCtlData *xlogctl = XLogCtl;
> +
> + SpinLockAcquire(&xlogctl->info_lck);
> + LogwrtResult = xlogctl->LogwrtResult;
> + SpinLockRelease(&xlogctl->info_lck);
> + }
> +
> + /* check again */
> + if (XLByteLE(record, LogwrtResult.Flush))
> + return false;
> +
> + return true;
> + }
> +
> + /*
> * Ensure that all XLOG data through the given position is flushed to disk.
> *
> * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
> Index: src/backend/commands/copy.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
> retrieving revision 1.283
> diff -c -r1.283 copy.c
> *** src/backend/commands/copy.c 27 Apr 2007 22:05:46 -0000 1.283
> --- src/backend/commands/copy.c 15 May 2007 17:05:29 -0000
> ***************
> *** 1876,1881 ****
> --- 1876,1888 ----
> nfields = file_has_oids ? (attr_count + 1) : attr_count;
> field_strings = (char **) palloc(nfields * sizeof(char *));
>
> + /* Use the special COPY buffer replacement strategy if WAL-logging
> + * is enabled. If it's not, the pages we're writing are dirty but
> + * don't need a WAL flush to write out, so the BULKREAD strategy
> + * is more suitable.
> + */
> + SetAccessPattern(use_wal ? AP_COPY : AP_BULKREAD);
> +
> /* Initialize state variables */
> cstate->fe_eof = false;
> cstate->eol_type = EOL_UNKNOWN;
> ***************
> *** 2161,2166 ****
> --- 2168,2176 ----
> cstate->filename)));
> }
>
> + /* Reset buffer replacement strategy */
> + SetAccessPattern(AP_NORMAL);
> +
> /*
> * If we skipped writing WAL, then we need to sync the heap (but not
> * indexes since those use WAL anyway)
> Index: src/backend/commands/vacuum.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/vacuum.c,v
> retrieving revision 1.350
> diff -c -r1.350 vacuum.c
> *** src/backend/commands/vacuum.c 16 Apr 2007 18:29:50 -0000 1.350
> --- src/backend/commands/vacuum.c 15 May 2007 17:06:18 -0000
> ***************
> *** 421,431 ****
> * Tell the buffer replacement strategy that vacuum is causing
> * the IO
> */
> ! StrategyHintVacuum(true);
>
> analyze_rel(relid, vacstmt);
>
> ! StrategyHintVacuum(false);
>
> if (use_own_xacts)
> CommitTransactionCommand();
> --- 421,431 ----
> * Tell the buffer replacement strategy that vacuum is causing
> * the IO
> */
> ! SetAccessPattern(AP_VACUUM);
>
> analyze_rel(relid, vacstmt);
>
> ! SetAccessPattern(AP_NORMAL);
>
> if (use_own_xacts)
> CommitTransactionCommand();
> ***************
> *** 442,448 ****
> /* Make sure cost accounting is turned off after error */
> VacuumCostActive = false;
> /* And reset buffer replacement strategy, too */
> ! StrategyHintVacuum(false);
> PG_RE_THROW();
> }
> PG_END_TRY();
> --- 442,448 ----
> /* Make sure cost accounting is turned off after error */
> VacuumCostActive = false;
> /* And reset buffer replacement strategy, too */
> ! SetAccessPattern(AP_NORMAL);
> PG_RE_THROW();
> }
> PG_END_TRY();
> ***************
> *** 1088,1094 ****
> * Tell the cache replacement strategy that vacuum is causing all
> * following IO
> */
> ! StrategyHintVacuum(true);
>
> /*
> * Do the actual work --- either FULL or "lazy" vacuum
> --- 1088,1094 ----
> * Tell the cache replacement strategy that vacuum is causing all
> * following IO
> */
> ! SetAccessPattern(AP_VACUUM);
>
> /*
> * Do the actual work --- either FULL or "lazy" vacuum
> ***************
> *** 1098,1104 ****
> else
> lazy_vacuum_rel(onerel, vacstmt);
>
> ! StrategyHintVacuum(false);
>
> /* all done with this class, but hold lock until commit */
> relation_close(onerel, NoLock);
> --- 1098,1104 ----
> else
> lazy_vacuum_rel(onerel, vacstmt);
>
> ! SetAccessPattern(AP_NORMAL);
>
> /* all done with this class, but hold lock until commit */
> relation_close(onerel, NoLock);
> Index: src/backend/storage/buffer/README
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/README,v
> retrieving revision 1.11
> diff -c -r1.11 README
> *** src/backend/storage/buffer/README 23 Jul 2006 03:07:58 -0000 1.11
> --- src/backend/storage/buffer/README 16 May 2007 11:43:11 -0000
> ***************
> *** 152,159 ****
> a field to show which backend is doing its I/O).
>
>
> ! Buffer replacement strategy
> ! ---------------------------
>
> There is a "free list" of buffers that are prime candidates for replacement.
> In particular, buffers that are completely free (contain no valid page) are
> --- 152,159 ----
> a field to show which backend is doing its I/O).
>
>
> ! Normal buffer replacement strategy
> ! ----------------------------------
>
> There is a "free list" of buffers that are prime candidates for replacement.
> In particular, buffers that are completely free (contain no valid page) are
> ***************
> *** 199,221 ****
> have to give up and try another buffer. This however is not a concern
> of the basic select-a-victim-buffer algorithm.)
>
> - A special provision is that while running VACUUM, a backend does not
> - increment the usage count on buffers it accesses. In fact, if ReleaseBuffer
> - sees that it is dropping the pin count to zero and the usage count is zero,
> - then it appends the buffer to the tail of the free list. (This implies that
> - VACUUM, but only VACUUM, must take the BufFreelistLock during ReleaseBuffer;
> - this shouldn't create much of a contention problem.) This provision
> - encourages VACUUM to work in a relatively small number of buffers rather
> - than blowing out the entire buffer cache. It is reasonable since a page
> - that has been touched only by VACUUM is unlikely to be needed again soon.
> -
> - Since VACUUM usually requests many pages very fast, the effect of this is that
> - it will get back the very buffers it filled and possibly modified on the next
> - call and will therefore do its work in a few shared memory buffers, while
> - being able to use whatever it finds in the cache already. This also implies
> - that most of the write traffic caused by a VACUUM will be done by the VACUUM
> - itself and not pushed off onto other processes.
>
>
> Background writer's processing
> ------------------------------
> --- 199,243 ----
> have to give up and try another buffer. This however is not a concern
> of the basic select-a-victim-buffer algorithm.)
>
>
> + Buffer ring replacement strategy
> + ---------------------------------
> +
> + When running a query that needs to access a large number of pages, like VACUUM,
> + COPY, or a large sequential scan, a different strategy is used. A page that
> + has been touched only by such a scan is unlikely to be needed again soon, so
> + instead of running the normal clock sweep algorithm and blowing out the entire
> + buffer cache, a small ring of buffers is allocated using the normal clock sweep
> + algorithm and those buffers are reused for the whole scan. This also implies
> + that most of the write traffic caused by such a statement will be done by the
> + backend itself and not pushed off onto other processes.
> +
> + The size of the ring used depends on the kind of scan:
> +
> + For sequential scans, a small 256 KB ring is used. That's small enough to fit
> + in L2 cache, which makes transferring pages from OS cache to shared buffer
> + cache efficient. Even less would often be enough, but the ring must be big
> + enough to accommodate all pages in the scan that are pinned concurrently.
> + 256 KB should also be enough to leave a small cache trail for other backends to
> + join in a synchronized seq scan. If a buffer is dirtied and LSN set, the buffer
> + is removed from the ring and a replacement buffer is chosen using the normal
> + replacement strategy. In a scan that modifies every page in the scan, like a
> + bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and the
> + ring strategy effectively degrades to the normal strategy.
> +
> + VACUUM uses a 256 KB ring like sequential scans, but dirty pages are not
> + removed from the ring. WAL is flushed instead to allow reuse of the buffers.
> + Before introducing the buffer ring strategy in 8.3, buffers were put to the
> + freelist, which was effectively a buffer ring of 1 buffer.
> +
> + COPY behaves like VACUUM, but a much larger ring is used. The ring size is
> + chosen to be twice the WAL segment size. This avoids polluting the buffer cache
> + like the clock sweep would do, and using a ring larger than WAL segment size
> + avoids having to do any extra WAL flushes, since a WAL segment will always be
> + filled, forcing a WAL flush, before looping through the buffer ring and bumping
> + into a buffer that would force a WAL flush. However, for non-WAL-logged COPY
> + operations the smaller 256 KB ring is used because WAL flushes are not needed
> + to write the buffers.
>
> Background writer's processing
> ------------------------------
> Index: src/backend/storage/buffer/bufmgr.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.218
> diff -c -r1.218 bufmgr.c
> *** src/backend/storage/buffer/bufmgr.c 2 May 2007 23:34:48 -0000 1.218
> --- src/backend/storage/buffer/bufmgr.c 16 May 2007 12:34:10 -0000
> ***************
> *** 419,431 ****
> /* Loop here in case we have to try another victim buffer */
> for (;;)
> {
> /*
> * Select a victim buffer. The buffer is returned with its header
> * spinlock still held! Also the BufFreelistLock is still held, since
> * it would be bad to hold the spinlock while possibly waking up other
> * processes.
> */
> ! buf = StrategyGetBuffer();
>
> Assert(buf->refcount == 0);
>
> --- 419,433 ----
> /* Loop here in case we have to try another victim buffer */
> for (;;)
> {
> + bool lock_held;
> +
> /*
> * Select a victim buffer. The buffer is returned with its header
> * spinlock still held! Also the BufFreelistLock is still held, since
> * it would be bad to hold the spinlock while possibly waking up other
> * processes.
> */
> ! buf = StrategyGetBuffer(&lock_held);
>
> Assert(buf->refcount == 0);
>
> ***************
> *** 436,442 ****
> PinBuffer_Locked(buf);
>
> /* Now it's safe to release the freelist lock */
> ! LWLockRelease(BufFreelistLock);
>
> /*
> * If the buffer was dirty, try to write it out. There is a race
> --- 438,445 ----
> PinBuffer_Locked(buf);
>
> /* Now it's safe to release the freelist lock */
> ! if (lock_held)
> ! LWLockRelease(BufFreelistLock);
>
> /*
> * If the buffer was dirty, try to write it out. There is a race
> ***************
> *** 464,469 ****
> --- 467,489 ----
> */
> if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
> {
> + /* In BULKREAD-mode, check if a WAL flush would be needed to
> + * evict this buffer. If so, ask the replacement strategy if
> + * we should go ahead and do it or choose another victim.
> + */
> + if (active_access_pattern == AP_BULKREAD)
> + {
> + if (XLogNeedsFlush(BufferGetLSN(buf)))
> + {
> + if (StrategyRejectBuffer(buf))
> + {
> + LWLockRelease(buf->content_lock);
> + UnpinBuffer(buf, true, false);
> + continue;
> + }
> + }
> + }
> +
> FlushBuffer(buf, NULL);
> LWLockRelease(buf->content_lock);
> }
> ***************
> *** 925,932 ****
> PrivateRefCount[b]--;
> if (PrivateRefCount[b] == 0)
> {
> - bool immed_free_buffer = false;
> -
> /* I'd better not still hold any locks on the buffer */
> Assert(!LWLockHeldByMe(buf->content_lock));
> Assert(!LWLockHeldByMe(buf->io_in_progress_lock));
> --- 945,950 ----
> ***************
> *** 940,956 ****
> /* Update buffer usage info, unless this is an internal access */
> if (normalAccess)
> {
> ! if (!strategy_hint_vacuum)
> {
> ! if (buf->usage_count < BM_MAX_USAGE_COUNT)
> ! buf->usage_count++;
> }
> else
> ! {
> ! /* VACUUM accesses don't bump usage count, instead... */
> ! if (buf->refcount == 0 && buf->usage_count == 0)
> ! immed_free_buffer = true;
> ! }
> }
>
> if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> --- 958,975 ----
> /* Update buffer usage info, unless this is an internal access */
> if (normalAccess)
> {
> ! if (active_access_pattern != AP_NORMAL)
> {
> ! /* We don't want large one-off scans like vacuum to inflate
> ! * the usage_count. We do want to set it to 1, though, to keep
> ! * other backends from hijacking it from the buffer ring.
> ! */
> ! if (buf->usage_count == 0)
> ! buf->usage_count = 1;
> }
> else
> ! if (buf->usage_count < BM_MAX_USAGE_COUNT)
> ! buf->usage_count++;
> }
>
> if ((buf->flags & BM_PIN_COUNT_WAITER) &&
> ***************
> *** 965,978 ****
> }
> else
> UnlockBufHdr(buf);
> -
> - /*
> - * If VACUUM is releasing an otherwise-unused buffer, send it to the
> - * freelist for near-term reuse. We put it at the tail so that it
> - * won't be used before any invalid buffers that may exist.
> - */
> - if (immed_free_buffer)
> - StrategyFreeBuffer(buf, false);
> }
> }
>
> --- 984,989 ----
> Index: src/backend/storage/buffer/freelist.c
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.58
> diff -c -r1.58 freelist.c
> *** src/backend/storage/buffer/freelist.c 5 Jan 2007 22:19:37 -0000 1.58
> --- src/backend/storage/buffer/freelist.c 17 May 2007 16:12:56 -0000
> ***************
> *** 18,23 ****
> --- 18,25 ----
> #include "storage/buf_internals.h"
> #include "storage/bufmgr.h"
>
> + #include "utils/memutils.h"
> +
>
> /*
> * The shared freelist control information.
> ***************
> *** 39,47 ****
> /* Pointers to shared state */
> static BufferStrategyControl *StrategyControl = NULL;
>
> ! /* Backend-local state about whether currently vacuuming */
> ! bool strategy_hint_vacuum = false;
>
>
> /*
> * StrategyGetBuffer
> --- 41,53 ----
> /* Pointers to shared state */
> static BufferStrategyControl *StrategyControl = NULL;
>
> ! /* Currently active access pattern hint. */
> ! AccessPattern active_access_pattern = AP_NORMAL;
>
> + /* prototypes for internal functions */
> + static volatile BufferDesc *GetBufferFromRing(void);
> + static void PutBufferToRing(volatile BufferDesc *buf);
> + static void InitRing(void);
>
> /*
> * StrategyGetBuffer
> ***************
> *** 51,67 ****
> * the selected buffer must not currently be pinned by anyone.
> *
> * To ensure that no one else can pin the buffer before we do, we must
> ! * return the buffer with the buffer header spinlock still held. That
> ! * means that we return with the BufFreelistLock still held, as well;
> ! * the caller must release that lock once the spinlock is dropped.
> */
> volatile BufferDesc *
> ! StrategyGetBuffer(void)
> {
> volatile BufferDesc *buf;
> int trycounter;
>
> LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
>
> /*
> * Try to get a buffer from the freelist. Note that the freeNext fields
> --- 57,89 ----
> * the selected buffer must not currently be pinned by anyone.
> *
> * To ensure that no one else can pin the buffer before we do, we must
> ! * return the buffer with the buffer header spinlock still held. If
> ! * *lock_held is set at return, we return with the BufFreelistLock still
> ! * held, as well; the caller must release that lock once the spinlock is
> ! * dropped.
> */
> volatile BufferDesc *
> ! StrategyGetBuffer(bool *lock_held)
> {
> volatile BufferDesc *buf;
> int trycounter;
>
> + /* Get a buffer from the ring if we're doing a bulk scan */
> + if (active_access_pattern != AP_NORMAL)
> + {
> + buf = GetBufferFromRing();
> + if (buf != NULL)
> + {
> + *lock_held = false;
> + return buf;
> + }
> + }
> +
> + /*
> + * If our selected buffer wasn't available, pick another...
> + */
> LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> + *lock_held = true;
>
> /*
> * Try to get a buffer from the freelist. Note that the freeNext fields
> ***************
> *** 86,96 ****
> */
> LockBufHdr(buf);
> if (buf->refcount == 0 && buf->usage_count == 0)
> return buf;
> UnlockBufHdr(buf);
> }
>
> ! /* Nothing on the freelist, so run the "clock sweep" algorithm */
> trycounter = NBuffers;
> for (;;)
> {
> --- 108,122 ----
> */
> LockBufHdr(buf);
> if (buf->refcount == 0 && buf->usage_count == 0)
> + {
> + if (active_access_pattern != AP_NORMAL)
> + PutBufferToRing(buf);
> return buf;
> + }
> UnlockBufHdr(buf);
> }
>
> ! /* Nothing on the freelist, so run the shared "clock sweep" algorithm */
> trycounter = NBuffers;
> for (;;)
> {
> ***************
> *** 105,111 ****
> --- 131,141 ----
> */
> LockBufHdr(buf);
> if (buf->refcount == 0 && buf->usage_count == 0)
> + {
> + if (active_access_pattern != AP_NORMAL)
> + PutBufferToRing(buf);
> return buf;
> + }
> if (buf->usage_count > 0)
> {
> buf->usage_count--;
> ***************
> *** 191,204 ****
> }
>
> /*
> ! * StrategyHintVacuum -- tell us whether VACUUM is active
> */
> void
> ! StrategyHintVacuum(bool vacuum_active)
> {
> ! strategy_hint_vacuum = vacuum_active;
> ! }
>
>
> /*
> * StrategyShmemSize
> --- 221,245 ----
> }
>
> /*
> ! * SetAccessPattern -- Sets the active access pattern hint
> ! *
> ! * Caller is responsible for resetting the hint to AP_NORMAL after the bulk
> ! * operation is done. It's ok to switch repeatedly between AP_NORMAL and one of
> ! * the other strategies, for example in a query with one large sequential scan
> ! * nested loop joined to an index scan. Index tuples should be fetched with the
> ! * normal strategy and the pages from the seq scan should be read in with the
> ! * AP_BULKREAD strategy. The ring won't be affected by such switching, however
> ! * switching to an access pattern with different ring size will invalidate the
> ! * old ring.
> */
> void
> ! SetAccessPattern(AccessPattern new_pattern)
> {
> ! active_access_pattern = new_pattern;
>
> + if (active_access_pattern != AP_NORMAL)
> + InitRing();
> + }
>
> /*
> * StrategyShmemSize
> ***************
> *** 274,276 ****
> --- 315,498 ----
> else
> Assert(!init);
> }
> +
> + /* ----------------------------------------------------------------
> + * Backend-private buffer ring management
> + * ----------------------------------------------------------------
> + */
> +
> + /*
> + * Ring sizes for different access patterns. See README for the rationale
> + * of these.
> + */
> + #define BULKREAD_RING_SIZE 256 * 1024 / BLCKSZ
> + #define VACUUM_RING_SIZE 256 * 1024 / BLCKSZ
> + #define COPY_RING_SIZE Min(NBuffers / 8, (XLOG_SEG_SIZE / BLCKSZ) * 2)
> +
> + /*
> + * BufferRing is an array of buffer ids, and RingSize it's size in number of
> + * elements. It's allocated in TopMemoryContext the first time it's needed.
> + */
> + static int *BufferRing = NULL;
> + static int RingSize = 0;
> +
> + /* Index of the "current" slot in the ring. It's advanced every time a buffer
> + * is handed out from the ring with GetBufferFromRing and it points to the
> + * last buffer returned from the ring. RingCurSlot + 1 is the next victim
> + * GetBufferRing will hand out.
> + */
> + static int RingCurSlot = 0;
> +
> + /* magic value to mark empty slots in the ring */
> + #define BUF_ID_NOT_SET -1
> +
> +
> + /*
> + * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
> + * ring is empty.
> + *
> + * The bufhdr spin lock is held on the returned buffer.
> + */
> + static volatile BufferDesc *
> + GetBufferFromRing(void)
> + {
> + volatile BufferDesc *buf;
> +
> + /* ring should be initialized by now */
> + Assert(RingSize > 0 && BufferRing != NULL);
> +
> + /* Run private "clock cycle" */
> + if (++RingCurSlot >= RingSize)
> + RingCurSlot = 0;
> +
> + /*
> + * If that slot hasn't been filled yet, tell the caller to allocate
> + * a new buffer with the normal allocation strategy. He will then
> + * fill this slot by calling PutBufferToRing with the new buffer.
> + */
> + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> + return NULL;
> +
> + buf = &BufferDescriptors[BufferRing[RingCurSlot]];
> +
> + /*
> + * If the buffer is pinned we cannot use it under any circumstances.
> + * If usage_count == 0 then the buffer is fair game.
> + *
> + * We also choose this buffer if usage_count == 1. Strictly, this
> + * might sometimes be the wrong thing to do, but we rely on the high
> + * probability that it was this process that last touched the buffer.
> + * If it wasn't, we'll choose a suboptimal victim, but it shouldn't
> + * make any difference in the big scheme of things.
> + *
> + */
> + LockBufHdr(buf);
> + if (buf->refcount == 0 && buf->usage_count <= 1)
> + return buf;
> + UnlockBufHdr(buf);
> +
> + return NULL;
> + }
> +
> + /*
> + * PutBufferToRing -- adds a buffer to the buffer ring
> + *
> + * Caller must hold the buffer header spinlock on the buffer.
> + */
> + static void
> + PutBufferToRing(volatile BufferDesc *buf)
> + {
> + /* ring should be initialized by now */
> + Assert(RingSize > 0 && BufferRing != NULL);
> +
> + if (BufferRing[RingCurSlot] == BUF_ID_NOT_SET)
> + BufferRing[RingCurSlot] = buf->buf_id;
> + }
> +
> + /*
> + * Initializes a ring buffer with correct size for the currently
> + * active strategy. Does nothing if the ring already has the right size.
> + */
> + static void
> + InitRing(void)
> + {
> + int new_size;
> + int old_size = RingSize;
> + int i;
> + MemoryContext oldcxt;
> +
> + /* Determine new size */
> +
> + switch(active_access_pattern)
> + {
> + case AP_BULKREAD:
> + new_size = BULKREAD_RING_SIZE;
> + break;
> + case AP_COPY:
> + new_size = COPY_RING_SIZE;
> + break;
> + case AP_VACUUM:
> + new_size = VACUUM_RING_SIZE;
> + break;
> + default:
> + elog(ERROR, "unexpected buffer cache strategy %d",
> + active_access_pattern);
> + return; /* keep compile happy */
> + }
> +
> + /*
> + * Seq scans set and reset the strategy on every page, so we better exit
> + * quickly if no change in size is needed.
> + */
> + if (new_size == old_size)
> + return;
> +
> + /* Allocate array */
> +
> + oldcxt = MemoryContextSwitchTo(TopMemoryContext);
> +
> + if (old_size == 0)
> + {
> + Assert(BufferRing == NULL);
> + BufferRing = palloc(new_size * sizeof(int));
> + }
> + else
> + BufferRing = repalloc(BufferRing, new_size * sizeof(int));
> +
> + MemoryContextSwitchTo(oldcxt);
> +
> + for(i = 0; i < new_size; i++)
> + BufferRing[i] = BUF_ID_NOT_SET;
> +
> + RingCurSlot = 0;
> + RingSize = new_size;
> + }
> +
> + /*
> + * Buffer manager calls this function in AP_BULKREAD mode when the
> + * buffer handed to it turns out to need a WAL flush to write out. This
> + * gives the strategy a second chance to choose another victim.
> + *
> + * Returns true if buffer manager should ask for a new victim, and false
> + * if WAL should be flushed and this buffer used.
> + */
> + bool
> + StrategyRejectBuffer(volatile BufferDesc *buf)
> + {
> + Assert(RingSize > 0);
> +
> + if (BufferRing[RingCurSlot] == buf->buf_id)
> + {
> + BufferRing[RingCurSlot] = BUF_ID_NOT_SET;
> + return true;
> + }
> + else
> + {
> + /* Apparently the buffer didn't come from the ring. We don't want to
> + * mess with how the clock sweep works; in worst case there's no
> + * buffers in the buffer cache that can be reused without a WAL flush,
> + * and we'd get into an endless loop trying.
> + */
> + return false;
> + }
> + }
> Index: src/include/access/relscan.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/relscan.h,v
> retrieving revision 1.52
> diff -c -r1.52 relscan.h
> *** src/include/access/relscan.h 20 Jan 2007 18:43:35 -0000 1.52
> --- src/include/access/relscan.h 15 May 2007 17:01:31 -0000
> ***************
> *** 28,33 ****
> --- 28,34 ----
> ScanKey rs_key; /* array of scan key descriptors */
> BlockNumber rs_nblocks; /* number of blocks to scan */
> bool rs_pageatatime; /* verify visibility page-at-a-time? */
> + AccessPattern rs_accesspattern; /* access pattern to use for reads */
>
> /* scan current state */
> bool rs_inited; /* false = scan not init'd yet */
> Index: src/include/access/xlog.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
> retrieving revision 1.76
> diff -c -r1.76 xlog.h
> *** src/include/access/xlog.h 5 Jan 2007 22:19:51 -0000 1.76
> --- src/include/access/xlog.h 14 May 2007 21:22:40 -0000
> ***************
> *** 151,156 ****
> --- 151,157 ----
>
> extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
> extern void XLogFlush(XLogRecPtr RecPtr);
> + extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
>
> extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
> extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
> Index: src/include/storage/buf_internals.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/buf_internals.h,v
> retrieving revision 1.89
> diff -c -r1.89 buf_internals.h
> *** src/include/storage/buf_internals.h 5 Jan 2007 22:19:57 -0000 1.89
> --- src/include/storage/buf_internals.h 15 May 2007 17:07:59 -0000
> ***************
> *** 16,21 ****
> --- 16,22 ----
> #define BUFMGR_INTERNALS_H
>
> #include "storage/buf.h"
> + #include "storage/bufmgr.h"
> #include "storage/lwlock.h"
> #include "storage/shmem.h"
> #include "storage/spin.h"
> ***************
> *** 168,174 ****
> extern BufferDesc *LocalBufferDescriptors;
>
> /* in freelist.c */
> ! extern bool strategy_hint_vacuum;
>
> /* event counters in buf_init.c */
> extern long int ReadBufferCount;
> --- 169,175 ----
> extern BufferDesc *LocalBufferDescriptors;
>
> /* in freelist.c */
> ! extern AccessPattern active_access_pattern;
>
> /* event counters in buf_init.c */
> extern long int ReadBufferCount;
> ***************
> *** 184,195 ****
> */
>
> /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(void);
> extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
> extern int StrategySyncStart(void);
> extern Size StrategyShmemSize(void);
> extern void StrategyInitialize(bool init);
>
> /* buf_table.c */
> extern Size BufTableShmemSize(int size);
> extern void InitBufTable(int size);
> --- 185,198 ----
> */
>
> /* freelist.c */
> ! extern volatile BufferDesc *StrategyGetBuffer(bool *lock_held);
> extern void StrategyFreeBuffer(volatile BufferDesc *buf, bool at_head);
> extern int StrategySyncStart(void);
> extern Size StrategyShmemSize(void);
> extern void StrategyInitialize(bool init);
>
> + extern bool StrategyRejectBuffer(volatile BufferDesc *buf);
> +
> /* buf_table.c */
> extern Size BufTableShmemSize(int size);
> extern void InitBufTable(int size);
> Index: src/include/storage/bufmgr.h
> ===================================================================
> RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
> retrieving revision 1.103
> diff -c -r1.103 bufmgr.h
> *** src/include/storage/bufmgr.h 2 May 2007 23:18:03 -0000 1.103
> --- src/include/storage/bufmgr.h 15 May 2007 17:07:02 -0000
> ***************
> *** 48,53 ****
> --- 48,61 ----
> #define BUFFER_LOCK_SHARE 1
> #define BUFFER_LOCK_EXCLUSIVE 2
>
> + typedef enum AccessPattern
> + {
> + AP_NORMAL, /* Normal random access */
> + AP_BULKREAD, /* Large read-only scan (hint bit updates are ok) */
> + AP_COPY, /* Large updating scan, like COPY with WAL enabled */
> + AP_VACUUM, /* VACUUM */
> + } AccessPattern;
> +
> /*
> * These routines are beaten on quite heavily, hence the macroization.
> */
> ***************
> *** 157,162 ****
> extern void AtProcExit_LocalBuffers(void);
>
> /* in freelist.c */
> ! extern void StrategyHintVacuum(bool vacuum_active);
>
> #endif
> --- 165,170 ----
> extern void AtProcExit_LocalBuffers(void);
>
> /* in freelist.c */
> ! extern void SetAccessPattern(AccessPattern new_pattern);
>
> #endif

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
gendata.c text/x-csrc 406 bytes

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-22 14:30:26
Message-ID: 200705221430.l4MEUQS13806@momjian.us
Lists: pgsql-patches


Great. Based on this, do you have a patch that is ready to apply?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > In any case, I'd like to see more test results before we make a
> > decision. I'm running tests with DBT-2 and a seq scan running in the
> > background to see if the cache-spoiling effect shows up. I'm also trying
> > to get hold of some bigger hardware to run on. Running these tests takes
> > some calendar time, but the hard work has already been done. I'm going
> > to start reviewing Jeff's synchronized scans patch now.
>
> Here are the results of the DBT-2 tests:
>
> http://community.enterprisedb.com/seqscan/imola/
>
> In each of these tests, at the end of rampup a script is started that
> issues a "SELECT COUNT(*) FROM stock" in a loop, with 2 minute delay
> between end of previous query and start of next one.
>
> The patch makes the seq scans go significantly faster. In the 1 hour
> test period, the patched tests perform roughly 30-100% more selects
> than unpatched tests.
>
> With 100 and 105 warehouses, it also significantly reduces the impact of
> the seq scan on other queries; response times are lower with the patch.
> With 120 warehouses the reduction of impact is not as clear, but when
> you plot the response times it's still there (the plots on the "response
> times charts"-page are useless because they're overwhelmed by the
> checkpoint spike).
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-22 15:06:11
Message-ID: 465306E3.60902@enterprisedb.com
Lists: pgsql-patches

Not yet; there's still one issue that needs fixing.

Bruce Momjian wrote:
> Great. Based on this, do you have a patch that is ready to apply.
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> In any case, I'd like to see more test results before we make a
>>> decision. I'm running tests with DBT-2 and a seq scan running in the
>>> background to see if the cache-spoiling effect shows up. I'm also trying
>>> to get hold of some bigger hardware to run on. Running these tests takes
>>> some calendar time, but the hard work has already been done. I'm going
>>> to start reviewing Jeff's synchronized scans patch now.
>> Here are the results of the DBT-2 tests:
>>
>> http://community.enterprisedb.com/seqscan/imola/
>>
>> In each of these tests, at the end of rampup a script is started that
>> issues a "SELECT COUNT(*) FROM stock" in a loop, with 2 minute delay
>> between end of previous query and start of next one.
>>
>> The patch makes the seq scans go significantly faster. In the 1 hour
>> test period, the patched tests perform roughly 30-100% more selects
>> than unpatched tests.
>>
>> With 100 and 105 warehouses, it also significantly reduces the impact of
>> the seq scan on other queries; response times are lower with the patch.
>> With 120 warehouses the reduction of impact is not as clear, but when
>> you plot the response times it's still there (the plots on the "response
>> times charts"-page are useless because they're overwhelmed by the
>> checkpoint spike).
>>
>> --
>> Heikki Linnakangas
>> EnterpriseDB http://www.enterprisedb.com
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 5: don't forget to increase your free space map settings
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-25 19:30:35
Message-ID: 4657395B.9000706@enterprisedb.com
Lists: pgsql-patches

Here's a new version, all known issues are now fixed. I'm now happy with
this patch.

Next, I'll start looking at the latest version of Jeff's synchronized
scans patch.

Bruce Momjian wrote:
> Great. Based on this, do you have a patch that is ready to apply.
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> In any case, I'd like to see more test results before we make a
>>> decision. I'm running tests with DBT-2 and a seq scan running in the
>>> background to see if the cache-spoiling effect shows up. I'm also trying
>>> to get hold of some bigger hardware to run on. Running these tests takes
>>> some calendar time, but the hard work has already been done. I'm going
>>> to start reviewing Jeff's synchronized scans patch now.
>> Here are the results of the DBT-2 tests:
>>
>> http://community.enterprisedb.com/seqscan/imola/
>>
>> In each of these tests, at the end of rampup a script is started that
>> issues a "SELECT COUNT(*) FROM stock" in a loop, with 2 minute delay
>> between end of previous query and start of next one.
>>
>> The patch makes the seq scans go significantly faster. In the 1 hour
>> test period, the patched tests perform roughly 30-100% more selects
>> than unpatched tests.
>>
>> With 100 and 105 warehouses, it also significantly reduces the impact of
>> the seq scan on other queries; response times are lower with the patch.
>> With 120 warehouses the reduction of impact is not as clear, but when
>> you plot the response times it's still there (the plots on the "response
>> times charts"-page are useless because they're overwhelmed by the
>> checkpoint spike).
>>
>> --
>> Heikki Linnakangas
>> EnterpriseDB http://www.enterprisedb.com
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 5: don't forget to increase your free space map settings
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
sr-fixed-8.patch text/x-diff 29.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-25 20:36:59
Message-ID: 21529.1180125419@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Here's a new version, all known issues are now fixed. I'm now happy with
> this patch.
> Next, I'll start looking at the latest version of Jeff's synchronized
> scans patch.

I'm a bit confused --- weren't you intending to review these in parallel
because of the possible interactions? Do you think this should be
applied now, or does it need to wait to see what happens with Jeff's
patch?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-26 10:28:33
Message-ID: 46580BD1.10505@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Here's a new version, all known issues are now fixed. I'm now happy with
>> this patch.
>> Next, I'll start looking at the latest version of Jeff's synchronized
>> scans patch.
>
> I'm a bit confused --- weren't you intending to review these in parallel
> because of the possible interactions? Do you think this should be
> applied now, or does it need to wait to see what happens with Jeff's
> patch?

I think it should be applied now. I've briefly looked at Jeff's patch
and I don't see any problems looming there.

Jeff performed tests with Simon's original patch and his patch, and I
think the results from those tests are still valid since the basic
behavior hasn't been changed. I'll repeat those tests myself, and run
some more to see if the added CPU overhead shows up in tests, but I'm
pretty confident the patches will work together as advertised.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-28 18:57:59
Message-ID: 7521.1180378679@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Here's a new version, all known issues are now fixed. I'm now happy with
> this patch.

I'm looking this over and finding it fairly ugly from a
system-structural point of view. In particular, this has pushed the
single-global-variable StrategyHintVacuum() concept well past the
breaking point. That API was sort of tolerable for vacuum, since vacuum
isn't re-entrant and can't coexist with any other behavior within the
same backend; though I never liked the fact that it affected vacuum's
system-catalog accesses along with the intended table fetches. But now
you've got bits of code hacking the access pattern in contexts that are
far less predictable than vacuum ... and not guaranteeing to reset the
access pattern on failure, either.

I think we've got to get rid of the global variable and make the access
pattern be a parameter to StrategyGetBuffer, instead. Which in turn
suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the
next level up. This would serve directly for the heapscan case, and
for the main heap accesses in vacuum/analyze, but it might be a bit
of a pain to make index vacuuming pay attention to the strategy.
The other case that the patch addresses is COPY TO REL, which we could
handle if we were willing to pass a strategy parameter down through
heap_insert and RelationGetBufferForTuple ... is it worth the trouble?
I don't recall anyone having presented measurements to show COPY TO REL
being helped by this patch.

I don't especially care for the ring data structure being global inside
freelist.c, either. What I'm inclined to do is have the "access
strategy parameter" actually be a pointer to some struct type, with NULL
representing the default strategy. At the beginning of a bulk operation
we'd create an access strategy object, which would be internally the
current Ring data structure, but it'd be opaque to code outside
freelist.c. This'd mean that a query using two large seqscans would use
two rings not one, but that's not necessarily bad; it would in any case
make it a lot easier to generalize the strategy concept in future to
support more than one kind of non-default policy being active in
different parts of a query.
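
Very roughly, and with all the names here invented just for illustration
rather than being a worked-out API, I'm picturing something along these
lines:

#include <stdlib.h>

#define BUF_ID_NOT_SET (-1)

/* Only freelist.c would see the struct contents; the rest of the backend
 * just passes the opaque pointer around, NULL meaning the default policy. */
typedef struct BufferAccessStrategyData
{
    int     ring_size;      /* number of slots in the ring */
    int     current;        /* slot most recently handed out */
    int    *buffers;        /* buffer ids, BUF_ID_NOT_SET if not yet filled */
} BufferAccessStrategyData;

typedef BufferAccessStrategyData *BufferAccessStrategy;

/* Created once at the beginning of a bulk operation (seqscan, VACUUM, ...) */
BufferAccessStrategy
GetAccessStrategy(int ring_size)
{
    BufferAccessStrategy strategy = malloc(sizeof(BufferAccessStrategyData));
    int     i;

    strategy->ring_size = ring_size;
    strategy->current = 0;
    strategy->buffers = malloc(ring_size * sizeof(int));
    for (i = 0; i < ring_size; i++)
        strategy->buffers[i] = BUF_ID_NOT_SET;
    return strategy;
}

/* The buffer manager entry points would then grow a strategy argument,
 * e.g. ReadBufferWithStrategy(rel, blockNum, strategy) at the top and
 * StrategyGetBuffer(strategy) inside freelist.c, falling back to the
 * normal clock sweep when strategy == NULL. */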

A point I have not figured out how to deal with is that in the patch as
given, UnpinBuffer needs to know the strategy; and getting it that info
would make the changes far more invasive. But the patch's behavior here
is pretty risky anyway, since the strategy global at the time of unpin
might have nothing to do with how it was set when the buffer was
acquired. What I'm tempted to do is remove the special case there and
adjust buffer acquisition instead (maybe make it decrement the
usage_count when re-using a buffer from the ring).

Comments? I'm still playing with the ideas at this point...

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-28 19:59:43
Message-ID: 465B34AF.2020004@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> ... and not guaranteeing to reset the access pattern on failure, either.

Good catch, I thought I had that covered but apparently not.

> I think we've got to get rid of the global variable and make the access
> pattern be a parameter to StrategyGetBuffer, instead. Which in turn
> suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the
> next level up. This would serve directly for the heapscan case, and
> for the main heap accesses in vacuum/analyze, but it might be a bit
> of a pain to make index vacuuming pay attention to the strategy.

I thought of that as well, but settled on the global variable for that
same reason, the UnpinBuffer problem, and the fact that we'd end up with
quite many different ReadBuffer variants: ReadBuffer, ReadOrZeroBuffer,
ReadBufferWithStrategy, ReleaseAndReadBuffer,
ReleaseAndReadBufferWithStrategy (if we care about that).

Index vacuum code for all the indexams could be changed to use
ReadBufferWithStrategy if we took that approach, though there are quite a
lot of those calls in total. We could use a hybrid approach, where you
could use ReadBufferWithStrategy to give the strategy on a page-by-page
basis, or use SetStrategyHintVacuum/SetAccessPattern to set it for all
subsequent ReadBuffer calls.

> The other case that the patch addresses is COPY TO REL, which we could
> handle if we were willing to pass a strategy parameter down through
> heap_insert and RelationGetBufferForTuple ... is it worth the trouble?
> I don't recall anyone having presented measurements to show COPY TO REL
> being helped by this patch.

The theory was that it'll reduce the cache-spoiling of COPY, just like
it does for selects. I can run tests on that.

> I don't especially care for the ring data structure being global inside
> freelist.c, either. What I'm inclined to do is have the "access
> strategy parameter" actually be a pointer to some struct type, with NULL
> representing the default strategy. At the beginning of a bulk operation
> we'd create an access strategy object, which would be internally the
> current Ring data structure, but it'd be opaque to code outside
> freelist.c. This'd mean that a query using two large seqscans would use
> two rings not one, but that's not necessarily bad; it would in any case
> make it a lot easier to generalize the strategy concept in future to
> support more than one kind of non-default policy being active in
> different parts of a query.

I thought about the two large scans scenario, but came to the conclusion
that a single backend can't really drive two large scans at the same
time. You could have a query like "SELECT * FROM bigtable_1, bigtable_2"
with a Cartesian product, or something with a function that scans a large
table, but even then one of the scans would progress very slowly compared
to the other, and using more than one ring buffer would make no
difference.

I agree on the generalization argument, though.

> A point I have not figured out how to deal with is that in the patch as
> given, UnpinBuffer needs to know the strategy; and getting it that info
> would make the changes far more invasive. But the patch's behavior here
> is pretty risky anyway, since the strategy global at the time of unpin
> might have nothing to do with how it was set when the buffer was
> acquired. What I'm tempted to do is remove the special case there and
> adjust buffer acquisition instead (maybe make it decrement the
> usage_count when re-using a buffer from the ring).

It's assumed that the same strategy is used when unpinning, which is
true for the current usage (and apparently needs to be documented).

Normally buffers that are in the ring are recycled quickly, in which
case the usage_count will always be increased from 0->1 in UnpinBuffer,
regardless of the check. The check is there to avoid inflating the
usage_count on buffers that happened to be already in shared_buffers. It
might not be that important for the selects, but that's what keeps COPY
from bumping the usage_count all the way up to 5.
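
To illustrate what the check is doing, it amounts to roughly the
following (a simplified standalone model, not the patch code verbatim):

#include <stdbool.h>
#include <stdio.h>

#define BM_MAX_USAGE_COUNT 5    /* same cap the buffer manager uses */

static void
unpin_usage_bump(int *usage_count, bool bulk_access_pattern)
{
    if (bulk_access_pattern)
    {
        /* ring buffers still go 0 -> 1, but buffers that were already in
         * shared_buffers keep whatever count they had */
        if (*usage_count == 0)
            *usage_count = 1;
    }
    else if (*usage_count < BM_MAX_USAGE_COUNT)
        (*usage_count)++;
}

int
main(void)
{
    int ring_page = 0;      /* page just read in through the ring */
    int cached_page = 3;    /* page that was already in shared_buffers */

    unpin_usage_bump(&ring_page, true);
    unpin_usage_bump(&cached_page, true);
    printf("ring page: %d, cached page: %d\n", ring_page, cached_page); /* 1, 3 */
    return 0;
}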

If we figure out another way to deal with the COPY usage_count, maybe we
could remove the check altogether. I've been thinking of changing COPY
anyway so that it always keeps the last page it inserted into pinned,
to avoid a trip to the buffer manager for every tuple.

> Comments? I'm still playing with the ideas at this point...

Let me know if you want me to make changes and submit a new version.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-28 20:48:41
Message-ID: 10095.1180385321@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> A point I have not figured out how to deal with is that in the patch as
>> given, UnpinBuffer needs to know the strategy; and getting it that info
>> would make the changes far more invasive. But the patch's behavior here
>> is pretty risky anyway, since the strategy global at the time of unpin
>> might have nothing to do with how it was set when the buffer was
>> acquired.

> It's assumed that the same strategy is used when unpinning, which is
> true for the current usage (and apparently needs to be documented).

I don't believe that for a moment. Even in the trivial heapscan case,
the last pin is typically dropped during a tupleslot clear operation
sometime after the scan itself has moved on to the next page. In a more
general context (such as parallel heapscan and indexscan on a rel) it's
certainly too risky to assume it.

> Normally buffers that are in the ring are recycled quickly, in which
> case the usage_count will always be increased from 0->1 in UnpinBuffer,
> regardless of the check. The check is there to avoid inflating the
> usage_count on buffers that happened to be already in shared_buffers.

A heapscan would pin the buffer only once and hence bump its count at
most once, so I don't see a big problem here. Also, I'd argue that
buffers that had a positive usage_count shouldn't get sucked into the
ring to begin with.

> If we figure out another way to deal with the COPY usage_count, maybe we
> could remove the check altogether. I've been thinking of changing COPY
> anyway so that it always kept the last page it inserted to pinned, to
> avoid the traffic to buffer manager on each tuple.

This is getting fairly invasive for a part of the patch you've not
justified at all yet...

> Let me know if you want me to make changes and submit a new version.

I can work on this stuff; please do the tests to show whether it's worth
handling COPY TO REL, and keep on with Jeff's patch.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-28 21:14:49
Message-ID: 465B4649.401@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> It's assumed that the same strategy is used when unpinning, which is
>> true for the current usage (and apparently needs to be documented).
>
> I don't believe that for a moment. Even in the trivial heapscan case,
> the last pin is typically dropped during a tupleslot clear operation
> sometime after the scan itself has moved on to the next page. In a more
> general context (such as parallel heapscan and indexscan on a rel) it's
> certainly too risky to assume it.

Hmm, I guess you're right. :(

One idea is to keep track of which pins are taken using the bulk strategy.
It's a bit tricky when a buffer is pinned multiple times since we don't
know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we
could get away with just a flag per pinned buffer. Set the flag when a
buffer is pinned with bulk strategy and it wasn't pinned by us before,
and clear it when it's pinned with another strategy. I'm thinking we
steal one bit from PrivateRefCount for this.
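
A standalone sketch of what I mean; the flag and mask names are made up
for the example, and the real PrivateRefCount is of course the per-buffer
local pin count array in bufmgr.c:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BULK_PIN_FLAG   0x80000000u     /* the stolen bit */
#define REFCOUNT_MASK   0x7fffffffu

static uint32_t PrivateRefCount[16];    /* one entry per buffer (toy size) */

static void
pin(int buf, bool bulk_strategy)
{
    uint32_t count = PrivateRefCount[buf] & REFCOUNT_MASK;

    if (count == 0 && bulk_strategy)
        PrivateRefCount[buf] |= BULK_PIN_FLAG;   /* first local pin, bulk scan */
    else if (!bulk_strategy)
        PrivateRefCount[buf] &= ~BULK_PIN_FLAG;  /* some other use touched it */

    PrivateRefCount[buf] = (PrivateRefCount[buf] & BULK_PIN_FLAG) | (count + 1);
}

static bool
unpin(int buf)
{
    uint32_t count = PrivateRefCount[buf] & REFCOUNT_MASK;
    bool     was_bulk = (PrivateRefCount[buf] & BULK_PIN_FLAG) != 0;

    if (count == 1)
        PrivateRefCount[buf] = 0;    /* last local unpin clears the flag too */
    else
        PrivateRefCount[buf] = (PrivateRefCount[buf] & BULK_PIN_FLAG) | (count - 1);

    return was_bulk;    /* whether to apply the bulk-scan unpin policy */
}

int
main(void)
{
    pin(3, true);               /* seq scan pins buffer 3 through its ring */
    pin(3, false);              /* an index lookup in the same query pins it too */
    printf("%d\n", unpin(3));   /* 0: flag cleared, treat unpin as normal */
    pin(4, true);
    printf("%d\n", unpin(4));   /* 1: pinned only by the bulk scan */
    return 0;
}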

>> Normally buffers that are in the ring are recycled quickly, in which
>> case the usage_count will always be increased from 0->1 in UnpinBuffer,
>> regardless of the check. The check is there to avoid inflating the
>> usage_count on buffers that happened to be already in shared_buffers.
>
> A heapscan would pin the buffer only once and hence bump its count at
> most once, so I don't see a big problem here. Also, I'd argue that
> buffers that had a positive usage_count shouldn't get sucked into the
> ring to begin with.

True, except that with the synchronized scans patch two synchronized
scans will pin the buffer twice.

>> If we figure out another way to deal with the COPY usage_count, maybe we
>> could remove the check altogether. I've been thinking of changing COPY
>> anyway so that it always kept the last page it inserted to pinned, to
>> avoid the traffic to buffer manager on each tuple.
>
> This is getting fairly invasive for a part of the patch you've not
> justified at all yet...
>
>> Let me know if you want me to make changes and submit a new version.
>
> I can work on this stuff; please do the tests to show whether it's worth
> handling COPY TO REL, and keep on with Jeff's patch.

Ok.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-28 21:36:39
Message-ID: 10742.1180388199@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> One idea is to keep track which pins are taken using the bulk strategy.
> It's a bit tricky when a buffer is pinned multiple times since we don't
> know which ReleaseBuffer corresponds which ReadBuffer, but perhaps we
> could get away with just a flag per pinned buffer. Set the flag when a
> buffer is pinned with bulk strategy and it wasn't pinned by us before,
> and clear it when it's pinned with another strategy. I'm thinking we
> steal one bit from PrivateRefCount for this.

Seems like a mess. Why don't we just fix it so there's no need for
different behavior at Unpin time? The facts on the ground are that
the current patch's change in UnpinBuffer is a no-op anyway, because
of the tupletableslot interference.

The behavior I'm imagining is just that when we try to take a buffer
from the ring, if its usage count exceeds 1 then drop it from the ring
and get another buffer. 1 would be the expected case if no one had
touched it since we last used it.
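
As a standalone toy model (not actual bufmgr code; the clock-sweep
fallback is just a stub, and I've made it skip pinned ring slots as well,
since those can't be evicted either):

#include <stdio.h>

typedef struct
{
    int refcount;       /* pins currently held by backends */
    int usage_count;    /* clock-sweep popularity counter */
} ToyBufDesc;

static ToyBufDesc buffers[8];

/* stand-in for falling back to the normal clock sweep */
static int
clock_sweep_victim(void)
{
    return 7;
}

static int
ring_get_buffer(int *ring, int ring_size, int *cur)
{
    int buf_id;

    *cur = (*cur + 1) % ring_size;
    buf_id = ring[*cur];

    if (buf_id >= 0 &&
        buffers[buf_id].refcount == 0 &&
        buffers[buf_id].usage_count <= 1)
        return buf_id;          /* nobody touched it since last time: reuse */

    /*
     * Someone else is interested in (or still holding) the page, so leave
     * it to the main cache; a fresh victim takes over this ring slot.
     */
    buf_id = clock_sweep_victim();
    ring[*cur] = buf_id;
    return buf_id;
}

int
main(void)
{
    int ring[4] = {0, 1, 2, 3};
    int cur = -1;

    buffers[1].usage_count = 3;     /* buffer 1 became popular in the meantime */
    printf("%d\n", ring_get_buffer(ring, 4, &cur));     /* 0: recycled */
    printf("%d\n", ring_get_buffer(ring, 4, &cur));     /* 7: slot 1 replaced */
    return 0;
}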

>> A heapscan would pin the buffer only once and hence bump its count at
>> most once, so I don't see a big problem here. Also, I'd argue that
>> buffers that had a positive usage_count shouldn't get sucked into the
>> ring to begin with.

> True, except that with the synchronized scans patch two synchronized
> scans will pin the buffer twice.

Hmm. But we probably don't want the same buffer in two different
backends' rings, either. You *sure* the sync-scan patch has no
interaction with this one?

One other question: I see the patch sets the threshold for switching
from normal to ring-buffer heapscans at table size = NBuffers. Why
so high? I'd have expected maybe at most NBuffers/4 or NBuffers/10.
If you don't want a seqscan blowing out your buffer cache, you surely
don't want it blowing out 90% of the cache either.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-28 21:56:32
Message-ID: 878xb8k3xr.fsf@oxford.xeocode.com
Lists: pgsql-patches


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> A point I have not figured out how to deal with is that in the patch as
> given, UnpinBuffer needs to know the strategy; and getting it that info
> would make the changes far more invasive. But the patch's behavior here
> is pretty risky anyway, since the strategy global at the time of unpin
> might have nothing to do with how it was set when the buffer was
> acquired. What I'm tempted to do is remove the special case there and
> adjust buffer acquisition instead (maybe make it decrement the
> usage_count when re-using a buffer from the ring).

Is there a reason UnpinBuffer has to be the one to increment the usage count
anyways? Why can't ReadBuffer handle incrementing the count and just trust
that it won't be decremented until the buffer is unpinned anyways?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Patches" <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-28 22:16:13
Message-ID: 11912.1180390573@sss.pgh.pa.us
Lists: pgsql-patches

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Is there a reason UnpinBuffer has to be the one to increment the usage count
> anyways? Why can't ReadBuffer handle incrementing the count and just trust
> that it won't be decremented until the buffer is unpinned anyways?

That's a good question. I think the idea was that if we hold a buffer
pinned for awhile (long enough that the bgwriter's clock sweep passes
over it one or more times), we want the usage count decrementing to
start when we release the pin, not when we acquire it. But maybe that
could be fixed if the clock sweep doesn't touch the usage_count of a
pinned buffer. Which in fact it may not do already --- didn't look.
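
For concreteness, the variant in question would look roughly like this
toy model (not a claim about what StrategyGetBuffer does today):

#include <stdio.h>

#define NBUFFERS 4

typedef struct
{
    int refcount;       /* > 0 means some backend has the buffer pinned */
    int usage_count;
} ToyBufDesc;

static ToyBufDesc buffers[NBUFFERS];
static int next_victim = 0;

static int
sweep_for_victim(void)
{
    /* (the real code also guards against every buffer being pinned) */
    for (;;)
    {
        ToyBufDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->refcount > 0)
            continue;                       /* pinned: leave usage_count alone */
        if (buf->usage_count == 0)
            return (int) (buf - buffers);   /* found a victim */
        buf->usage_count--;                 /* unpinned and recently used: age it */
    }
}

int
main(void)
{
    buffers[0].refcount = 1;    /* pinned; its count only starts
                                 * decaying once the pin is dropped */
    buffers[1].usage_count = 1;

    printf("victim: %d\n", sweep_for_victim());     /* 2 */
    printf("victim: %d\n", sweep_for_victim());     /* 3 */
    return 0;
}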

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-29 01:50:47
Message-ID: Pine.GSO.4.64.0705282146090.28443@westnet.com
Lists: pgsql-patches

On Mon, 28 May 2007, Tom Lane wrote:

> But maybe that could be fixed if the clock sweep doesn't touch the
> usage_count of a pinned buffer. Which in fact it may not do already ---
> didn't look.

StrategyGetBuffer doesn't care whether the buffer is pinned or not; it
decrements the usage_count regardless.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-29 15:50:26
Message-ID: 20070529155026.GD4667@alvh.no-ip.org
Lists: pgsql-patches

Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> > Is there a reason UnpinBuffer has to be the one to increment the usage count
> > anyways? Why can't ReadBuffer handle incrementing the count and just trust
> > that it won't be decremented until the buffer is unpinned anyways?
>
> That's a good question. I think the idea was that if we hold a buffer
> pinned for awhile (long enough that the bgwriter's clock sweep passes
> over it one or more times), we want the usage count decrementing to
> start when we release the pin, not when we acquire it. But maybe that
> could be fixed if the clock sweep doesn't touch the usage_count of a
> pinned buffer. Which in fact it may not do already --- didn't look.

It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with
skip_pinned=false. The "lru" scan does skip pinned buffers.

--
Alvaro Herrera Developer, http://www.PostgreSQL.org/
"World domination is proceeding according to plan" (Andrew Morton)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-29 17:23:16
Message-ID: 465C6184.6020206@enterprisedb.com
Lists: pgsql-patches

Alvaro Herrera wrote:
> Tom Lane wrote:
>> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>>> Is there a reason UnpinBuffer has to be the one to increment the usage count
>>> anyways? Why can't ReadBuffer handle incrementing the count and just trust
>>> that it won't be decremented until the buffer is unpinned anyways?
>> That's a good question. I think the idea was that if we hold a buffer
>> pinned for awhile (long enough that the bgwriter's clock sweep passes
>> over it one or more times), we want the usage count decrementing to
>> start when we release the pin, not when we acquire it. But maybe that
>> could be fixed if the clock sweep doesn't touch the usage_count of a
>> pinned buffer. Which in fact it may not do already --- didn't look.
>
> It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with
> skip_pinned=false. The "lru" scan does skip pinned buffers.

You're looking at the wrong place. StrategyGetBuffer drives the clock
sweep, and it always decreases the usage_count, IOW it doesn't skip
pinned buffers. SyncOneBuffer and BgBufferSync don't decrease the
usage_count in any case.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 00:43:19
Message-ID: 1180485799.26915.102.camel@dogma.v10.wvs
Lists: pgsql-patches

On Mon, 2007-05-28 at 17:36 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > One idea is to keep track which pins are taken using the bulk strategy.
> > It's a bit tricky when a buffer is pinned multiple times since we don't
> > know which ReleaseBuffer corresponds which ReadBuffer, but perhaps we
> > could get away with just a flag per pinned buffer. Set the flag when a
> > buffer is pinned with bulk strategy and it wasn't pinned by us before,
> > and clear it when it's pinned with another strategy. I'm thinking we
> > steal one bit from PrivateRefCount for this.
>
> Seems like a mess. Why don't we just fix it so there's no need for
> different behavior at Unpin time? The facts on the ground are that
> the current patch's change in UnpinBuffer is a no-op anyway, because
> of the tupletableslot interference.
>
> The behavior I'm imagining is just that when we try to take a buffer
> from the ring, if its usage count exceeds 1 then drop it from the ring
> and get another buffer. 1 would be the expected case if no one had
> touched it since we last used it.
>
> >> A heapscan would pin the buffer only once and hence bump its count at
> >> most once, so I don't see a big problem here. Also, I'd argue that
> >> buffers that had a positive usage_count shouldn't get sucked into the
> >> ring to begin with.
>
> > True, except that with the synchronized scans patch two synchronized
> > scans will pin the buffer twice.
>
> Hmm. But we probably don't want the same buffer in two different
> backends' rings, either. You *sure* the sync-scan patch has no
> interaction with this one?
>

I will run some tests again tonight, I think the interaction needs more
testing than I did originally. Also, I'm not sure that the hardware I
have is sufficient to test those cases.

It looks like the case to worry about is when there are a large number
of scans on the same table and the I/O system is fast enough that it
causes lock contention on the buffers in the rings. Is this the case
you're worried about?

Also, keep in mind that I have added a SyncScanLock after I ran those
tests. That could have an effect.

Regards,
Jeff Davis


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 18:27:46
Message-ID: 465DC222.80309@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>>> A heapscan would pin the buffer only once and hence bump its count at
>>> most once, so I don't see a big problem here. Also, I'd argue that
>>> buffers that had a positive usage_count shouldn't get sucked into the
>>> ring to begin with.
>
>> True, except that with the synchronized scans patch two synchronized
>> scans will pin the buffer twice.
>
> Hmm. But we probably don't want the same buffer in two different
> backends' rings, either. You *sure* the sync-scan patch has no
> interaction with this one?

A buffer is only put into the ring when it's requested with
StrategyGetBuffer. If a page is requested with ReadBuffer and it's
already in cache, it won't be put in the ring. When multiple scanners are
scanning synchronously, they will all use their own, distinct rings when
reading buffers into the cache, and "peek" into the buffers in other
backends' rings for pages that others have read into the buffer cache.

I'm going to test the interactions in more detail...

> One other question: I see the patch sets the threshold for switching
> from normal to ring-buffer heapscans at table size = NBuffers. Why
> so high? I'd have expected maybe at most NBuffers/4 or NBuffers/10.
> If you don't want a seqscan blowing out your buffer cache, you surely
> don't want it blowing out 90% of the cache either.

NBuffers is the maximum value that makes sense; if you're scanning more
than NBuffers, the scan is definitely not going to fit in
shared_buffers. Anything less than that and we might be causing harm to
some use cases, so I chose that for the time being.

Simon argued for a GUC variable, and Jeff's patch as it stands
introduces one. I'm not sure we want it but if we do, we should use the
same variable to control both the sync scan and cache replacement
policy. It's essentially "how large a scan do you expect to fit in
shared_buffers?"

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 19:08:41
Message-ID: 1180552121.26915.149.camel@dogma.v10.wvs
Lists: pgsql-patches

On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:
> > Hmm. But we probably don't want the same buffer in two different
> > backends' rings, either. You *sure* the sync-scan patch has no
> > interaction with this one?
> >
>
> I will run some tests again tonight, I think the interaction needs more
> testing than I did originally. Also, I'm not sure that the hardware I
> have is sufficient to test those cases.
>

I ran some sanity tests last night with cvs head, plus my syncscan20-
cvshead.patch, plus scan_recycle_buffers.v3.patch.

It passed the sanity tests at least.

I did see that there was more interference with sync_seqscan_threshold=0
(always on) and scan_recycle_buffers=0 (off) than I had previously seen
with 8.2.4, so I will test again against 8.2.4 to see why that might be.
The interference that I saw was still quite small, a scan moving
concurrently with 9 other scans was about 10% slower than a scan running
alone -- which is still very good compared with plain cvs head and no
sync scan -- it's just not ideal.

However, varying scan_recycle_buffers between 0 (off), 16, 32, and 128
didn't have much effect. At 32 it appeared to be about 1% worse during
10 scans, but that may have been noise. The other values I tried didn't
have any difference that I could see.

This was really just a quick sanity test, I think more hard data would
be useful.

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 19:21:50
Message-ID: 16695.1180552910@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> One other question: I see the patch sets the threshold for switching
>> from normal to ring-buffer heapscans at table size = NBuffers. Why
>> so high? I'd have expected maybe at most NBuffers/4 or NBuffers/10.
>> If you don't want a seqscan blowing out your buffer cache, you surely
>> don't want it blowing out 90% of the cache either.

> NBuffers is the maximum value that makes sense; if you're scanning more
> than NBuffers, the scan is definitely not going to fit in
> shared_buffers. Anything less than that and we might be causing harm to
> some use cases, so I chose that for the time being.

But the flip side of that is you're failing to provide the benefit of
the patch in quite a lot of use-cases where it's clearly beneficial.
I just don't believe that there are very many cases where people will
want a heapscan to eat 90% of their cache.

> Simon argued for a GUC variable, and Jeff's patch as it stands
> introduces one. I'm not sure we want it but if we do, we should use the
> same variable to control both the sync scan and cache replacement
> policy. It's essentially "how large a scan do you expect to fit in
> shared_buffers?"

Well, let's do some experiments and see if there's really any point in
varying the cutover.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 19:36:54
Message-ID: 465DD256.3010702@enterprisedb.com
Lists: pgsql-patches

Jeff Davis wrote:
> On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:
>>> Hmm. But we probably don't want the same buffer in two different
>>> backends' rings, either. You *sure* the sync-scan patch has no
>>> interaction with this one?
>>>
>> I will run some tests again tonight, I think the interaction needs more
>> testing than I did originally. Also, I'm not sure that the hardware I
>> have is sufficient to test those cases.
>>
>
> I ran some sanity tests last night with cvs head, plus my syncscan20-
> cvshead.patch, plus scan_recycle_buffers.v3.patch.
>
> It passed the sanity tests at least.
>
> I did see that there was more interference with sync_seqscan_threshold=0
> (always on) and scan_recycle_buffers=0 (off) than I had previously seen
> with 8.2.4, so I will test again against 8.2.4 to see why that might be.
> The interference that I saw was still quite small, a scan moving
> concurrently with 9 other scans was about 10% slower than a scan running
> alone -- which is still very good compared with plain cvs head and no
> sync scan -- it's just not ideal.
>
> However, turning scan_recycle_buffers between 0 (off), 16, 32, and 128
> didn't have much effect. At 32 it appeared to be about 1% worse during
> 10 scans, but that may have been noise. The other values I tried didn't
> have any difference that I could see.
>
> This was really just a quick sanity test, I think more hard data would
> be useful.

The interesting question is whether the small buffer ring is big enough
to let all synchronized scans process a page before it's recycled. Keep
an eye on pg_buffercache to see if shared_buffers gets filled with
pages from the table you're querying.

I just ran a quick test with 4 concurrent scans on a dual-core system,
and it looks like we do "leak" buffers from the rings because they're
pinned at the time they would be recycled. A full scan of a 30GB table
took just under 7 minutes, and starting after a postmaster restart it
took ~4-5 minutes until all of the 320MB of shared_buffers were used.
That means we're leaking a buffer from the ring very roughly on every
20-40th ReadBuffer call, but I'll put in some proper instrumentation and
test with different configurations to get a better picture.

The synchronized scans give such a big benefit when they're applicable
that I think some cache-spoiling is acceptable and in fact unavoidable in
some scenarios. It's much better than the 8.2 behavior anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 19:56:38
Message-ID: 17918.1180554998@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I just ran a quick test with 4 concurrent scans on a dual-core system,
> and it looks like we do "leak" buffers from the rings because they're
> pinned at the time they would be recycled.

Yeah, I noticed the same in some tests here. I think there's not a lot
we can do about that; we don't have enough visibility into why someone
else has the buffer pinned.

Using a larger ring would help, by making it less probable that any
other sync-scanning backend is so far behind as to still have the oldest
element of our ring pinned. But if we do that we have the L2-cache-size
effect to worry about. Is there any actual data backing up that it's
useful to keep the ring fitting in L2, or is that just guesswork? In
the sync-scan case the idea seems pretty bogus anyway, because the
actual working set will be N backends' rings not just one.

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 20:43:15
Message-ID: 1180557795.26915.157.camel@dogma.v10.wvs
Lists: pgsql-patches

On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
> In
> the sync-scan case the idea seems pretty bogus anyway, because the
> actual working set will be N backends' rings not just one.

I don't follow. Ideally, in the sync-scan case, the sets of buffers in
the ring of different scans on the same relation will overlap
completely, right?

We might not be at the ideal, but the sets of buffers in the rings
shouldn't be disjoint, should they?

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 21:45:51
Message-ID: 21802.1180561551@sss.pgh.pa.us
Lists: pgsql-patches

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
>> In the sync-scan case the idea seems pretty bogus anyway, because the
>> actual working set will be N backends' rings not just one.

> I don't follow. Ideally, in the sync-scan case, the sets of buffers in
> the ring of different scans on the same relation will overlap
> completely, right?

> We might not be at the ideal, but the sets of buffers in the rings
> shouldn't be disjoint, should they?

According to Heikki's explanation here
http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php
each backend doing a heapscan would collect its own ring of buffers.
You might have a few backends that are always followers, never leaders,
and so never actually fetch any pages --- but for each backend that
actually did any I/O there would be a separate ring. In practice I'd
expect the lead would "change hands" pretty often and so you'd see all
the backends accumulating their own rings.

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-30 22:23:58
Message-ID: 1180563838.26915.192.camel@dogma.v10.wvs
Lists: pgsql-patches

On Wed, 2007-05-30 at 17:45 -0400, Tom Lane wrote:
> According to Heikki's explanation here
> http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php
> each backend doing a heapscan would collect its own ring of buffers.
> You might have a few backends that are always followers, never leaders,
> and so never actually fetch any pages --- but for each backend that
> actually did any I/O there would be a separate ring. In practice I'd
> expect the lead would "change hands" pretty often and so you'd see all
> the backends accumulating their own rings.
>

Oh, I see what you mean. The rings will actually become sparsely
populated with many concurrent scans, and therefore extend outside of the
L2 cache.

Regards,
Jeff Davis


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-05-31 07:40:15
Message-ID: 465E7BDF.9060907@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> I just ran a quick test with 4 concurrent scans on a dual-core system,
>> and it looks like we do "leak" buffers from the rings because they're
>> pinned at the time they would be recycled.
>
> Yeah, I noticed the same in some tests here. I think there's not a lot
> we can do about that; we don't have enough visibility into why someone
> else has the buffer pinned.

We could stash pinned buffers on some other list and try them again
later, but that gets a lot more complex.

> Using a larger ring would help, by making it less probable that any
> other sync-scanning backend is so far behind as to still have the oldest
> element of our ring pinned. But if we do that we have the L2-cache-size
> effect to worry about. Is there any actual data backing up that it's
> useful to keep the ring fitting in L2, or is that just guesswork? In
> the sync-scan case the idea seems pretty bogus anyway, because the
> actual working set will be N backends' rings not just one.

Yes, I tested different ring sizes here:
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00469.php

The tests above showed the effect when reading a table from OS cache. I
haven't seen direct evidence supporting Luke's claim that the ring makes
scans of tables bigger than RAM go faster with bigger I/O hardware,
because I don't have such hardware at hand. We did repeat the tests on
different hardware however, and monitored the CPU usage with vmstat at
the same time. The CPU usage was significantly lower with the patch, so
I believe that with better I/O hardware the test would become limited by
CPU and the patch would therefore make it go faster.

BTW, we've been talking about the "L2 cache effect" but we don't really
know for sure if the effect has anything to do with the L2 cache. But
whatever it is, it's real.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Luke Lonergan <llonergan(at)greenplum(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-06-02 17:47:58
Message-ID: C286FB5E.3234D%llonergan@greenplum.com
Lists: pgsql-patches

Hi All,

On 5/31/07 12:40 AM, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> wrote:

> BTW, we've been talking about the "L2 cache effect" but we don't really
> know for sure if the effect has anything to do with the L2 cache. But
> whatever it is, it's real.

The mailing list archives contain ample evidence that:
- it's definitely an L2 cache effect
- on fast I/O hardware, tests show large benefits from keeping the ring in L2

I see no reason to re-open the discussion about these; can we accept them
as fact and continue?

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Seq scans status update
Date: 2007-06-02 18:08:45
Message-ID: 25616.1180807725@sss.pgh.pa.us
Lists: pgsql-patches

Luke Lonergan <llonergan(at)greenplum(dot)com> writes:
> The mailing list archives contain the ample evidence of:
> - it's definitely an L2 cache effect
> - on fast I/O hardware tests show large benefits of keeping the ring in L2

Please provide some specific pointers, because I don't remember that.

regards, tom lane