Initial prefetch performance testing

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Initial prefetch performance testing
Date: 2008-09-22 08:57:32
Message-ID: Pine.GSO.4.64.0809220317320.20434@westnet.com
Lists: pgsql-hackers

The complicated patch I've been working with for a while now is labeled
"sequential scan posix fadvise" in the CommitFest queue. There are a lot
of parts to that, going back to last December, and I've added the most
relevant links to the September CommitFest page.

The first message there on this topic is
http://archives.postgresql.org/message-id/87ve7egxow.fsf@oxford.xeocode.com
which is a program from Greg Stark that measures how much prefetching
advisory information improves the overall transfer speed on a synthetic
random read benchmark. The idea is that you advise the OS about up to n
requests at a time, where n goes from 1 (no prefetch at all) to 8192. As
n goes up, the total net bandwidth usually goes up as well. You can
basically divide the bandwidth at any prefetch level by the baseline (1=no
prefetch) to get a speedup multiplier. The program allows you to submit
both unsorted and sorted requests, and the speedup is pretty large and
similarly distributed (but of different magnitude) in both cases.
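
To make that division concrete, here is a minimal Python sketch of the
speedup calculation. The function name and the bandwidth figures are made
up for illustration; they are not measurements from any system discussed
in this message.

```python
def speedup_multipliers(bandwidth_by_prefetch):
    """Divide each bandwidth by the baseline measured at prefetch level 1."""
    baseline = bandwidth_by_prefetch[1]  # 1 = no prefetch at all
    return {n: bw / baseline
            for n, bw in sorted(bandwidth_by_prefetch.items())}

# Hypothetical bandwidths in MB/s, keyed by prefetch level n (not real data).
example = {1: 2.0, 16: 5.0, 256: 9.0, 8192: 10.0}
multipliers = speedup_multipliers(example)
print(multipliers[256])  # 9.0 / 2.0
```

A multiplier near 1.0 means prefetching made no difference at that level,
and anything below 1.0 means it was slower than the baseline.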

While not a useful PostgreSQL patch on its own, this program lets one
figure out if the basic idea here, advise about blocks ahead of time to
speed up the whole thing, works on a particular system without having to
cope with a larger test. What I have to report here are some results from
many systems running both Linux and Solaris with various numbers of disk
spindles. The Linux systems use the posix_fadvise call, while the Solaris
ones use the Solaris aio library.
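
As a rough illustration of the fadvise side of this, the Python sketch
below keeps up to n advised requests outstanding while reading blocks in a
given order. It is not Greg Stark's program: the function name, the 8192
byte block size and the windowing details are my assumptions, and it needs
a Unix system where os.posix_fadvise and os.pread are available. The
Solaris aio path is not shown.

```python
import os
import random
import tempfile

BLOCK_SIZE = 8192  # assumed block size for this sketch

def read_with_prefetch(fd, block_numbers, n):
    """Read blocks in the given order, keeping up to n advised requests
    outstanding. n == 1 issues no advice (the no-prefetch baseline)."""
    total = 0
    advised = 0  # index of the next block not yet advised about
    can_advise = n > 1 and hasattr(os, "posix_fadvise")
    for i, block in enumerate(block_numbers):
        if can_advise:
            # Top up the window so advice covers blocks i .. i+n-1.
            limit = min(i + n, len(block_numbers))
            while advised < limit:
                os.posix_fadvise(fd, block_numbers[advised] * BLOCK_SIZE,
                                 BLOCK_SIZE, os.POSIX_FADV_WILLNEED)
                advised += 1
        total += len(os.pread(fd, BLOCK_SIZE, block * BLOCK_SIZE))
    return total

# Tiny demonstration on a scratch file. It is far too small to measure
# anything; it only shows the calls working.
with tempfile.TemporaryFile() as scratch:
    scratch.write(os.urandom(64 * BLOCK_SIZE))
    scratch.flush()
    unsorted_blocks = random.sample(range(64), 16)
    bytes_unsorted = read_with_prefetch(scratch.fileno(), unsorted_blocks, 8)
    bytes_sorted = read_with_prefetch(scratch.fileno(),
                                      sorted(unsorted_blocks), 1)
```

A real measurement would time each call against a file much larger than
RAM and start every run with a cold cache.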

Using the maximum prefetch working set tested, 8192, here's the speedup
multiplier on this benchmark for both sorted and unsorted requests using an
8GB file:

OS               Spindles  Unsorted X  Sorted X
1:Linux                 1         2.3       2.1
2:Linux                 1         1.5       1.0
3:Solaris               1         2.6       3.0
4:Linux                 3         6.3       2.8
5:Linux (Stark)         3         5.3       3.6
6:Linux                10         5.4       4.9
7:Solaris*             48        16.9       9.2

Systems (1)-(3) are standard single-disk workstations with various speed
and size disks. (4) is a 3-disk software RAID0 (on an Areca card in JBOD
mode). (5) is the system Greg Stark originally reported his results on,
which is also a 3-disk array of some sort. (6) uses a Sun 2640 disk array
with a 10 disk RAID0+1 setup, while (7) is a Sun Fire X4500 with 48 disks
in a giant RAID-Z array.

The Linux systems drop the OS cache after each run; they're all running
kernel 2.6.18 or higher, which has that feature. Solaris system (3) is using
the UFS filesystem with the default tuning, which doesn't cache enough
information for that to be necessary[1]--the results look very similar to
the Linux case even without explicitly dropping the cache.
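
For anyone unfamiliar with cache dropping on Linux, here is a minimal
Python sketch. I am assuming the feature meant here is the
/proc/sys/vm/drop_caches interface; the function name is hypothetical, and
writing to that file requires root.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"  # Linux-only; writing needs root

def drop_os_cache(path=DROP_CACHES):
    """Flush dirty pages, then ask the kernel to drop clean cached data.
    Returns False instead of raising if the interface cannot be written."""
    os.sync()  # only clean pages are dropped, so write dirty ones back first
    try:
        with open(path, "w") as f:
            f.write("3\n")  # 3 = page cache plus dentries and inodes
    except OSError:
        return False
    return True
```

Calling drop_os_cache() between runs keeps one run's reads from being
served out of memory left over from the previous run.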

* For (7), the results with the 8GB file showed obvious caching (>150MB/s),
as I expected from Solaris's ZFS, which caches aggressively by default. In
order to get useful results with the server's 16GB of RAM, I increased the
test file to 64GB, at which point the results looked reasonable.

For comparison, here are the speedups with a prefetch working set of 256,
which I eyeballed on my results spreadsheet as the best return on prefetch
effort before improvements leveled off:

OS               Spindles  Unsorted X  Sorted X
1:Linux                 1         2.3       2.0
2:Linux                 1         1.5       0.9
3:Solaris               1         2.5       3.3
4:Linux                 3         5.8       2.6
5:Linux (Stark)         3         5.6       3.7
6:Linux                10         5.7       5.1
7:Solaris              48        10.0       7.8

Observations:

-For the most part, using the fadvise/aio technique was a significant win
even on single disk systems. The worst result, on system (2) with sorted
blocks, was basically break even within the measurement tolerance here:
94% of the no prefetch rate is the worst result I saw, but all these
bounced around about +/- 5% so I wouldn't read too much into that. In
every other case, there was at least a 50% speed increase even with a
single disk.

-As Greg Stark suggested, the larger the spindle count the larger the
speedup, and the larger the prefetch size that might make sense. His
suggestion to model the user GUC as "effective_spindle_count" looks like a
good one. The sequential scan fadvise implementation patch submitted uses
the earlier preread_pages name for that parameter, which I agree seems
less friendly.

-The Solaris aio implementation seems to perform a bit better relative to
no prefetch than the Linux fadvise one. I'm left wondering a bit about
whether that's just a Solaris vs. Linux thing, in particular whether
that's just some lucky caching on Solaris where the cache isn't completely
cleared, or whether Linux's aio library might work better than its fadvise
call does.

The attached archive file includes a couple of useful bits for anyone who
wants to try this test on their hardware. I think I filed down all the
rough edges here, and it should be real easy for someone else to run this
test now. It includes:

-prefetch.c is a slightly modified version of the original test program.
I fixed a couple of minor bugs in the parameter input/output code that
only showed up under some platform combinations; the actual prefetch
implementation is untouched.

-prefetchtest is a shell script that compiles the program and runs it
against a full range of prefetch sizes. Just run it and tell it where you
want the test data file to go (with an optional size that defaults to
8GB), and it produces an output file named prefetch-results.csv with all
the results in it.

-I included all of the raw data for the various systems I tested so other
testers have baselines to compare against. An OpenOffice spreadsheet that
compares all the results and computes the ratios shown above is also
included.

Conclusion: on all the systems I tested on, this approach gave excellent
results, which makes me feel confident that I should see a corresponding
speedup on database-level tests that use this same basic technique. I'm
not sure whether it might make sense to bundle this test program up
somehow so others can use it for similar compatibility tests (I'm thinking
of something similar to contrib/test_fsync); I'll revisit that after the
rest of the review.

Next step: I've got two data sets (one generated, one real-world sample)
that should demonstrate a useful heap scan prefetch speedup, and one test
program I think will demonstrate whether the sequential scan prefetch code
works right. Now that I've vetted all the hardware/OS combinations, I hope
I can squeeze that in this week; I don't need to test on all of them, since
I now know which are the interesting systems.

As far as other platforms go, I should get a Mac OS system in the near
future to test on as well (once I have the database tests working, not
worth scheduling yet), but as it will only have a single disk that will
basically just be a compatibility test rather than a serious performance
one. Would be nice to get a report from someone running FreeBSD to see
what's needed to make the test script run on that OS.

[1] http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_best :
Page 8 of the presentation covers just how limited the default UFS cache
tuning is.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

Attachment Content-Type Size
fadvise-prefetch.tar.gz application/octet-stream 16.9 KB
