Initial prefetch performance testing

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Initial prefetch performance testing
Date: 2008-09-22 08:57:32
Message-ID: Pine.GSO.4.64.0809220317320.20434@westnet.com
Lists: pgsql-hackers

The complicated patch I've been working with for a while now is labeled
"sequential scan posix fadvise" in the CommitFest queue. There are a lot
of parts to that, going back to last December, and I've added the most
relevant links to the September CommitFest page.

The first message there on this topic is
http://archives.postgresql.org/message-id/87ve7egxow.fsf@oxford.xeocode.com
which contains a program from Greg Stark that measures how much prefetch
advisory information improves the overall transfer speed on a synthetic
random read benchmark. The idea is that you advise the OS about up to n
requests at a time, where n goes from 1 (no prefetch at all) to 8192. As
n goes up, the total net bandwidth usually goes up as well. You can
basically divide the bandwidth at any prefetch level by the baseline (1=no
prefetch) to get a speedup multiplier. The program allows you to submit
both unsorted and sorted requests, and the speedup is pretty large and
similarly distributed (but of different magnitude) in both cases.
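
For anyone who hasn't read the program: the heart of the technique is just
a sliding window of advice issued ahead of the reads. Here's a minimal
standalone sketch of the Linux variant (simplified from the real
prefetch.c, with my own names and no error handling):

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCKSZ 8192

    /* Read nblocks blocks in the order given by blocknum[], keeping
     * advice outstanding for up to "window" blocks ahead of each read.
     * window = 1 advises each block just before reading it, i.e. no
     * useful prefetch; larger windows give the OS something to work on.
     */
    static void scan_with_prefetch(int fd, const long *blocknum,
                                   long nblocks, long window)
    {
        char buf[BLOCKSZ];
        long reading, advising = 0;

        for (reading = 0; reading < nblocks; reading++)
        {
            /* top up the advice window ahead of the current read */
            while (advising < nblocks && advising < reading + window)
            {
                posix_fadvise(fd, (off_t) blocknum[advising] * BLOCKSZ,
                              BLOCKSZ, POSIX_FADV_WILLNEED);
                advising++;
            }
            pread(fd, buf, BLOCKSZ, (off_t) blocknum[reading] * BLOCKSZ);
        }
    }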

While not a useful PostgreSQL patch on its own, this program lets one
figure out whether the basic idea, advising the OS about blocks ahead of
time to speed up the whole thing, works on a particular system without
having to cope with a larger test. What I have to report here are some
results from many systems running both Linux and Solaris with various
numbers of disk spindles. The Linux systems use the posix_fadvise call,
while the Solaris ones use its aio library.
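
On the aio side, the "advice" is instead a real asynchronous read into a
scratch buffer, so that the later blocking read is served from cache.
Roughly like this; the real program also manages its outstanding control
blocks and reaps them with aio_error()/aio_return(), which I've omitted
here:

    #include <aio.h>
    #include <string.h>

    #define BLOCKSZ 8192
    #define SLOTS   64      /* max outstanding requests in this sketch */

    static struct aiocb cbs[SLOTS];
    static char bufs[SLOTS][BLOCKSZ];

    /* Start an asynchronous read of one block into a scratch buffer. */
    static void advise_willneed_aio(int fd, long blocknum, int slot)
    {
        struct aiocb *cb = &cbs[slot % SLOTS];

        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_offset = (off_t) blocknum * BLOCKSZ;
        cb->aio_buf    = bufs[slot % SLOTS];
        cb->aio_nbytes = BLOCKSZ;
        aio_read(cb);       /* fire and, for this sketch, forget */
    }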

Using the maximum prefetch working set tested, 8192, here's the speedup
multiplier on this benchmark for both sorted and unsorted requests using
an 8GB file:

   OS               Spindles  Unsorted X  Sorted X
1: Linux                1         2.3        2.1
2: Linux                1         1.5        1.0
3: Solaris              1         2.6        3.0
4: Linux                3         6.3        2.8
5: Linux (Stark)        3         5.3        3.6
6: Linux               10         5.4        4.9
7: Solaris*            48        16.9        9.2

Systems (1)-(3) are standard single-disk workstations with various speed
and size disks. (4) is a 3-disk software RAID0 (on an Areca card in JBOD
mode). (5) is the system Greg Stark originally reported his results on,
which is also a 3-disk array of some sort. (6) uses a Sun 2640 disk array
with a 10-disk RAID0+1 setup, while (7) is a Sun Fire X4500 with 48 disks
in a giant RAID-Z array.

The Linux systems drop the OS cache after each run; they're all running
kernel 2.6.18 or later, which has that feature. Solaris system (3) is
using the UFS filesystem with the default tuning, which doesn't cache
enough information for that to be necessary[1]--the results look very
similar to the Linux case even without explicitly dropping the cache.
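
Dropping the cache just means writing to /proc/sys/vm/drop_caches as
root; for anyone scripting their own runs, the C equivalent of the usual
"sync; echo 3 > /proc/sys/vm/drop_caches" is:

    #include <stdio.h>
    #include <unistd.h>

    /* Flush dirty pages, then ask the kernel to drop the clean page
     * cache along with dentries and inodes.  Requires root and a
     * kernel with /proc/sys/vm/drop_caches support.
     */
    static int drop_os_cache(void)
    {
        FILE *f;

        sync();
        f = fopen("/proc/sys/vm/drop_caches", "w");
        if (f == NULL)
            return -1;
        fputs("3\n", f);
        return fclose(f);
    }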

* For (7) the results there showed obvious caching (>150MB/s), as I
expected from Solaris's ZFS which does cache aggressively by default. In
order to get useful results with the server's 16GB of RAM, I increased the
test file to 64GB, at which point the results looked reasonable.

Comparing with a prefetch working set of 256, which from eyeballing my
results spreadsheet is roughly the point of best return on prefetch effort
before improvements leveled off, the speedups looked like this:

   OS               Spindles  Unsorted X  Sorted X
1: Linux                1         2.3        2.0
2: Linux                1         1.5        0.9
3: Solaris              1         2.5        3.3
4: Linux                3         5.8        2.6
5: Linux (Stark)        3         5.6        3.7
6: Linux               10         5.7        5.1
7: Solaris             48        10.0        7.8

Observations:

-For the most part, using the fadvise/aio technique was a significant win
even on single-disk systems. The worst result, on system (2) with sorted
blocks, was basically break-even within the measurement tolerance here:
94% of the no-prefetch rate is the worst result I saw, but all of these
bounced around by about +/- 5%, so I wouldn't read too much into that. In
every other case, there was at least a 50% speed increase even with a
single disk.

-As Greg Stark suggested, the larger the spindle count the larger the
speedup, and the larger the prefetch size that might make sense. His
suggestion to model the user GUC as "effective_spindle_count" looks like a
good one. The sequential scan fadvise implementation patch submitted uses
the earlier preread_pages name for that parameter, which I agree seems
less friendly.

-The Solaris aio implementation seems to perform a bit better relative to
no prefetch than the Linux fadvise one. I'm left wondering whether that's
just a Solaris vs. Linux thing, in particular whether it's some lucky
caching on Solaris where the cache isn't completely cleared, or whether
Linux's aio library might work better than its fadvise call does.

The attached archive file includes a couple of useful bits for anyone who
wants to try this test on their hardware. I think I've filed down all the
rough edges, so it should be easy for someone else to run this test now.
It includes:

-prefetch.c is a slightly modified version of the original test program.
I fixed a couple of minor bugs in the parameter input/output code that
only showed up under some platform combinations; the actual prefetch
implementation is untouched.

-prefetchtest is a shell script that compiles the program and runs it
against a full range of prefetch sizes. Just run it and tell it where you
want the test data file to go (with an optional size that defaults to
8GB), and it produces an output file named prefetch-results.csv with all
the results in it.

-I included all of the raw data for the various systems I tested so other
testers have baselines to compare against. Also included is an OpenOffice
spreadsheet that compares all the results and computes the ratios shown
above.

Conclusion: on all the systems I tested, this approach gave excellent
results, which makes me feel confident that I should see a corresponding
speedup on database-level tests that use this same basic technique. I'm
not sure whether it might make sense to bundle this test program up
somehow so others can use it for similar compatibility tests (I'm thinking
of something similar to contrib/test_fsync); I'll revisit that after the
rest of the review.

Next step: I've got two data sets (one generated, one real-world sample)
that should demonstrate a useful heap scan prefetch speedup, and one test
program I think will demonstrate whether the sequential scan prefetch code
works right. Now that I've vetted all the hardware/OS combinations, I hope
I can squeeze that in this week; I don't need to test all of them now that
I know which are the interesting systems.

As far as other platforms go, I should get a Mac OS system in the near
future to test on as well (once I have the database tests working; not
worth scheduling yet), but as it will only have a single disk, that will
basically just be a compatibility test rather than a serious performance
one. It would be nice to get a report from someone running FreeBSD to see
what's needed to make the test script run on that OS.

[1] http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_best :
Page 8 of the presentation covers just how limited the default UFS cache
tuning is.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

Attachment Content-Type Size
fadvise-prefetch.tar.gz application/octet-stream 16.9 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 10:02:16
Message-ID: 1222077736.4445.148.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:

> -As Greg Stark suggested, the larger the spindle count the larger the
> speedup, and the larger the prefetch size that might make sense. His
> suggestion to model the user GUC as "effective_spindle_count" looks like a
> good one. The sequential scan fadvise implementation patch submitted uses
> the earlier preread_pages name for that parameter, which I agree seems
> less friendly.

Good news about the testing.

I'd prefer to set this as a tablespace level storage parameter, since
that is where it would need to live once we have multiple tablespaces.
Specifically as a storage parameter, so we have the same syntax for
table-level and tablespace-level storage parameters. That would also
allow us to have tablespace-level defaults for table-level settings.

prefetch_... is a much better name since it's an existing industry term.
I'm not in favour of introducing the concept of spindles, since I can
almost hear the questions about ramdisks and memory-based storage. Plus,
I don't ever want to discover that the best setting for
effective_spindles is 7 (or 5) when I have 6 disks because of some
technology shift or Postgres behaviour change in the future.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 11:34:04
Message-ID: C3F7C763-36B0-4327-A4FD-636F3CFC83CD@cybertec.at
Lists: pgsql-hackers


On Sep 22, 2008, at 12:02 PM, Simon Riggs wrote:

>
> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>
>> -As Greg Stark suggested, the larger the spindle count the larger the
>> speedup, and the larger the prefetch size that might make sense. His
>> suggestion to model the user GUC as "effective_spindle_count" looks
>> like a
>> good one. The sequential scan fadvise implementation patch
>> submitted uses
>> the earlier preread_pages name for that parameter, which I agree
>> seems
>> less friendly.
>
> Good news about the testing.

Absolutely; we ran tests and got similar figures.
Also, I/O is much more stable and steady with the patch.

>
> I'd prefer to set this as a tablespace level storage parameter. Since
> that is where it would need to live when we have multiple tablespaces.
> Specifically as a storage parameter, so we have same syntax for
> table-level and tablespace-level storage parameters. That would also
> allow us to have tablespace-level defaults for table-level settings.
>

+1

> prefetch_... is a much better name since its an existing industry
> term.
> I'm not in favour of introducing the concept of spindles, since I can
> almost hear the questions about ramdisks and memory-based storage.
> Plus
> I don't ever want to discover that the best setting for
> effective_spindles is 7 (or 5) when I have 6 disks because of some
> technology shift or postgres behaviour change in the future.

I would definitely avoid the use of "spindles".
I totally agree with Simon here. Once mature SSD storage or some
in-memory stuff becomes available to the masses, this will not be
suitable anymore.
The best thing would be to simply use the parameter as it was in the
original patch.
Maybe we should simply make the parameter adjustable per table and per
index. That would automatically cover 95% of all cases, such as
clustered tables and so on.

many thanks and best regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: www.postgresql-support.de


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 15:46:23
Message-ID: 877i94kwts.fsf@oxford.xeocode.com
Lists: pgsql-hackers


Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>
>> -As Greg Stark suggested, the larger the spindle count the larger the
>> speedup, and the larger the prefetch size that might make sense. His
>> suggestion to model the user GUC as "effective_spindle_count" looks like a
>> good one. The sequential scan fadvise implementation patch submitted uses
>> the earlier preread_pages name for that parameter, which I agree seems
>> less friendly.
>
> Good news about the testing.
>
> I'd prefer to set this as a tablespace level storage parameter.

Sounds like a good idea, except... what's a tablespace level storage parameter?

> prefetch_... is a much better name since its an existing industry term.
> I'm not in favour of introducing the concept of spindles, since I can
> almost hear the questions about ramdisks and memory-based storage. Plus
> I don't ever want to discover that the best setting for
> effective_spindles is 7 (or 5) when I have 6 disks because of some
> technology shift or postgres behaviour change in the future.

In principle I quite strongly disagree with this.

Someone might very well want to set spindle_count to 6 when he actually has 7
but at least he can have an intuitive feel for what he's doing -- he's setting
it to slightly less than Postgres thinks is optimal.

Number of blocks to prefetch is an internal implementation detail for which
the DBA has absolutely no way to know the correct value. That's how we get
the cargo-cult configuration tweaks we've seen in the past, where people
follow recommendations with no idea what the consequences are or whether
they apply.

In an ideal world we would have a half-dozen parameters to tell Postgres
how much memory is available, how many disks are available, etc., and
Postgres would know how best to use the resources. I think if we expose
internal knobs like you propose, then we end up with hundreds of
parameters, and to adjust them you'll have to be an expert in Postgres
internals.

That said, there is a place for these internal knobs when we don't really know
how to best make use of resources. At this point we only have results from a
few systems and the results don't seem to jibe with the theory.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 16:28:31
Message-ID: 1222100911.4445.200.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Mon, 2008-09-22 at 16:46 +0100, Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>
> > On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
> >
> >> -As Greg Stark suggested, the larger the spindle count the larger the
> >> speedup, and the larger the prefetch size that might make sense. His
> >> suggestion to model the user GUC as "effective_spindle_count" looks like a
> >> good one. The sequential scan fadvise implementation patch submitted uses
> >> the earlier preread_pages name for that parameter, which I agree seems
> >> less friendly.
> >
> > Good news about the testing.
> >
> > I'd prefer to set this as a tablespace level storage parameter.
>
> Sounds like a good idea, except... what's a tablespace level storage parameter?

A storage parameter, just at tablespace level.

WITH (storage_parameter = value)

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 17:06:12
Message-ID: Pine.GSO.4.64.0809221137050.10700@westnet.com
Lists: pgsql-hackers

On Mon, 22 Sep 2008, Simon Riggs wrote:

> I'd prefer to set this as a tablespace level storage parameter.

That seems reasonable, but I'm not working at that level yet. There are
still larger open questions about how the buffer manager interaction
will work here, and I'd like to have a better view of that before
getting into the exact syntax used to set the parameter. For now, a GUC
works well enough, but you're right that something finer-grained may make
sense before this actually hits the codebase.

> prefetch_... is a much better name since its an existing industry term.
> I'm not in favour of introducing the concept of spindles, since I can
> almost hear the questions about ramdisks and memory-based storage.

It's possible to make a case for exposing the internal number that's
getting varied here, naming the parameter something like prefetch_depth,
and letting people set that to whatever they want. Based on the current
data I might suggest a default of 256, using 0 to turn the feature off
altogether, and a maximum of at least 8192 and possibly more.
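
In guc.c terms that's just an ordinary integer entry. A sketch using
those numbers (the variable, name, group, and description here are mine,
not from any submitted patch):

    {
        {"prefetch_depth", PGC_USERSET, RESOURCES,
            gettext_noop("Number of block prefetch requests to keep in flight."),
            NULL
        },
        &prefetch_depth,
        256, 0, 8192, NULL, NULL
    },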

In practice I expect there to only be a couple of popular values and the
idea of fine-tuning is a bit questionable. I think that's what Greg Stark
was driving at with how the value was re-spun. Instead of using
effective_spindle_count, you could just as easily make a case for an enum
like [off,low,medium,high] mapping to [0,16,256,8192]. From what I've
seen so far, that would reduce tweaking time in the field considerably
while not really changing the range of available behavior very much.

I will be running a set of tests on a fast SSD device before I'm done;
that's another one I'll try once I've got the database-level tests ready
to run. What I expect is that it will favor 0: presumably you might as
well just read the blocks rather than advise about them when the seek
overhead is close to zero. I should be able to do a RAM disk run as
well.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 17:09:01
Message-ID: 8763onvzyp.fsf@oxford.xeocode.com
Lists: pgsql-hackers


[resending due to the attachment being too large for the -hackers list --
weren't we going to raise it when we killed -patches?]

Greg Smith <gsmith(at)gregsmith(dot)com> writes:

> Using the maximum prefetch working set tested, 8192, here's the speedup
> multiplier on this benchmark for both sorted and unsorted requests using
> an 8GB file:
>
>    OS               Spindles  Unsorted X  Sorted X
> 1: Linux                1         2.3        2.1
> 2: Linux                1         1.5        1.0
> 3: Solaris              1         2.6        3.0
> 4: Linux                3         6.3        2.8
> 5: Linux (Stark)        3         5.3        3.6
> 6: Linux               10         5.4        4.9
> 7: Solaris*            48        16.9        9.2

Incidentally I've been looking primarily at the sorted numbers because they
parallel bitmap heap scans. (Note that the heap scan is only about half the
i/o of a bitmap index scan + heap scan so even if it's infinitely faster it'll
only halve the time spent in the two nodes.)

Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out
at only 10x the bandwidth of one drive. I would expect more like 24x or more.
I wonder if Solaris's aio has an internal limit on how many pending i/o
requests it can handle. Perhaps it's a tunable?

Unfortunately I don't see a convenient, low-invasive way to integrate aio
into Postgres. With posix_fadvise we can just issue the advice and then
forget about it. But with aio we would pretty much have to pick a target
buffer, pin it, issue the aio, and then remember the pin later when we
need to read the buffer. That would require restructuring the code
significantly. I'm quite surprised Solaris doesn't support
posix_fadvise -- perhaps it's in some other version of Solaris?

Here's a graph of results from this program for various sized arrays on a
single machine:

http://wiki.postgresql.org/images/a/a3/Results.svg

Each colour corresponds to an array of a different number of spindles ranging
from 1 to 15 drives. The X axis is how much prefetching was done and the Y
axis is the bandwidth obtained.

There is a distinct maximum and then dropoff and it would be great to get some
data points for larger arrays to understand where that maximum goes as the
array gets larger.

> Conclusion: on all the systems I tested on, this approach gave excellent
> results, which makes me feel confident that I should see a corresponding
> speedup on database-level tests that use this same basic technique. I'm not
> sure whether it might make sense to bundle this test program up somehow so
> others can use it for similar compatibility tests (I'm thinking of something
> similar to contrib/test_fsync), will revisit that after the rest of the review.
>
> Next step: I've got two data sets (one generated, one real-world sample) that
> should demonstrate a useful heap scan prefetch speedup, and one test program I
> think will demonstrate whether the sequential scan prefetch code works right.
> Now that I've vetted all the hardware/OS combinations I hope I can squeeze that
> in this week, I don't need to test all of them now that I know which are the
> interesting systems.

I have an updated patch I'll be sending along shortly. You might want to test
with that?


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 18:14:18
Message-ID: 1222107258.4445.222.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Mon, 2008-09-22 at 13:06 -0400, Greg Smith wrote:

> > prefetch_... is a much better name since its an existing industry term.
> > I'm not in favour of introducing the concept of spindles, since I can
> > almost hear the questions about ramdisks and memory-based storage.
>
> It's possible to make a case for exposing the internal number that's
> getting varied here, naming the parameter something like prefetch_depth,
> and letting people set that to whatever they want. Based on the current
> data I might suggest a default of 256, using 0 to turn the feature off
> altogether, and a maximum of at least 8192 and possibly more.
>
> In practice I expect there to only be a couple of popular values and the
> idea of fine-tuning is a bit questionable. I think that's what Greg Stark
> was driving at with how the value was re-spun. Instead of using
> effective_spindle_count, you could just as easily make a case for an enum
> like [off,low,medium,high] mapping to [0,16,256,8192]. From what I've
> seen so far, that would reduce tweaking time in the field considerably
> while not really changing the range of available behavior very much.

Tuning Postgres I/O already involves quite a few parameters such as
buffer size, segment width, stripe size, etc. I've never heard a disk
manufacturer say this is wrong and that we should just have
"spindle equivalents". I don't think we should dress this up too much,
that's all. We aren't going to make anybody's life any easier. But we
will probably generate lots of annoying phone calls to disk
manufacturers asking "so how many spindles is your subsystem worth in
Postgres terms?", to which they will shrug and say "no idea".

Is the behaviour of this sufficiently linear to be able to say that 3
spindles = 3 effective_spindles, and 6 = 6, etc.? I would guess it isn't,
and then you're left with a name more misleading than useful.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 18:15:55
Message-ID: 48D7E0DB.4040109@cheapcomplexdevices.com
Lists: pgsql-hackers

Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> I'm not in favour of introducing the concept of spindles....
>
> In principle I quite strongly disagree with this....
> Number of blocks to prefetch is an internal implementation detail that the DBA
> has absolutely no way to know what the correct value is.

Even more often on systems I see these days, "spindles" is an
implementation detail whose correct value the DBA has no way
to know.

For example, on our sites hosted with Amazon's compute cloud (a great
place to host web sites), I know nothing about spindles, but I do know
about Amazon Elastic Block Stores[2] and Instance Stores[1]. I
have some specs and am able to run benchmarks on them, but I couldn't
guess how many spindles my X% of the N-disk device corresponds
to. For another example, some of our sales guys with SSD drives
have 0 spindles on their demo machines.

I'd rather have a parameter that expressed things more in terms of
measurable quantities -- perhaps seeks/second? perhaps
random-access/sequential-access times?

[1] http://www.amazon.com/gp/browse.html?node=201590011
[2] http://www.amazon.com/b/ref=sc_fe_c_0_201590011_1?ie=UTF8&node=689343011&no=201590011&me=A36L942TSJ2AJA


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 19:04:56
Message-ID: Pine.GSO.4.64.0809221429550.1125@westnet.com
Lists: pgsql-hackers

On Mon, 22 Sep 2008, Gregory Stark wrote:

> Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out
> at only 10x the bandwidth of one drive. I would expect more like 24x or more.

The ZFS RAID-Z implementation doesn't really scale that linearly. It's
rather hard to get the full bandwidth out of an X4500 with any single
process, and I haven't done any filesystem tuning to improve
things--everything is at the defaults.

> I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps
> it's in some other version of Solaris?

Both the systems I used were standard Solaris 10 boxes, and I'm not aware
of any changes in this area in the later OpenSolaris releases (which is
where I'd expect something like this to change first). The test program I
tried failed its #ifdef POSIX_FADV_WILLNEED check, and the message I saw
from you at
http://archives.postgresql.org/message-id/877imua265.fsf@oxford.xeocode.com
suggested you didn't find any fadvise either, so I didn't look much
further.

The above is a cue for someone from Sun to chime in on this subject.

> I have an updated patch I'll be sending along shortly. You might want to test
> with that?

Obviously I've got everything set up to test right now, and am currently
analyzing your earlier patch and the sequential scan fork that derived
from it. If you've got a later version of the bitmap heap scan one as
well, I'll replace the one I had been planning to test (your
bitmap-preread-v9) with that one when it's available.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 20:04:25
Message-ID: 873ajsx7zq.fsf@oxford.xeocode.com
Lists: pgsql-hackers


Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:

> For example, on our sites hosted with Amazon's compute cloud (a great
> place to host web sites), I know nothing about spindles, but know
> about Amazon Elastic Block Store[2]'s and Instance Store's[1]. I
> have some specs and are able to run benchmarks on them; but couldn't
> guess how many spindles my X% of the N-disk device that corresponds
> to.

Well I don't see how you're going to guess how much prefetching is optimal for
those environments either...

> For another example, some of our salesguys with SSD drives
> have 0 spindles on their demo machines.

Sounds to me like you're finding it pretty intuitive. Actually, you would
want "1" because it can handle one request at a time. And if you have a
multipath array, I imagine you would want to think of each interface as a
spindle, because that's the bottleneck and you'll want to keep all the
interfaces busy.

> I'd rather a parameter that expressed things more in terms of
> measurable quantities -- perhaps seeks/second? perhaps
> random-access/sequential-access times?

Well that's precisely what I'm saying. Simon et al want a parameter to control
how much prefetching to do. That's *not* a measurable quantity. I'm suggesting
effective_spindle_count which *is* a measurable quantity even if it might be a
bit harder to measure in some environments than others.

The two other quantities you describe are both currently represented by
our random_page_cost (or the random_page_cost/seq_page_cost ratio). What
we're dealing with now is an entirely orthogonal property of your system:
how many concurrent requests the system can handle.

If you have ten spindles then you really want to send enough requests to
ensure there are ten concurrent requests being processed on ten different
drives (assuming you want each scan to make maximum use of the resources which
is primarily true in DSS but might not be true in OLTP). That's a lot more
than ten requests though because if you sent ten requests many of them would
end up on the same devices.

In theory my logic led me to think for ten drives it would be about 30.
Experiments seem to show it's more like 300-400. That discrepancy might be
a reason to put this debate aside for now anyway and expose the internal
implementation until we understand better what's going on there.
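
For the curious, the theoretical figure comes from coupon-collector style
arithmetic: requests land on drives more or less at random, so keeping n
drives busy takes roughly n times the n-th harmonic number of outstanding
requests. A standalone sketch of that calculation (the shape of the
argument, not necessarily the exact formula in the patch):

    #include <stdio.h>

    /* Expected number of outstanding requests needed to have at least
     * one queued on each of n drives, when requests are spread over
     * the drives uniformly at random: n * H(n).
     */
    static double prefetch_target(int n)
    {
        double h = 0.0;
        int    i;

        for (i = 1; i <= n; i++)
            h += 1.0 / i;
        return n * h;
    }

    int main(void)
    {
        printf("%.1f\n", prefetch_target(10));  /* about 29.3 */
        return 0;
    }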

Ironically, I'm pretty happy to lose this argument because EDB is
interested in rolling this into its dynamic tuning module. If there's a
consensus -- by my count three people have spoken up already, which is
more than usual -- then I'll gladly concede. Anyone object to going back
to preread_pages? Or should it be prefetch_pages? prefetch_blocks?
Something else?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-22 20:49:37
Message-ID: 48D804E1.3080802@cheapcomplexdevices.com
Lists: pgsql-hackers

Gregory Stark wrote:
> Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
>> I'd rather a parameter that expressed things more in terms of
>> measurable quantities [...]
>
> ...What we're
> dealing with now is an entirely orthogonal property of your system: how many
> concurrent requests can the system handle.

Really? I'd have thought you'd want to give the OS enough guesses
about the future that its elevator algorithms for the drive heads
don't keep seeking back and forth, but rather do as much per sweep
across a device as they can.

> Ironically I'm pretty happy to lose this argument because EDB is interested in
> rolling this into its dynamic tuning module. If there's a consensus -- by my
> count three people have spoken up already which is more than usual -- then
> I'll gladly concede. Anyone object to going back to preread_pages? Or should
> it be prefetch_pages? prefetch_blocks? Something else?

Well, as you pointed out, I'm not on their side of the debate either.
I'm not sure what a relevant measurable parameter would be, so I'm not
being too helpful in the conversation.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-23 12:26:58
Message-ID: 48D8E092.2020301@sun.com
Lists: pgsql-hackers

Greg Smith wrote:
> On Mon, 22 Sep 2008, Gregory Stark wrote:

>
>> I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps
>> it's in some other version of Solaris?

Solaris has only a fake variant of posix_fadvise. See
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c

UFS has its own optimizations. For example, if it detects a sequential
scan, the file cache is limited to 80kB. It also has prefetching.

ZFS has intelligent read-pattern recognition algorithms and other
improvements.

Zdenek


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-23 13:33:32
Message-ID: 87skrrugur.fsf@oxford.xeocode.com
Lists: pgsql-hackers

Greg Smith <gsmith(at)gregsmith(dot)com> writes:

> On Mon, 22 Sep 2008, Gregory Stark wrote:
>
>> Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out
>> at only 10x the bandwidth of one drive. I would expect more like 24x or more.
>
> The ZFS RAID-Z implementation doesn't really scale that linearly. It's rather
> hard to get the full bandwidth out of a X4500 with any single process, and I
> haven't done any filesystem tuning to improve things--everything is at the
> defaults.

Well, random access i/o will fall pretty far short of the full bandwidth.
Actually this is a major issue: our seq_page_cost vs. random_page_cost
dichotomy doesn't really work when we're prefetching pages.

In my experiments an array capable of supplying about 1.4GB/s in sequential
i/o could only muster about 40MB/s of random i/o with prefetching, and only
about 5MB/s without.

For this machine we would have quite a dilemma setting random_page_cost --
do we set it to 280 (the no-prefetch ratio of 1.4GB/s to 5MB/s) or 35 (the
prefetching ratio of 1.4GB/s to 40MB/s)?

Perhaps access paths which expect to be able to prefetch most of their
accesses should use random_page_cost / effective_spindle_count for their i/o
costs?

But then if people don't set random_page_cost high enough, they could
easily find themselves with random fetches being costed as less expensive
than sequential fetches. And I have a feeling it'll be a hard sell to get
people to set random_page_cost in the double digits, let alone triple
digits.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-23 13:49:57
Message-ID: 27545.1222177797@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Perhaps access paths which expect to be able to prefetch most of their
> accesses should use random_page_cost / effective_spindle_count for their i/o
> costs?

> But then if people don't set random_page_cost high enough they could easily
> find themselves with random fetches being costed as less expensive than
> sequential fetches. And I have a feeling it'll be a hard sell to get people to
> set random_page_cost in the double digits let alone triple digits.

Well, we could use something like
Max(random_page_cost / effective_spindle_count, seq_page_cost)
to ensure the result remains somewhat sane.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-23 14:38:40
Message-ID: 87ljxjudu7.fsf@oxford.xeocode.com
Lists: pgsql-hackers


Greg Smith <gsmith(at)gregsmith(dot)com> writes:

>> I have an updated patch I'll be sending along shortly. You might want to test
>> with that?
>
> Obviously I've got everything setup to test right now, am currently analyzing
> your earlier patch and the sequential scan fork that derived from it. If
> you've got a later version of the bitmap heap scan one as well, I'll replace
> the one I had been planning to test (your bitmap-preread-v9) with that one when
> it's available.

Well here you go. It includes:

. Bitmap heap scans (as before)
. Index scans
. Setting POSIX_FADV_SEQUENTIAL for bulk sequential scans
. Improved (i.e., debugged) autoconf tests for posix_fadvise
(and posix_fallocate, though I don't have any code using it yet)

The bitmap heap scans are still prefetching the number of buffers I estimate
based on effective_spindle_count according to my magic formula. We've been
discussing throwing that out, I just haven't modified this to do that yet.

Index scans prefetch all pages for matching index tuples on the leaf page when
we do page-at-a-time scans. I haven't bothered doing the gradual ramp-up or
keeping a ring of the optimal size prefetched.

The sequential scan stuff is based on Zoltan's posts but done in a different
way. It passes an i/o access strategy to smgr and fd.c which keeps track of
what the previous strategy was and calls posix_fadvise if it's changed. This
will correctly handle queries which reference the same table twice even if one
reference is a sequential scan and the other is an index lookup.

I have *not* been able to observe any significant effect from
POSIX_FADV_SEQUENTIAL, but I'm not sure under what circumstances it was a
problem. It sounds like a peculiar situation which is not easy to reliably
reproduce.

Attachment Content-Type Size
bitmap-preread-v18.diff.gz application/octet-stream 13.6 KB

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, zb(at)cybertec(dot)at, postgres(at)cybertec(dot)at
Subject: Re: Initial prefetch performance testing
Date: 2008-09-23 15:25:15
Message-ID: Pine.GSO.4.64.0809231054190.22330@westnet.com
Lists: pgsql-hackers

On Tue, 23 Sep 2008, Gregory Stark wrote:

> I have *not* been able to observe any significant effect from
> POSIX_FADV_SEQUENTIAL but I'm not sure what circumstances it was a problem. It
> sounds like it's a peculiar situation which is not easy to reliably reproduce.

Zoltan, Hans-Juergen: would it be possible for you to try the latest
bitmap-preread-v18.diff.gz patch Greg Stark just sent over to the list?
It's at
http://archives.postgresql.org/message-id/87ljxjudu7.fsf@oxford.xeocode.com
as well. That's a refinement of the original strategy you used, and I'd
be curious to hear whether it still works usefully on the troublesome
workload you submitted your original patch against. Since none of the
rest of us have so far been successful in replicating the large speed-up
on multiple concurrent sequential scans you reported, I think you're the
best candidate to see if there was any regression because of how the
patch was refactored.

I'm excited to see index scans in the new patch as well, since I've got
1TB of test data navigated exactly that way to test with.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-24 14:28:54
Message-ID: 200809241428.m8OESsi22353@momjian.us
Lists: pgsql-hackers

Ron Mayer wrote:
> Even more often on systems I see these days, "spindles"
> is an implementation detail that the DBA has no way to know
> what the correct value is.
>
> For example, on our sites hosted with Amazon's compute cloud (a great
> place to host web sites), I know nothing about spindles, but know
> about Amazon Elastic Block Store[2]'s and Instance Store's[1]. I
> have some specs and are able to run benchmarks on them; but couldn't
> guess how many spindles my X% of the N-disk device that corresponds
> to. For another example, some of our salesguys with SSD drives
> have 0 spindles on their demo machines.
>
> I'd rather a parameter that expressed things more in terms of
> measurable quantities -- perhaps seeks/second? perhaps
> random-access/sequential-access times?

I assume SAN users might not know the number of spindles either.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-24 14:33:06
Message-ID: 200809241433.m8OEX6E29070@momjian.us
Lists: pgsql-hackers

Gregory Stark wrote:
>
> Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
>
> > For example, on our sites hosted with Amazon's compute cloud (a great
> > place to host web sites), I know nothing about spindles, but know
> > about Amazon Elastic Block Store[2]'s and Instance Store's[1]. I
> > have some specs and are able to run benchmarks on them; but couldn't
> > guess how many spindles my X% of the N-disk device that corresponds
> > to.
>
> Well I don't see how you're going to guess how much prefetching is optimal for
> those environments either...
>
> > For another example, some of our salesguys with SSD drives
> > have 0 spindles on their demo machines.
>
> Sounds to me like you're finding it pretty intuitive. Actually you would want
> "1" because it can handle one request at a time. Actually if you have a
> multipath array I imagine you would want to think of each interface as a
> spindle because that's the bottleneck and you'll want to keep all the
> interfaces busy.

I assume everyone would want at least one because you would want the
page to be prefetched while you are processing the existing page. I
think a larger problem is that it is likely the page prefetch will take
longer than processing the existing page, which might mean that you need
to prefetch a few pages ahead to allow read reordering for better
performance.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-24 14:42:37
Message-ID: 48DA51DD.2060608@enterprisedb.com
Lists: pgsql-hackers

Bruce Momjian wrote:
> Ron Mayer wrote:
>> Even more often on systems I see these days, "spindles"
>> is an implementation detail that the DBA has no way to know
>> what the correct value is.
>>
>> For example, on our sites hosted with Amazon's compute cloud (a great
>> place to host web sites), I know nothing about spindles, but know
>> about Amazon Elastic Block Store[2]'s and Instance Store's[1]. I
>> have some specs and are able to run benchmarks on them; but couldn't
>> guess how many spindles my X% of the N-disk device that corresponds
>> to. For another example, some of our salesguys with SSD drives
>> have 0 spindles on their demo machines.
>>
>> I'd rather a parameter that expressed things more in terms of
>> measurable quantities -- perhaps seeks/second? perhaps
>> random-access/sequential-access times?
>
> I assume SAN users might not know the number of spindles either.

Yeah. Nevertheless, I like the way effective_spindle_count works, as
opposed to an unintuitive "number of blocks to prefetch" (assuming the
formula we use to turn the former into the latter works). Perhaps we
should keep the meaning the same, but call it "effective_io_concurrency"?
Something that conveys the idea of "how many simultaneous I/O requests
the I/O subsystem can handle", without referring to any specific
technology. That concept applies to SANs and RAM drives as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-09-24 15:15:39
Message-ID: 1222269339.4445.597.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Wed, 2008-09-24 at 17:42 +0300, Heikki Linnakangas wrote:

> Yeah. Nevertheless I like the way effective_spindle_count works, as
> opposed to an unintuitive "number of blocks to prefetch" (assuming the
> formula we use to turn the former into latter works). Perhaps we should
> keep the meaning the same, but call it "effective_io_concurrency"?
> Something that conveys the idea of "how many simultaneous I/O requests
> the I/O subsystem can handle", without referring to any specific
> technology. That concept applies to SANs and RAM drives as well.

You've spoiled all the fun now with a good suggestion.

I was looking forward to the Jules Verne-like nostalgia of the other
suggestion over the years to come.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial prefetch performance testing
Date: 2008-10-01 11:02:10
Message-ID: 87r6701sul.fsf@oxford.xeocode.com
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> On Mon, 2008-09-22 at 16:46 +0100, Gregory Stark wrote:
>
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>
>> > I'd prefer to set this as a tablespace level storage parameter.
>>
>> Sounds like a good idea, except... what's a tablespace level storage parameter?
>
> A storage parameter, just at tablespace level.
>
> WITH (storage_parameter = value)

I still think this is a good idea, but there are unanswered questions
about it. Surely whatever we do with this parameter also holds for
random_page_cost and seq_page_cost as well?

Should they remain GUC parameters at all? If so, how would setting any of them
locally interact with the tablespace parameter for tables used in the query?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!