Re: random_page_cost vs seq_page_cost

From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: random_page_cost vs seq_page_cost
Date: 2012-01-05 10:04:05
Message-ID: 20120105100405.GJ15490@ldn-qws-004.delacy.com
Lists: pgsql-hackers

Hello list,

I have a question of how to benchmark hardware to determine
the appropriate ratio of seq_page_cost vs random_page_cost.

Emails in this mailing list's archive seem to indicate that
1.0 vs 3.0 - 4.0 are appropriate values on modern hardware.

That surprised me a bit, as I had thought that on actual
hard drives (ignoring SSDs) random_page_cost would be higher.
I guess the number tries to reflect caching of the relevant
pages in memory, and on modern hardware you have more of that?

Anyhow it would be great to have a scientific way of
setting those numbers.

Background:

We have recently upgraded two of our biggest postgres databases
to new hardware, together with a minor version bump (8.4.5 -> 8.4.9).

We are experiencing a big performance regression in some queries.
In those cases the planner seems to choose a nested loop index
scan instead of hashing the index once and then joining.

The new hardware was optimized for seq scans and does those
very fast.

We are not sure whether the database used to choose differently
before the move to the new hardware. So it could be that the new
hardware is performing worse for random seeks, or that the planner
is now making different choices.

As a counter measure we are experimenting with
enable_nestloop = off
random_page_cost = 20 (instead of the previous 4).

It is worth noting that for many small tables the nestloop
is indeed marginally faster (in the "doesn't really matter
because both cases are fast enough" sense). But the regression
for the big tables (big in the sense that the index just fits
into memory, but in practice might not because there are other
frequently accessed big tables) is a show stopper.

For some of those tables we have also recently (as part
of the move) clustered for the first time in ages, and it was
speculated that that might have changed statistics (such
as correlation) and increased the attractiveness of the
index scan to the planner.

Another thing that I have thought might provide some
enlightenment would be a

explain log select ...

That would show all the sub plans considered and why they
were discarded, or something approximating this.
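
Right now the closest approximation I know of is to flip the planner
switches in a single session and compare the EXPLAIN ANALYZE output by
hand. A rough sketch of what I mean (the table and column names below
are invented for illustration, not our real schema):

SET enable_nestloop = off;   -- session-local, postgresql.conf is untouched
EXPLAIN ANALYZE
  SELECT *
  FROM orders o
  JOIN order_lines l ON l.order_id = o.order_id
  WHERE o.trade_date = '2012-01-04';

RESET enable_nestloop;       -- and again with the default plan, for comparison
EXPLAIN ANALYZE
  SELECT *
  FROM orders o
  JOIN order_lines l ON l.order_id = o.order_id
  WHERE o.trade_date = '2012-01-04';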

Thanks in advance for any reply, and sorry that this email
turned out to be a rather long stream-of-consciousness dump.

Bene

------ relevant parts of our postgresql.conf -----------

shared_buffers = 12GB # min 128kB
# (change requires restart)
temp_buffers = 512MB # min 800kB
#max_prepared_transactions = 0 # zero disables the feature
# (change requires restart)
# Note: Increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
# It is not advisable to set max_prepared_transactions nonzero unless you
# actively intend to use prepared transactions.
work_mem = 192MB # min 64kB
maintenance_work_mem = 1GB # min 1MB
#max_stack_depth = 2MB # min 100kB

# - Kernel Resource Usage -

#max_files_per_process = 1000 # min 25
# (change requires restart)
#shared_preload_libraries = '' # (change requires restart)

# - Cost-Based Vacuum Delay -

vacuum_cost_delay = 100ms # 0-100 milliseconds
vacuum_cost_page_hit = 1 # 0-10000 credits
vacuum_cost_page_miss = 10 # 0-10000 credits
vacuum_cost_page_dirty = 20 # 0-10000 credits
vacuum_cost_limit = 7500 # 1-10000 credits

# - Background Writer -

#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100 # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0 # 0-10.0 multiplier on buffers scanned/round

# - Asynchronous Behavior -

effective_io_concurrency = 40 # 1-1000. 0 disables prefetching

# WRITE AHEAD LOG

fsync = on # turns forced synchronization on or off
synchronous_commit = off # immediate fsync at commit
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
# fdatasync
# fsync
# fsync_writethrough
# open_sync
full_page_writes = on # recover from partial page writes
wal_buffers = 16MB # min 32kB
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds

commit_delay = 1000 # range 0-100000, in microseconds
commit_siblings = 5 # range 1-1000

# - Checkpoints -

checkpoint_segments = 128 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 10min # range 30s-1h
checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 30s # 0 disables

# - Archiving -

archive_mode = off # allows archiving to be done

# QUERY TUNING

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 20.0 # same scale as above
#cpu_tuple_cost = 0.01 # same scale as above
#cpu_index_tuple_cost = 0.005 # same scale as above
#cpu_operator_cost = 0.0025 # same scale as above
effective_cache_size = 32GB # dpowers: set to 2/3 of available ram

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5 # range 1-10
#geqo_pool_size = 0 # selects default based on effort
#geqo_generations = 0 # selects default based on effort
#geqo_selection_bias = 2.0 # range 1.5-2.0

# - Other Planner Options -

default_statistics_target = 1000 # range 1-10000
#constraint_exclusion = partition # on, off, or partition
cursor_tuple_fraction = 1.0 # range 0.0-1.0
#from_collapse_limit = 8
#join_collapse_limit = 8 # 1 disables collapsing of explicit
# JOIN clauses

Machine:
OS: linux 2.6.32-71.29.1
CPU 12 x Intel(R) Xeon(R) CPU X5680 @ 3.33GHz Cache size: 12288 KB
RAM: 47.12 GB


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-05 14:32:30
Message-ID: CA+TgmoY4DBtqTX+ee8CMA55aoY169EX2wdXRSAv6twvdwg4TxQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 5, 2012 at 5:04 AM, Benedikt Grundmann
<bgrundmann(at)janestreet(dot)com> wrote:
> We are experiencing a big performance regression in some queries.
> In those cases the planner seems to choose a nested loop index
> scan instead of hashing the index once and then joining.

I think you probably need to post EXPLAIN ANALYZE output from the
actual queries to get useful advice, probably to pgsql-performance,
rather than here.

It's hard to believe that enable_nestloop=off is doing anything other
than masking whatever the real problem is, but it's hard to tell what
that problem is based on the information provided.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-05 14:34:27
Message-ID: 20120105143427.GG25235@ldn-qws-004.delacy.com
Lists: pgsql-hackers

On 05/01/12 10:04, Benedikt Grundmann wrote:
>
> As a counter measure we are experimenting with
> enable_nestloop = off
> random_page_cost = 20 (instead of the previous 4).
>
For what it is worth we had to revert the enable_nestloop = off
change. It just moved the pain around by making other queries
perform much worse than before.


From: Jeremy Harris <jgh(at)wizmail(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-05 23:00:47
Message-ID: 4F062B9F.3050001@wizmail.org
Lists: pgsql-hackers

On 2012-01-05 10:04, Benedikt Grundmann wrote:
> I have a question of how to benchmark hardware to determine
> the appropriate ratio of seq_page_cost vs random_page_cost.

It'd be really nice if the DBMS measured actual experienced values......

--
Jeremy


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-06 00:17:20
Message-ID: 4F063D90.2090405@agliodbs.com
Lists: pgsql-hackers

On 1/5/12 3:00 PM, Jeremy Harris wrote:
> On 2012-01-05 10:04, Benedikt Grundmann wrote:
>> I have a question of how to benchmark hardware to determine
>> the appropriate ratio of seq_page_cost vs random_page_cost.
>
> It'd be really nice if the DBMS measured actual experienced values......

Certainly it would. It would also require a whole lot of
instrumentation. Feel free to write some ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-07 21:01:26
Message-ID: 1325970086.14090.1.camel@vanquo.pezone.net
Lists: pgsql-hackers

On Thu, 2012-01-05 at 10:04 +0000, Benedikt Grundmann wrote:
> We have recently upgraded two of our biggest postgres databases
> to new hardware, together with a minor version bump (8.4.5 -> 8.4.9).
>
> We are experiencing a big performance regression in some queries.
> In those cases the planner seems to choose a nested loop index
> scan instead of hashing the index once and then joining.

There was a planner regression introduced in version 8.4.8, which was
thought to be fixed in 8.4.9. Maybe you got caught by that. See

Message-Id: <760C0206-B5F4-4DC6-9296-B7A730B7F403(at)silentmedia(dot)com>

for some information. Check if your queries match that pattern.


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-09 16:05:46
Message-ID: 20120109160546.GF6419@ldn-qws-004.delacy.com
Lists: pgsql-hackers

On 07/01/12 23:01, Peter Eisentraut wrote:
> On Thu, 2012-01-05 at 10:04 +0000, Benedikt Grundmann wrote:
> > We have recently upgraded two of our biggest postgres databases
> > to new hardware, together with a minor version bump (8.4.5 -> 8.4.9).
> >
> > We are experiencing a big performance regression in some queries.
> > In those cases the planner seems to choose a nested loop index
> > scan instead of hashing the index once and then joining.
>
> There was a planner regression introduced in version 8.4.8, which was
> thought to be fixed in 8.4.9. Maybe you got caught by that. See
>
> Message-Id: <760C0206-B5F4-4DC6-9296-B7A730B7F403(at)silentmedia(dot)com>
>
> for some information. Check if your queries match that pattern.

Good idea. But that is not it. We checked by using 8.4.5 on
the new hardware (and the new database) which produced the same
(bad) plans as 8.4.10 (with both the old and the new postgres config).

We are again speculating that it might be:
> > For some of those tables we have also recently (as part
> > of the move) clustered for the first time in ages, and it was
> > speculated that that might have changed statistics (such
> > as correlation) and increased the attractiveness of the
> > index scan to the planner.

Is that possible? If so, what is the best way to prove or disprove
this theory? And ideally, if it is true, what knobs are available
to tune this?
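
One thing that seems worth checking is the correlation the planner
actually sees in pg_stats for the clustered tables, before and after a
fresh ANALYZE. Roughly (with 'orders' as a made-up stand-in for one of
our big tables):

ANALYZE orders;   -- refresh the statistics first

SELECT tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY attname;

As far as I understand it, a correlation close to 1.0 (or -1.0) is what
CLUSTER tends to produce for the clustered columns, and it is also what
makes index scans look cheap to the planner.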

Thanks,

Bene


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-10 20:22:41
Message-ID: 4F0C9E11.8090900@2ndQuadrant.com
Lists: pgsql-hackers

On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> I have a question of how to benchmark hardware to determine
> the appropriate ratio of seq_page_cost vs random_page_cost.
>
> Emails in this mailing list's archive seem to indicate that
> 1.0 vs 3.0 - 4.0 are appropriate values on modern hardware.
>
> That surprised me a bit, as I had thought that on actual
> hard drives (ignoring SSDs) random_page_cost would be higher.
> I guess the number tries to reflect caching of the relevant
> pages in memory, and on modern hardware you have more of that?

That sort of thing is one reason why all attempts so far to set
random_page_cost based on physical characteristics haven't gone anywhere
useful. The setting is sort of overloaded right now, it's a fuzzy mix
of true random seek cost blended with some notion of cache percentage.
Trying to bring some measurements to bear on it is a less effective
approach than what people actually do here. Monitor the profile of
query execution, change the value, see what happens. Use that as
feedback for what direction to keep going; repeat until you're just
spinning with no improvements.
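
A minimal sketch of that loop for a single representative query (the
query, the schema, and the values below are placeholders, nothing more):

SET random_page_cost = 2.0;
EXPLAIN ANALYZE SELECT count(*)
  FROM orders o JOIN order_lines l ON l.order_id = o.order_id;

SET random_page_cost = 6.0;   -- did the plan flip?  did the runtime move?
EXPLAIN ANALYZE SELECT count(*)
  FROM orders o JOIN order_lines l ON l.order_id = o.order_id;

RESET random_page_cost;

Then watch what happens to the rest of the workload before making any
of it permanent in postgresql.conf.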

It's easy to measure the actual read times and set the value based on
that instead. But that doesn't actually work out so well. There's at
least three problems in that area:

-Timing information is sometimes very expensive to collect. I expect
to at least document and usefully quantify that as part of a 9.2 feature.

-Basing query execution decisions on what is already in the cache leads
to all sorts of nasty feedback situations where you optimize for the
short term, for example using an index already in cache, while never
reading in what would be a superior long term choice because it seems
too expensive.

-Making a major adjustment to the query planning model like this would
require a large performance regression testing framework to evaluate the
results in.

> We are not sure whether the database used to choose differently
> before the move to the new hardware. So it could be that the new
> hardware is performing worse for random seeks, or that the planner
> is now making different choices.

I don't recommend ever deploying new hardware without first doing some
low-level benchmarks to validate its performance. Once stuff goes into
production, you can't do that anymore. See
http://www.2ndquadrant.com/en/talks/ for my hardware benchmarking talks
if you'd like some ideas on what to collect.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-11 08:26:52
Message-ID: 20120111082651.GV6419@ldn-qws-004.delacy.com
Lists: pgsql-hackers

(replying just to you)
On 10/01/12 15:22, Greg Smith wrote:
> On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> That sort of thing is one reason why all attempts so far to set
> random_page_cost based on physical characteristics haven't gone
> anywhere useful. The setting is sort of overloaded right now, it's a
> fuzzy mix of true random seek cost blended with some notion of cache
> percentage. Trying to bring some measurements to bear on it is a less
> effective approach than what people actually do here. Monitor the
> profile of query execution, change the value, see what happens. Use
> that as feedback for what direction to keep going; repeat until
> you're just spinning with no improvements.
>
Thank you very much for the reply; it is very interesting. I'm
excited to hear that documentation in that area will improve in
9.2. It's interesting that postgres has remarkably good documentation,
but it is a sufficiently complex system that to actually tune the
knobs provided sensibly you have to understand quite a lot about
what is going on. A colleague of mine likes to say
"all abstractions leak", which seems very appropriate in this case.

> >We are not sure whether the database used to choose differently
> >before the move to the new hardware. So it could be that the new
> >hardware is performing worse for random seeks, or that the planner
> >is now making different choices.
>
> I don't recommend ever deploying new hardware without first doing
> some low-level benchmarks to validate its performance. Once stuff
> goes into production, you can't do that anymore. See
> http://www.2ndquadrant.com/en/talks/ for my hardware benchmarking
> talks if you'd like some ideas on what to collect.
>
We had actually done lots of tests on the sequential read performance.
But you are right we could have done better (and I'll definitely read
through your talks).

Did you see my follow up? Based on the feedback we did further tests,
and it is now clear that neither the hardware nor the database version
is at fault. A different plan is chosen by both the new and the old
database version if spun up on the database as it is right now.

Our best guess is that the CLUSTERs we ran after we had moved to the
new hardware (it having more disk space and faster sequential I/O made
this possible) changed the planner's perception of how the joins will
perform in relation to each other.

Cheers,

Bene


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-01-11 08:29:02
Message-ID: 20120111082902.GW6419@ldn-qws-004.delacy.com
Lists: pgsql-hackers

On 11/01/12 08:26, Benedikt Grundmann wrote:
> (replying just to you)
Clearly I didn't. Sigh. Getting myself a coffee now.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-07 20:23:47
Message-ID: 20120207202347.GA7929@momjian.us
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 08:26:52AM +0000, Benedikt Grundmann wrote:
> (replying just to you)
> On 10/01/12 15:22, Greg Smith wrote:
> > On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
> > That sort of thing is one reason why all attempts so far to set
> > random_page_cost based on physical characteristics haven't gone
> > anywhere useful. The setting is sort of overloaded right now, it's a
> > fuzzy mix of true random seek cost blended with some notion of cache
> > percentage. Trying to bring some measurements to bear on it is a less
> > effective approach than what people actually do here. Monitor the
> > profile of query execution, change the value, see what happens. Use
> > that as feedback for what direction to keep going; repeat until
> > you're just spinning with no improvements.
> >
> Thank you very much for the reply; it is very interesting. I'm
> excited to hear that documentation in that area will improve in
> 9.2. It's interesting that postgres has remarkably good documentation,
> but it is a sufficiently complex system that to actually tune the
> knobs provided sensibly you have to understand quite a lot about
> what is going on. A colleague of mine likes to say
> "all abstractions leak", which seems very appropriate in this case.

Where did you see that there will be an improvement in the 9.2
documentation? I don't see an improvement.

I looked over the random_page_cost documentation and remembered I was
always concerned about how vague it was about caching effects, so I
wrote the attached doc patch to explicitly state it.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Attachment: random (text/plain, 746 bytes)

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To:
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-07 22:06:18
Message-ID: 4F31A05A.1060506@2ndQuadrant.com
Lists: pgsql-hackers

On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> Where did you see that there will be an improvement in the 9.2
> documentation? I don't see an improvement.

I commented that I'm hoping for an improvement in the documentation of
how much timing overhead impacts attempts to measure this area better.
That's from the "add timing of buffer I/O requests" feature submission.
I'm not sure if Bene read too much into that or not; I didn't mean to
imply that the docs around random_page_cost have gotten better.

This particular complaint is extremely common though, seems to pop up on
one of the lists a few times each year. Your suggested doc fix is fine
as a quick one, but I think it might be worth expanding further on this
topic. Something discussing SSDs seems due here too. Here's a first
draft of a longer discussion, to be inserted just after where it states
the default value is 4.0:

True random access to mechanical disk storage will normally be more
expensive than this default suggests. The value used is lower to
reflect caching effects. Some common random accesses to disk, such as
indexed reads, are considered likely to be in cache. The default value
can be thought of as modeling random access as 40 times as expensive as
sequential, while expecting that 90% of random reads will actually be
cached.

If you believe a high cache rate is an incorrect assumption for your
workload, you might increase random_page_cost to closer reflect the true
cost of random reads against your storage. Correspondingly, if your
data is likely to be completely cached, such as when the database is
smaller than the total memory in the server, decreasing random_page_cost
can be appropriate. Storage where the true cost of random reads is low,
such as solid-state drives and similar memory-based devices, might also
find lower values of random_page_cost better reflect the real-world cost
of that operation.

===

I think of the value as being more like 80 times as expensive and a 95%
hit rate, but the above seems more likely to turn into understandable
math to a first-time reader of this section. I stopped just short of
recommending a value for the completely cached case. I normally use
1.01 there; I know others prefer going fully to 1.0 instead. That
argument seems like it could rage on for some time.
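
Spelling the arithmetic out, under the simplifying assumption that a
cached read costs essentially nothing compared to a physical one:

  40 * (1 - 0.90) = 4.0     (the numbers used in the draft above)
  80 * (1 - 0.95) = 4.0     (the numbers I personally lean toward)

Both collapse to the same shipped default, which is part of why it is
so hard to reverse-engineer this setting from hardware characteristics
alone.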

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-08 00:58:28
Message-ID: 20120208005828.GB17580@momjian.us
Lists: pgsql-hackers

On Tue, Feb 07, 2012 at 05:06:18PM -0500, Greg Smith wrote:
> On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> >Where did you see that there will be an improvement in the 9.2
> >documentation? I don't see an improvement.
>
> I commented that I'm hoping for an improvement in the documentation
> of how much timing overhead impacts attempts to measure this area
> better. That's from the "add timing of buffer I/O requests" feature
> submission. I'm not sure if Bene read too much into that or not; I
> didn't mean to imply that the docs around random_page_cost have
> gotten better.
>
> This particular complaint is extremely common though, seems to pop
> up on one of the lists a few times each year. Your suggested doc
> fix is fine as a quick one, but I think it might be worth expanding
> further on this topic. Something discussing SSDs seems due here
> too. Here's a first draft of a longer discussion, to be inserted
> just after where it states the default value is 4.0:

I was initially concerned that tuning advice in this part of the docs
would look out of place, but now see the 25% shared_buffers
recommendation, and it looks fine, so we are OK. (Should we caution
against more than 8GB of shared buffers? I don't see that in the docs.)

I agree we are overdue for a better explanation of random page cost, so
I agree with your direction. I did a little word-smithing to tighten up
your text; feel free to discard what you don't like:

Random access to mechanical disk storage is normally much more expensive
than four-times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.

If you believe a 90% cache rate is an incorrect assumption
for your workload, you can increase random_page_cost to better
reflect the true cost of random storage reads. Correspondingly,
if your data is likely to be completely in cache, such as when
the database is smaller than the total server memory, decreasing
random_page_cost can be appropriate. Storage that has a low random
read cost relative to sequential, e.g. solid-state drives, might
also be better modeled with a lower value for random_page_cost.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Greg Smith <greg(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-08 09:28:49
Message-ID: 20120208092849.GC12111@ldn-qws-004.delacy.com
Lists: pgsql-hackers

On 07/02/12 19:58, Bruce Momjian wrote:
> On Tue, Feb 07, 2012 at 05:06:18PM -0500, Greg Smith wrote:
> > On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> > >Where did you see that there will be an improvement in the 9.2
> > >documentation? I don't see an improvement.
> >
> > I commented that I'm hoping for an improvement in the documentation
> > of how much timing overhead impacts attempts to measure this area
> > better. That's from the "add timing of buffer I/O requests" feature
> > submission. I'm not sure if Bene read too much into that or not; I
> > didn't mean to imply that the docs around random_page_cost have
> > gotten better.

I guess I did. But I'm very glad that as a side effect Bruce and Greg
have improved it ;-)


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-12 00:49:27
Message-ID: CAMkU=1w0AkPtMXddQi5PWft5-uFWFq0WWiDqgfE6+_0QHpvBtg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 7, 2012 at 2:06 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 02/07/2012 03:23 PM, Bruce Momjian wrote:
>>
>> Where did you see that there will be an improvement in the 9.2
>> documentation?  I don't see an improvement.
>
>
> I commented that I'm hoping for an improvement in the documentation of how
> much timing overhead impacts attempts to measure this area better.  That's
> from the "add timing of buffer I/O requests" feature submission.  I'm not
> sure if Bene read too much into that or not; I didn't mean to imply that the
> docs around random_page_cost have gotten better.
>
> This particular complaint is extremely common though, seems to pop up on one
> of the lists a few times each year.  Your suggested doc fix is fine as a
> quick one, but I think it might be worth expanding further on this topic.
>  Something discussing SSDs seems due here too.  Here's a first draft of a
> longer discussion, to be inserted just after where it states the default
> value is 4.0:
>
> True random access to mechanical disk storage will normally be more
> expensive than this default suggests.  The value used is lower to reflect
> caching effects.  Some common random accesses to disk, such as indexed
> reads, are considered likely to be in cache.  The default value can be
> thought of as modeling random access as 40 times as expensive as sequential,
> while expecting that 90% of random reads will actually be cached.

For these numbers to work out to 4, we must also be assuming that
virtually zero of the sequentially read pages are cached. Is that a
realistic assumption? If the table is accessed only by seq scans, the
ring buffer may prevent it from getting cached (although even then it
could very well be in the OS cache, as that doesn't respect the ring
buffer), but it would be pretty common for other parts of the
application to access the same table via index scan, and so cause
substantial parts of it to be cached.
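
To put some made-up numbers on it: keep the 40x raw ratio and the 90%
cache hit rate for random reads, but suppose half of the sequentially
read pages are also served from cache. The effective ratio then works
out to 40 * (1 - 0.90) / (1 - 0.50) = 8 rather than 4, so the
interpretation in the draft only holds if sequential reads are almost
never cached.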

But I can see that that would rapidly get too complicated to discuss
in the documentation.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-12 00:53:25
Message-ID: CAMkU=1yDvRJ-C1bdcdpV8dXqWKn4KeeX3tZXTaOttB8q6kHH+A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 7, 2012 at 4:58 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> I was initially concerned that tuning advice in this part of the docs
> would look out of place, but now see the 25% shared_buffers
> recommendation, and it looks fine, so we are OK.  (Should we caution
> against more than 8GB of shared buffers?  I don't see that in the docs.)

Has it ever been well-characterized what the problem is with >8GB?
I've used shared buffers above that size for testing purposes and
could never provoke a problem with it.

Cheers,

Jeff


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-13 02:04:28
Message-ID: 4F386FAC.1090003@2ndQuadrant.com
Lists: pgsql-hackers

On 02/11/2012 07:53 PM, Jeff Janes wrote:
> Has it ever been well-characterized what the problem is with>8GB?
> I've used shared buffers above that size for testing purposes and
> could never provoke a problem with it.

If anyone ever manages to characterize it well, we might actually make
progress on isolating and fixing it. All we have so far are a couple of
application level test results suggesting a higher value caused
performance to drop. The first public one I remember was from Jignesh;
http://archives.postgresql.org/pgsql-performance/2008-02/msg00184.php
quotes him on where he found the Solaris roll-off to be. What
we really need to stomp this one down is for someone to find the same
thing and then show profiler output in each case. Note that Jignesh's
report included a significant amount of filesystem level tuning, using
things like more direct I/O, and that might be a necessary requirement
to run into the exact variant of this limitation he mentioned.

I haven't spent a lot of time looking for this problem myself. What
I've heard second-hand from more than one person now is a) larger
settings than 8GB can be an improvement for some people still, and b)
simple benchmarks don't always have this problem. I have noted that the
few public and private reports I've gotten all suggest problems show up
on benchmarks of more complicated workloads. I think Jignesh mentioned
this being obvious in the more complicated TPC-derived benchmarks, not
in simple things like pgbench. I may be misquoting him though. And
given that one of the possible causes for this was an excess of some
lock contention, it's quite possible this one is already gone from 9.2,
given the large number of lock related issues that have been squashed so
far in this release.

All of those disclaimers are why I think no one has pushed to put a note
about this in the official docs. Right now the only suggested limit is
this one:

"The useful range for shared_buffers on Windows systems is generally
from 64MB to 512MB."

The most common practical limit I've run into with large shared_buffers
settings hits earlier than 8GB: running into checkpoint spike issues.
I have installs that started with shared_buffers in the 4 to 8GB range,
where we saw badly spiking I/O at checkpoint sync time. Lowering the
database's cache can result in smarter writing decisions within the OS,
improving latency--even though total writes are actually higher if you
measure what flows from the database to the OS. That side of the latency
vs. throughput trade-off existing is one of the main reasons I haven't
gone chasing after problems with really large shared_buffers settings.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: random_page_cost vs seq_page_cost
Date: 2012-02-14 21:52:33
Message-ID: 20120214215233.GA3424@momjian.us
Lists: pgsql-hackers

On Tue, Feb 07, 2012 at 07:58:28PM -0500, Bruce Momjian wrote:
> I was initially concerned that tuning advice in this part of the docs
> would look out of place, but now see the 25% shared_buffers
> recommendation, and it looks fine, so we are OK. (Should we caution
> against more than 8GB of shared buffers? I don't see that in the docs.)
>
> I agree we are overdue for a better explanation of random page cost, so
> I agree with your direction. I did a little word-smithing to tighten up
> your text; feel free to discard what you don't like:
>
> Random access to mechanical disk storage is normally much more expensive
> than four-times sequential access. However, a lower default is used
> (4.0) because the majority of random accesses to disk, such as indexed
> reads, are assumed to be in cache. The default value can be thought of
> as modeling random access as 40 times slower than sequential, while
> expecting 90% of random reads to be cached.
>
> If you believe a 90% cache rate is an incorrect assumption
> for your workload, you can increase random_page_cost to better
> reflect the true cost of random storage reads. Correspondingly,
> if your data is likely to be completely in cache, such as when
> the database is smaller than the total server memory, decreasing
> random_page_cost can be appropriate. Storage that has a low random
> read cost relative to sequential, e.g. solid-state drives, might
> also be better modeled with a lower value for random_page_cost.

Patch applied for random_page_cost docs.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +