Re: [WIP] cache estimates, cache access cost

From: Greg Stark <stark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] cache estimates, cache access cost
Date: 2011-06-19 13:38:21
Message-ID: BANLkTin8rVCrFEEf4OrwzQwLFqEzyQ3Xdg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> 1. ANALYZE happens far too infrequently to believe that any data taken
> at ANALYZE time will still be relevant at execution time.
> 2. Using data gathered by ANALYZE will make plans less stable, and our
> users complain not infrequently about the plan instability we already
> have, therefore we should not add more.
> 3. Even if the data were accurate and did not cause plan instability,
> we have no evidence that using it will improve real-world performance.

I feel like this is all baseless FUD. ANALYZE isn't perfect, but it's
our interface for telling Postgres to gather stats, and we generally
agree that having stats and modelling the system behaviour as
accurately as practical is the right direction. So we need a specific
reason why this particular stat and this bit of modelling are a bad
idea before we dismiss them.

I think the kernel of truth in these concerns is simply that
everything else ANALYZE looks at mutates only on DML. If you load the
same data into two databases and run ANALYZE you'll get (modulo random
sampling) the same stats. And if you never modify it and analyze it
again a week later you'll get the same stats again. So autovacuum can
guess when to run analyze based on the number of DML operations, it
can run it without regard to how busy the system is, and it can hold
off on running it if the data hasn't changed.

In the case of the filesystem buffer cache the cached percentage will
vary over time regardless of whether the data changes. Plain select
queries will change it, even other activity outside the database will
change it. There are a bunch of strategies for mitigating this
problem: we might want to look at the cache situation more frequently,
discount the results we see more aggressively, and possibly
maintain a kind of running average over time.
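The discount-and-average idea could look something like the following
sketch in C. This is purely illustrative: the struct, the function
name, and the alpha weight are assumptions for the example, not
anything in PostgreSQL today.

```c
/*
 * Hypothetical sketch: smooth point-in-time cache observations with an
 * exponential moving average, so a single burst of unrelated activity
 * doesn't swing the planner's cached-fraction estimate. All names and
 * the alpha value are illustrative, not actual PostgreSQL APIs.
 */
typedef struct CacheStats
{
	double		cached_frac;	/* smoothed fraction of table in cache, 0..1 */
	double		alpha;			/* weight given to each new observation, 0..1 */
} CacheStats;

static void
cache_stats_observe(CacheStats *stats, double observed_frac)
{
	/* New sample is discounted; older history decays geometrically. */
	stats->cached_frac = stats->alpha * observed_frac
		+ (1.0 - stats->alpha) * stats->cached_frac;
}
```

With a small alpha, a one-off cache blowout (say, a backup streaming
through the filesystem cache) only nudges the estimate rather than
flipping every plan on the next query.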

There's another problem which I haven't seen mentioned. Because the
access method will affect the cache there's the possibility of
feedback loops. For example, a freshly loaded system prefers
sequential scans for a given table because, with nothing cached, the
seeks of random reads are too expensive... which keeps that table's
blocks out of the cache... which keeps the planner from ever seeing a
cache situation that would justify switching to an index method.
There may be mitigation strategies for this as well, such as keeping
a running average over time and discounting the estimates with
heuristic values.
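One possible damping heuristic is to never let the observed cache
fraction fully dominate costing: blend it with a fixed prior so an
uncached table still gets costed as if it might be partially cached,
leaving the planner a path out of the loop. A hypothetical sketch
(the function name, the prior, and the 0.1 weight are all
assumptions, not a proposed implementation):

```c
/*
 * Hypothetical sketch: blend the observed cache fraction with a fixed
 * prior so the estimate never collapses to "permanently uncached".
 * The prior_weight of 0.1 is an arbitrary illustrative heuristic.
 */
static double
effective_cached_frac(double observed_frac, double prior_frac)
{
	const double prior_weight = 0.1;	/* heuristic damping factor */

	return prior_weight * prior_frac
		+ (1.0 - prior_weight) * observed_frac;
}
```

Even with a fully cold table (observed fraction 0.0) and a prior of
0.5, the planner would cost random access as if 5% of the table were
cached, so an index path can still win once it is cheap enough on
other grounds.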

--
greg
