Re: COUNT(*) and index-only scans

From: Garick Hamlin <ghamlin(at)isc(dot)upenn(dot)edu>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COUNT(*) and index-only scans
Date: 2011-10-12 18:33:46
Message-ID: 20111012183346.GA17896@isc.upenn.edu
Lists: pgsql-hackers

On Wed, Oct 12, 2011 at 03:16:54PM +0100, Greg Stark wrote:
> On Wed, Oct 12, 2011 at 2:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > I think it's overkill, and possibly unpleasantly unstable as well.
> > For the initial attack on this we should just have VACUUM and ANALYZE
> > count the number of all-visible blocks and store that in pg_class along
> > with the tuple-count statistics.  There's no reason to think that this
> > will be any worse than the way we deal with dead tuple counts, for
> > instance.
>
> So I have a theory.
>
> Assuming you're in a steady-state situation the amount of all-visible
> blocks will fluctuate from a high just after vacuum to a low just
> before the next vacuum. There are other ways a block can be marked
> all-visible but for the most part I would expect the fraction to go
> steadily down until vacuum comes along and cleans things up.
>
> So if vacuum tracked the fraction of blocks marked all-visible
> *before* it processed them and the fraction it marked all-visible
> after processing we have an upper and lower bound. If we knew how long
> it's been since vacuum we could interpolate between those, or we could
> just take the mean, or we could take the lower bound as a conservative
> estimate.
>
> > What I suggest as a first cut for that is: simply derate the
> > visibility fraction as the fraction of the table expected to be
> > scanned gets smaller.
>
> I think there's a statistically more rigorous way of accomplishing the
> same thing. If you treat the pages we estimate we're going to read as
> a random sample of the population of pages then your expected value is
> the fraction of the overall population that is all-visible but your
> 95th percentile confidence interval will be, uh, a simple formula we
> can compute but I don't recall off-hand.
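
On the interpolation idea: just to spell it out, I imagine
something like the sketch below -- not real code, and the names
and the linear-decay assumption are mine:

/*
 * frac_after:  fraction of pages all-visible right after the last vacuum
 * frac_before: fraction observed just before that vacuum ran
 * elapsed:     time since the last vacuum finished
 * interval:    expected time between vacuums of this table
 *
 * Assumes the all-visible fraction decays roughly linearly from
 * frac_after back down toward frac_before between vacuums.
 */
static double
interpolated_allvisible_frac(double frac_after, double frac_before,
                             double elapsed, double interval)
{
    double      progress;

    if (interval <= 0)
        return frac_before;     /* be conservative */

    progress = elapsed / interval;
    if (progress > 1.0)
        progress = 1.0;

    return frac_after - progress * (frac_after - frac_before);
}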
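
And for the 95th-percentile formula, I assume you mean the usual
normal approximation for a sample proportion, with a finite
population correction since the pages are drawn without
replacement?  Roughly (again just a sketch, names invented):

#include <math.h>

/*
 * p_allvis:      all-visible fraction from the last vacuum/analyze
 * pages_to_read: number of heap pages the plan expects to touch
 * total_pages:   total heap pages in the relation
 *
 * Returns a conservative (roughly 95th percentile) lower bound on
 * the all-visible fraction among the pages actually read, treating
 * them as a random sample drawn without replacement.
 */
static double
allvisible_lower_bound(double p_allvis, double pages_to_read,
                       double total_pages)
{
    double      std_err;

    if (pages_to_read <= 0 || total_pages <= 1)
        return p_allvis;

    std_err = sqrt(p_allvis * (1.0 - p_allvis) / pages_to_read) *
              sqrt((total_pages - pages_to_read) / (total_pages - 1.0));

    /* one-sided 95% bound: about 1.645 standard errors below the mean */
    return fmax(0.0, p_allvis - 1.645 * std_err);
}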

Incidentally, I had a similar idea at PGCon relating to planning...

My idea was to compute not just the cost but also the
sensitivity of the cost estimate for each plan. The
sensitivity is the derivative of the cost: if the total cost
were n^2, the sensitivity would be 2n.

If you picked a tolerance (say, 2 standard deviations of the
observed distribution), you could compare the expected cost
and the expected 'unlucky cost' of each plan.

(Basically, this is parametric error propagation)
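
In C-ish terms, roughly this -- nothing here is real planner
code, all the names are made up:

#include <math.h>

typedef struct CostWithSensitivity
{
    double      cost;           /* f(n): expected total cost        */
    double      sensitivity;    /* f'(n): d(cost)/d(estimated rows) */
} CostWithSensitivity;

/* example: a plan whose cost grows like n^2 */
static CostWithSensitivity
quadratic_plan_cost(double nrows)
{
    CostWithSensitivity c;

    c.cost = nrows * nrows;
    c.sensitivity = 2.0 * nrows;    /* derivative of n^2 */
    return c;
}

/*
 * First-order error propagation: if the row estimate has standard
 * deviation sigma_rows, the cost's standard deviation is roughly
 * |f'(n)| * sigma_rows, and the "unlucky" cost is the expected
 * cost plus two of those standard deviations.
 */
static double
unlucky_cost(CostWithSensitivity c, double sigma_rows)
{
    return c.cost + 2.0 * fabs(c.sensitivity) * sigma_rows;
}

A plan that looks slightly worse on expected cost but much better
on unlucky cost might be the safer choice.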

I know very little about the planner...
I don't know how often it would lead to picking a better
plan (it might not be worth the cost to compute), but it
seemed like an interesting approach to me.

Garick
