Re: COUNT(*) and index-only scans

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Greg Stark <stark(at)mit(dot)edu>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: COUNT(*) and index-only scans
Date: 2011-10-12 14:18:16
Message-ID: CA+TgmoY70NrR==uf4evussiPn+EWKaHTJVUDZR7bB+DesxnkvA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 12, 2011 at 9:52 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> What bothers me considerably more is the issue about how specific
> queries might see an all-visible fraction that's very substantially
> different from the table's overall ratio, especially in examples such as
> historical-data tables where most of the update and query activity has
> to do with recently-added rows.  I don't see any practical way to attack
> that problem with statistics; we're just going to have to adopt some
> estimation rule.  What I suggest as a first cut for that is: simply
> derate the visibility fraction as the fraction of the table expected to
> be scanned gets smaller.  That is, if the query fetches nearly all of
> the table, take the stored visibility ratio at face value; if it fetches
> only one block, never believe that that will be an all-visible block;
> and in general if we're expecting to read a fraction f of the pages,
> multiply the whole-table visibility ratio by f before using it in the
> cost estimate.  This amounts to assuming that the historical-data case
> is the usual case, but I'm not sure that's unfair.

I don't think that's an unfair assumption -- in fact I think it's
exactly the right assumption -- but I'm worried about how the math
works out with that specific proposal.

- Suppose VACUUM processes the table and makes it all-visible. Then,
somebody comes along and updates one tuple on every page, making them
all not-all-visible, but not trigger VACUUM because we're nowhere
close the 20% threshold. Now COUNT(*) will think it should use an
index-scan, but really... not so much. In fact, even if it's only
that a tuple has been updated on 25% of the pages, we're probably in
trouble.

- Suppose the table has a million rows and we're going to read 100 of
them, or 0.01%. Now it might appear that a covering index has a
negligible advantage over a non-covering index, but in fact I think we
still want to err on the side of trying to use the covering index. In
fact, even if we're only reading a single row, we probably still
generally want to pick up the covering index, to cater to the case
where someone is doing primary key fetches against a gigantic table
and hoping that index-only scans will save them from random I/O hell.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-10-12 14:29:39 Re: COUNT(*) and index-only scans
Previous Message Greg Stark 2011-10-12 14:16:54 Re: COUNT(*) and index-only scans