Re: pgsql: When estimating the selectivity of an inequality "column >

Lists: pgsql-committerspgsql-hackers
From: tgl(at)postgresql(dot)org (Tom Lane)
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: When estimating the selectivity of an inequality "column >
Date: 2010-01-04 02:44:40
Message-ID: 20100104024440.7C621753FB7@cvs.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Log Message:
-----------
When estimating the selectivity of an inequality "column > constant" or
"column < constant", and the comparison value is in the first or last
histogram bin or outside the histogram entirely, try to fetch the actual
column min or max value using an index scan (if there is an index on the
column). If successful, replace the lower or upper histogram bound with
that value before carrying on with the estimate. This limits the
estimation error caused by moving min/max values when the comparison
value is close to the min or max. Per a complaint from Josh Berkus.

It is tempting to consider using this mechanism for mergejoinscansel as well,
but that would inject index fetches into main-line join estimation not just
endpoint cases. I'm refraining from that until we can get a better handle
on the costs of doing this type of lookup.

Modified Files:
--------------
pgsql/src/backend/executor:
nodeHash.c (r1.125 -> r1.126)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/executor/nodeHash.c?r1=1.125&r2=1.126)
pgsql/src/backend/tsearch:
ts_selfuncs.c (r1.6 -> r1.7)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/ts_selfuncs.c?r1=1.6&r2=1.7)
pgsql/src/backend/utils/adt:
selfuncs.c (r1.266 -> r1.267)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/selfuncs.c?r1=1.266&r2=1.267)
pgsql/src/backend/utils/cache:
lsyscache.c (r1.165 -> r1.166)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/cache/lsyscache.c?r1=1.165&r2=1.166)
pgsql/src/include/utils:
lsyscache.h (r1.130 -> r1.131)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/utils/lsyscache.h?r1=1.130&r2=1.131)


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pgsql: When estimating the selectivity of an inequality "column >
Date: 2010-01-04 13:29:52
Message-ID: 407d949e1001040529k9ec48a8vd0b6d54c33001f06@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Jan 4, 2010 at 2:44 AM, Tom Lane <tgl(at)postgresql(dot)org> wrote:
> Log Message:
> -----------
> When estimating the selectivity of an inequality "column > constant" or
> "column < constant", and the comparison value is in the first or last
> histogram bin or outside the histogram entirely, try to fetch the actual
> column min or max value using an index scan (if there is an index on the
> column).  If successful, replace the lower or upper histogram bound with
> that value before carrying on with the estimate.  This limits the
> estimation error caused by moving min/max values when the comparison
> value is close to the min or max.  Per a complaint from Josh Berkus.

So.... This is pretty cool but it's worth discussing the downsides.
This is the second bit of live information the planning process is
peeking at to supplement the statistics (the first is the physical
size of the table). There are two use cases that I see being
negatively impacted by these features, neither of which we support
currently but I expect we'll eventually support and when we do we'll
have to work around these problems. The two problematic use cases I
see are: plan stability and exporting statistics to a another machine
to reproduce behaviour on a test machine.

I'm not sure what approach we'll have to take to work around these
problems. Perhaps we just need a way to disable checking these bits of
information and have a fallback strategy? perhaps we need a hook for
each of these fetches so you can store the size and upper bound of the
table on the production system you're trying to replicate the
behaviour of and a hook which substitutes these values for the dynamic
value? Perhaps the hook could even use a dbilink connection to get the
live production values.

I also wonder if we want to run these lookups on every single planner
invocation. If the table is being rarely updated analyze will never
fire and the stats value will never be updated. I don't think we want
the planner doing write operations to the stats either though. It
would be nice if there was a low-cost mode of operation for ANALYZE
which simulates precisely this update, the planner could send a stats
message saying that autovacuum should schedule one of these operations
sometime even if it doesn't see a need for a full analyze run.

--
greg


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pgsql: When estimating the selectivity of an inequality "column >
Date: 2010-01-04 14:50:21
Message-ID: 20100104145021.GB3778@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Greg Stark wrote:

> I also wonder if we want to run these lookups on every single planner
> invocation. If the table is being rarely updated analyze will never
> fire and the stats value will never be updated. I don't think we want
> the planner doing write operations to the stats either though. It
> would be nice if there was a low-cost mode of operation for ANALYZE
> which simulates precisely this update, the planner could send a stats
> message saying that autovacuum should schedule one of these operations
> sometime even if it doesn't see a need for a full analyze run.

Maybe autovac could run such a cheap ANALYZE frequently on tables with
large number of inserts (but not large enough to trigger a regular
ANALYZE) ... say a fixed number of tuples (not depending on pg_class.reltuples)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pgsql: When estimating the selectivity of an inequality "column >
Date: 2010-01-04 15:50:03
Message-ID: 407d949e1001040750t5fc5b52ctf6f5233b03413617@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On Mon, Jan 4, 2010 at 2:50 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Maybe autovac could run such a cheap ANALYZE frequently on tables with
> large number of inserts (but not large enough to trigger a regular
> ANALYZE) ... say a fixed number of tuples (not depending on pg_class.reltuples)
>

Well that might cut down on the number of plans that need to do it
themselves. But I'm more concerned about a database that *doesn't*
have a frequent number of inserts. Such a database should presumably
trigger a real analyze reasonably quickly.

But consider a database that has one new record inserted per day but
thousands of queries per minute looking up the maximum value in the
table. This change has basically doubled the work that query needs to
do since the planner now needs to do the same lookup that the query
itself was going to do. And autovacuum won't fire for a long long time
against this table.

Admittedly the fact that there is 100% overhead isn't terribly
interesting since it's really a fixed overhead and only 100% if that
query happens to be correspondingly cheap. But it's still annoying to
me that we'll potentially never figure out what the new stats should
be and stop doing the lookup no matter how long the new row sits there
unless some unrelated activity triggers a real analyze.

--
greg