Idea about estimating selectivity for single-column expressions

From: Tom Lane <tgl@sss.pgh.pa.us>
To: pgsql-hackers@postgresql.org
Subject: Idea about estimating selectivity for single-column expressions
Date: 2009-08-19 02:53:34
Message-ID: 10499.1250650414@sss.pgh.pa.us
Lists: pgsql-hackers

There were two different complaints today about the planner's inability
to make good estimates for WHERE conditions involving bit AND tests,
such as
    (field & 8) = 0
    (field & 8) <> 0
    (field & 127) = field

I'm not particularly eager to try to put in special-case knowledge
for this sort of thing, but I had an idea that might cover a wide
enough variety of cases to be worth the trouble. It's this: if we
have a WHERE clause that is an immutable (or maybe even just stable)
function of a single table column, and we don't have any special-purpose
estimator that works for it, we could estimate the selectivity by
actually evaluating the clause for each MCV and histogram entry for
the column, then combining the results with appropriate weightings.
We do something comparable already for "field LIKE pattern" expressions,
but this would generalize it to any expression. With the larger
histogram sizes that are the default as of 8.4, I think this could be
expected to provide a pretty good estimate most of the time ---
certainly far better than the hardwired default estimates are.

In the long run this might be workable for multi-column expressions
too, but not before we have multi-column statistics so we know which
combinations of values to try.

I can see a couple of possible downsides:

* If the expression is expensive to evaluate, we could waste a lot
of time making the estimate. This would only be a serious issue
if the query were designed to avoid evaluating the WHERE clause
very often, but that's certainly possible. We could perhaps check
the estimated cost of the expression and not do this if it's too
high ... but expression cost estimates are generally pretty bogus.

* The expression might throw an error for some inputs, for instance
    (1 / field) < 0.5
which would fail on zero. We could recover by wrapping the whole
estimation process in a subtransaction (a toy sketch of that kind of
recovery appears below), but that seems really expensive.
I thought about arguing that the failure would happen anyway at runtime,
but that doesn't hold water --- for example, the user might have just
deleted all the rows with field = 0, and would have grounds to complain
if the query failed; but there would still be an entry for zero in the
histogram.
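
Here is a toy of that failure, with setjmp/longjmp standing in for
whatever recovery machinery we would actually need; again, this is not
planner code, and all the values are invented.

/*
 * Evaluating (1 / field) < 0.5 over stats entries, surviving the
 * entry that "errors out".  longjmp() plays the role of elog(ERROR)
 * plus subtransaction rollback.
 */
#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>

static jmp_buf eval_error;

/* The clause (1 / field) < 0.5, failing on field = 0. */
static bool
clause(int field)
{
    if (field == 0)
        longjmp(eval_error, 1);
    return (1.0 / field) < 0.5;
}

int
main(void)
{
    /*
     * Even if every row with field = 0 was just deleted, the stats can
     * still contain an entry for 0 until the next ANALYZE.
     */
    int nhist = 5;
    int hist_vals[] = {0, 1, 3, 7, 20};

    /* volatile: modified between setjmp() and a possible longjmp() */
    volatile int i;
    volatile int hits = 0;

    for (i = 0; i < nhist; i++)
    {
        if (setjmp(eval_error) == 0)
        {
            if (clause(hist_vals[i]))
                hits++;
        }
        else
            printf("entry %d: clause failed, skipped\n", hist_vals[i]);
    }
    printf("matching fraction: %d/%d\n", hits, nhist);
    return 0;
}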

Comments?

regards, tom lane
