Re: Statistics and selectivity estimation for ranges

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2013-03-12 11:39:42
Message-ID: 513F13FE.3020509@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 01.03.2013 16:22, Alexander Korotkov wrote:
> These changes were made in attached patch.

Thanks.

I've been staring at this code for a very long time now, trying to
understand how the math in calc_hist_selectivity_contained works. I
think I understand it now, but it probably needs a lot more comments and
perhaps some refactoring, so that the next reader won't need to spend
hours deciphering it.

I'll walk through an example of a calc_hist_selectivity_contained
invocation, to verify that my understanding is correct. This isn't 100%
identical to how the function works; I explain it as if it holds some
temporary bins in memory and modifies them in steps, but in reality, it
keeps the bin distances and some other information only for the
current/previous bin it's processing, in local variables.

Assume that the query is "col <@ int4range(15, 50)", and the lower
bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram
looks like this:

Boundary 10 20 40 100 120
-+------+------+------+------+-
Fraction 0.25 0.25 0.25 0.25

Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same
proportion, 25%, of all the tuples in the table. The function first
finds the bins containing the lower and upper bounds, 15 and 55. All the
bins outside those bounds are ignored, as there are no matching tuples
there. The fractions and the bounds of first and last bin, ie. those
containing the lower and upper bounds, are adjusted according to the
boundary values, using linear interpolation. The lower bound, 15, falls
in the middle of the bin 10-20, and the upper bound, 55, splits the
40-100 bin at ratio of 1/5. The adjusted bins look like this:

Boundary 15 20 40 55
-+------+------+------+
Fraction 0.125 0.25 0.05

Next, we need to calculate what proportion of tuples in each bin has a
small enough length to be contained withing the query range. For that,
the distance of each bin boundary to the upper bound is calculated:

Distance 40 35 15 0
-+------+------+------+
Fraction 0.125 0.25 0.05

The bins are walked starting from the highest bin, ie. starting from
distance 0, walking up in increasing order of distance. For each bin,
the proportion of tuples within that range that have a suitable length
is calculated. The calc_length_hist_frac function does that. That
calculation is more complicated than it sounds: for example, for the
middle bin above, calc_length_hist_frac is passed both distance
boundaries, 15 and 35. calc_length_hist frac calculates the average of
P(x), when x slides linearly from 15 to 35, where P(x) is the fraction
of tuples with length <= x.

Now, here's a question, on something I didn't quite understand:

> * Returns average fraction of histogram which is greater than given length.
> * While this length is increasing from length1 to *length2. If histogram
> * ends up before *length2 then set covered fraction of (length1, *length2)
> * interval to *fraction and set end of histogram to *length2.
> */
> static double
> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
> double length1, double *length2, double *fraction)
> {

Why the behavior explained in the last sentence in the above comment? It
seems that the abstraction provided by calc_length_hist_frac() is leaky;
the caller shouldn't need to know anything about the boundaries of the
length bins.

Ignoring that, I believe that calc_length_hist_frac can also be
explained like this:

> /*
> * Let P(x) be the fraction of tuples with length <= x.
> *
> * calc_length_hist_frac calculates the average of P(x), in the interval [A, B].
> *
> * This can be calculated by the formula:
> *
> * B
> * 1 /
> * ------- | P(x)dx
> * B - A /
> * A
> */
> static double
> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
> double A, double B)

Am I correct this far? The function doesn't use the above formula as is,
but it could..

I'll continue trying to understand this and add comments..

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-03-12 12:38:29 Re: transforms
Previous Message Albe Laurenz 2013-03-12 09:27:02 Re: Column defaults for foreign tables (was Re: [v9.3] writable foreign tables)