Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Statistics and selectivity estimation for ranges
Date: 2013-03-01 14:22:18
Message-ID: CAPpHfdt92Fh5Jbovn-EXM_7Yv4Mhr6CvaVo=37jE85h1G2RGkA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 13, 2013 at 5:55 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Wed, Feb 13, 2013 at 5:28 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> On 04.01.2013 10:42, Alexander Korotkov wrote:
>>
>>> /*
>>> * Calculate selectivity of "&&" operator using histograms of range
>>> lower bounds
>>> * and histogram of range lengths.
>>> */
>>> static double
>>> calc_hist_selectivity_overlap(**TypeCacheEntry *typcache, RangeBound
>>> *lower,
>>> RangeBound *upper, RangeBound
>>> *hist_lower, int hist_nvalues,
>>> Datum
>>> *length_hist_values, int length_hist_nvalues)
>>>
>>
>> We already have code to estimate &&, based on the lower and upper bound
>> histograms:
>>
>> case OID_RANGE_OVERLAP_OP:
>>> case OID_RANGE_CONTAINS_ELEM_OP:
>>> /*
>>> * A && B <=> NOT (A << B OR A >> B).
>>> *
>>> * "range @> elem" is equivalent to "range &&
>>> [elem,elem]". The
>>> * caller already constructed the singular range
>>> from the element
>>> * constant, so just treat it the same as &&.
>>> */
>>> hist_selec =
>>> calc_hist_selectivity_scalar(**typcache,
>>> &const_lower, hist_upper,
>>>
>>> nhist, false);
>>> hist_selec +=
>>> (1.0 - calc_hist_selectivity_scalar(**typcache,
>>> &const_upper, hist_lower,
>>>
>>> nhist, true));
>>> hist_selec = 1.0 - hist_selec;
>>> break;
>>>
>>
>> I don't think the method based on lower bound and length histograms is
>> any better. In fact, my gut feeling is that it's less accurate. I'd suggest
>> dropping that part of the patch.
>>
>
> Right. This estimation has an accuracy of histogram, while estimation
> based on lower bound and length histograms rely on additional assumption
> about independence of lower bound and length histogram. We can sum A << B
> and A >> B probabilities because they are mutually exclusive. It's pretty
> evident but I would like to mention it in the comments, because typical
> assumption about events in statistics calculation is their independence.
>

These changes were made in attached patch.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
range_stat-0.11.patch.gz application/x-gzip 8.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ants Aasma 2013-03-01 15:01:27 Re: Materialized views WIP patch
Previous Message Kevin Grittner 2013-03-01 14:18:33 Re: Materialized views WIP patch