Quick Links

Re: Statistics and selectivity estimation for ranges

Lists:	pgsql-hackers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Statistics and selectivity estimation for ranges
Date:	2012-08-04 09:31:07
Message-ID:	CAPpHfdtbdtM1_rhta+2cNdB4Pj7MH8EATQZHDPzK3s64qS61eg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hackers,

attached patch is for collecting statistics and selectivity estimation for
ranges.

In order to make our estimations accurate for every distribution of
ranges, we would collect 2d-distribution of lower and upper bounds of range
into some kind of 2d-histogram. However, this patch use some simplification
and assume distribution of lower bound and distribution of length to be
independent. We can get distribution of lower bound from standard scalar
statistics and thi patch additionally collect statistics for range length.
This patch includes selectivity estimations for "&&", "@>" and "<@"
operators on ranges. Linear interpolation is used in order to get more
accurate results.

Some examples with test dataset where left bound of range is distributed by
gaussian distribution and length of range is distributed by exponential
distribution.

test=# CREATE TABLE range_test as (SELECT int4range(lb, lb + len) AS r FROM
(SELECT (sqrt(-2*ln(random())) * sin(2*pi()*random()) * 1000000)::int as
lb, (-10000*ln(1.0 - random()) + 1)::int as len FROM
generate_series(1,1000000)) x);
SELECT 1000000

test=# ANALYZE range_test;
ANALYZE

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(700000,710000);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=7535 width=14) (actual
time=0.119..403.494 rows=6138 loops=1)
Filter: (r && '[700000,710000)'::int4range)
Rows Removed by Filter: 993862
Total runtime: 403.945 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r &&
int4range(200000,300000);
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*42427* width=14)
(actual time=0.100..401.079 rows=*42725* loops=1)
Filter: (r && '[200000,300000)'::int4range)
Rows Removed by Filter: 957275
Total runtime: 403.055 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(100000,150000);
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*15341* width=14)
(actual time=0.129..382.114 rows=*16014* loops=1)
Filter: (r <@ '[100000,150000)'::int4range)
Rows Removed by Filter: 983986
Total runtime: 382.985 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@
int4range(600000,603000);
QUERY PLAN

---------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*122* width=14) (actual
time=1.527..383.511 rows=*127* loops=1)
Filter: (r <@ '[600000,603000)'::int4range)
Rows Removed by Filter: 999873
Total runtime: 383.586 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(100000,100400);
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*5166* width=14) (actual
time=0.238..377.712 rows=*3909* loops=1)
Filter: (r @> '[100000,100400)'::int4range)
Rows Removed by Filter: 996091
Total runtime: 378.018 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @>
int4range(500000,530000);
QUERY PLAN

----------------------------------------------------------------------------------------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=*342* width=14) (actual
time=11.796..382.986 rows=*171* loops=1)
Filter: (r @> '[500000,530000)'::int4range)
Rows Removed by Filter: 999829
Total runtime: 383.066 ms
(4 rows)

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.2.patch.gz	application/x-gzip	9.7 KB

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-06 14:09:09
Message-ID:	501FD005.3020808@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04.08.2012 12:31, Alexander Korotkov wrote:
> Hackers,
>
> attached patch is for collecting statistics and selectivity estimation for
> ranges.
>
> In order to make our estimations accurate for every distribution of
> ranges, we would collect 2d-distribution of lower and upper bounds of range
> into some kind of 2d-histogram. However, this patch use some simplification
> and assume distribution of lower bound and distribution of length to be
> independent.

Sounds reasonable. Another possibility would be to calculate the average
length for each lower-bound bin. So you would e.g know the average
length of values with lower bound between 1-10, and the average length
of values with lower bound between 10-20, and so forth. Within a bin,
you would have to assume that the distribution of the lengths is fixed.

PS. get_position() should guard against division by zero, when subdiff
returns zero.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-07 04:25:45
Message-ID:	CAPpHfds=0729kHnS0EGT3ydte1uBOxczCCVys--nfyqaH+DALA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Aug 6, 2012 at 6:09 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 04.08.2012 12:31, Alexander Korotkov wrote:
>
>> Hackers,
>>
>> attached patch is for collecting statistics and selectivity estimation for
>> ranges.
>>
>> In order to make our estimations accurate for every distribution of
>> ranges, we would collect 2d-distribution of lower and upper bounds of
>> range
>> into some kind of 2d-histogram. However, this patch use some
>> simplification
>> and assume distribution of lower bound and distribution of length to be
>> independent.
>>
>
> Sounds reasonable. Another possibility would be to calculate the average
> length for each lower-bound bin. So you would e.g know the average length
> of values with lower bound between 1-10, and the average length of values
> with lower bound between 10-20, and so forth. Within a bin, you would have
> to assume that the distribution of the lengths is fixed.
>

Interesting idea. AFAICS, if we store average length for each lower-bound
bin, we still have to assume some kind of distribution of range length in
order to do estimates. For example, assume that range length have
exponential distribution. Correspondingly, we've following trade off: we
don't have to assume lower bound distribution to be independent from length
distribution, but we have to assume kind of length distribution. Actually,
I don't know what is better.
Ideally, we would have range length histogram for each lower-bound bin, or
upper-bound histogram for each lower-bound bin. But, storing such amount of
data seems too expensive.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-08 20:44:52
Message-ID:	CAPpHfdsaGJ_u+1Bw0SsQsF4O=b90sMqQw0eDpbcQi7vTqwvw1g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

For testing statistics accuracy I've used same datasets as for testing
opclasses performance:
http://archives.postgresql.org/pgsql-hackers/2012-07/msg00414.php
Script for testing and database schema is attached.
Dump with tests results can be downloaded here:
http://test.i-gene.ru/uploads/range_stat_tests.sql.gz

Following table shows statistics of accuracy when actual count of rows is
somewhat large (>=10). Second column shows average ratio of estimate count
of rows to actual count of rows. Third column shows average relative error
of estimation.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.27166784340153 | 0.498570654434906
@> | 1.35965412121763 | 0.384991198200582
&& | 1.08236985243139 | 0.105298599354035
(3 rows)

When result set is small (1-9 rows) then errors are more significant.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.51371646596783 | 2.85624536756285
@> | 3.85482923324034 | 2.91433432363562
&& | 3.14281204906205 | 2.28899260461761
(3 rows)

Following table presents average estimate count of rows when actual count
of rows is 0. This value is quite high for && operator, but it comes from
only one tests, so it's not really representative.

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+---------------------+-------------
<@ | 1.1259887005649718 | 1770
@> | 1.0598670878194025 | 88329
&& | 28.0000000000000000 | 1
(3 rows)

Same tables for statistics target = 1000.

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count >= 10 group by operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.17132962269887 | 0.394427785424827
@> | 1.35677772347908 | 0.376171286348914
&& | 1.06762781136499 | 0.0874012522386387
(3 rows)

range_test=# select operator,
avg(estimate_count::float8/actual_count::float8) as avg_ratio,
avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as
avg_error from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count between 1 and 9 group by operator;
operator | avg_ratio | avg_error
----------+------------------+------------------
<@ | 3.30836881177966 | 2.64459517657192
@> | 3.47535917820028 | 2.55199556747496
&& | 2.49181718664477 | 1.49181718664477
(3 rows)

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*)
as tests_count from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1650879566982409 | 739
@> | 1.0511811463771843 | 89447
(2 rows)

My conclusion is so, that current errors are probably ok for selectivity
estimation. But taking into attention that generated datasets ideally fits
assumptions of estimation, there could be room for improvement. Especially,
it's unclear why estimate for "<@" and "@>" have much greater error than
estimate for "&&". Possibly, it's caused by some bugs.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat.php	application/x-httpd-php	6.7 KB
range_stat_schema.sql	application/octet-stream	105.2 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-09 15:35:50
Message-ID:	CAPpHfdscMxmKp3qExZ7FJxEzMgv6hr0zx6w_CBEq=D4xukMr5w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

New revision of patch with two fixes:
1) Check if histogram bin width is zero in get_position.
2) Check statsTuple is valid tuple in rangecontsel.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.3.patch.gz	application/x-gzip	10.0 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-12 21:11:51
Message-ID:	CAPpHfdtMUSF64MaRNn+fxMZxmN_BmGftTfUYn3=3MoDo1saDLQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 9, 2012 at 12:44 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> My conclusion is so, that current errors are probably ok for selectivity
> estimation. But taking into attention that generated datasets ideally fits
> assumptions of estimation, there could be room for improvement. Especially,
> it's unclear why estimate for "<@" and "@>" have much greater error than
> estimate for "&&". Possibly, it's caused by some bugs.
>

ITSM, I found reason of inaccuracy. Implementation of linear interpolation
was wrong. Fixed version is attached. Now, need to rerun tests, possible
refactoring and comments rework.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.4.patch.gz	application/x-gzip	10.4 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-14 06:45:58
Message-ID:	CAPpHfdudkYhJkGbhQtbjt6GG5Owp_v7_kpQKA=1Dz5Fb8iGteA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Aug 13, 2012 at 1:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Thu, Aug 9, 2012 at 12:44 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> My conclusion is so, that current errors are probably ok for selectivity
>> estimation. But taking into attention that generated datasets ideally fits
>> assumptions of estimation, there could be room for improvement. Especially,
>> it's unclear why estimate for "<@" and "@>" have much greater error than
>> estimate for "&&". Possibly, it's caused by some bugs.
>>
>
> ITSM, I found reason of inaccuracy. Implementation of linear interpolation
> was wrong. Fixed version is attached. Now, need to rerun tests, possible
> refactoring and comments rework.
>

After fixing few more bugs, I've a version with much more reasonable
accuracy.

Statistics target = 100.

Relatively large result sets (>= 10)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 100 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00404179116863 | 0.0504415454560903
@> | 1.06364108531688 | 0.105646077989812
&& | 1.00757984721409 | 0.0420984234933233
(3 rows)

Small result sets (1 - 9)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 100 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.31530838062865 | 0.654886592410495
@> | 2.78708078320147 | 1.94124123003433
&& | 1.93268112525538 | 1.09904919063335
(3 rows)

Empty result sets

test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 100 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.1437670609645132 | 1099
@> | 1.0479430126460701 | 87458
(2 rows)

Statistics target = 1000.

Relatively large result sets (>= 10)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 1000 and actual_count >= 10 group by
operator;
operator | avg_ratio | avg_error
----------+------------------+--------------------
<@ | 1.00073999445381 | 0.045099762607524
@> | 1.05296320350853 | 0.0907489633452971
&& | 1.00217602359039 | 0.0353421159150165
(3 rows)

Small result sets (1 - 9)

test=# select operator, avg(estimate_count::float8/actual_count::float8) as
avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) -
1.0 as avg_error from datasets d join test_results tr on tr.test_id =
d.idwhere d.stat_target = 1000 and actual_count between 1 and 9 group
by
operator;
operator | avg_ratio | avg_error
----------+------------------+-------------------
<@ | 1.26946358795998 | 0.577803898836364
@> | 2.69000633430211 | 1.83165424646645
&& | 1.48715184186882 | 0.577998652291105
(3 rows)

Empty result sets

test=# select operator, avg(estimate_count) as avg_estimate, count(*) as
tests_count from datasets d join test_results tr on tr.test_id = d.id where
d.stat_target = 1000 and actual_count = 0 group by operator;
operator | avg_estimate | tests_count
----------+--------------------+-------------
<@ | 1.0887096774193548 | 1364
@> | 1.0423876983771183 | 89224
&& | 5.0000000000000000 | 1
(3 rows)

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.6.patch.gz	application/x-gzip	10.4 KB

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-14 15:46:18
Message-ID:	502A72CA.80704@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 14.08.2012 09:45, Alexander Korotkov wrote:
> After fixing few more bugs, I've a version with much more reasonable
> accuracy.

Great! One little thing just occurred to me:

You're relying on the regular scalar selectivity estimators for the <<,
>>, &< and &> operators. That seems bogus, in particular for << and &<,
because ineq_histogram_selectivity then performs a binary search of the
histogram using those operators. << and &< compare the *upper* bound of
the value in table against the lower bound of constant, but the
histogram is constructed using regular < operator, which sorts the
entries by lower bound. I think the estimates you now get for those
operators are quite bogus if there is a non-trivial amount of overlap
between ranges. For example:

postgres=# create table range_test as
select int4range(-a, a) as r from generate_series(1,1000000) a; analyze
range_test;
SELECT 1000000
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <<
int4range(200000, 200001);
QUERY PLAN

--------------------------------------------------------------------------------
-----------------------------------
Seq Scan on range_test (cost=0.00..17906.00 rows=100 width=14)
(actual time=0.
060..1340.147 rows=200000 loops=1)
Filter: (r << '[200000,200001)'::int4range)
Rows Removed by Filter: 800000
Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those
operators, if we had a separate histogram of upper bounds. I also note
that the estimation of overlap selectivity could be implemented using
separate histograms of lower bounds and upper bounds, without requiring
a histogram of range lengths, because a && b == NOT (a <> b).
I'm not sure if the estimates we'd get that way would be better or worse
than your current method, but I think it would be easier to understand.

I don't think the &< and &> operators could be implemented in terms of a
lower and upper bound histogram, though, so you'd still need the current
length histogram method for that.

The code in that traverses the lower bound and length histograms in
lockstep looks quite scary. Any ideas on how to simplify that? My first
thought is that there should be helper function that gets a range length
as argument, and returns the fraction of tuples with length >= argument.
It would do the lookup in the length histogram to find the right
histogram bin, and do the linear interpolation within the bin. You're
assuming that length is independent of lower/upper bound, so you
shouldn't need any other parameters than range length for that estimation.

You could then loop through only the lower bounds, and call the helper
function for each bin to get the fraction of ranges long enough in that
bin, instead dealing with both histograms in the same loop. I think a
helper function like that might simplify those scary loops
significantly, but I wasn't sure if there's some more intelligence in
the way you combine values from the length and lower bound histograms
that you couldn't do with such a helper function.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-15 07:38:16
Message-ID:	CAPpHfdsKfL_ZAxp4H-6Z3MUy2engekFB9DE7_Yrob9X8rXzxKA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 14.08.2012 09:45, Alexander Korotkov wrote:
>
>> After fixing few more bugs, I've a version with much more reasonable
>> accuracy.
>>
>
> Great! One little thing just occurred to me:
>
> You're relying on the regular scalar selectivity estimators for the <<,
> >>, &< and &> operators. That seems bogus, in particular for << and &<,
> because ineq_histogram_selectivity then performs a binary search of the
> histogram using those operators. << and &< compare the *upper* bound of the
> value in table against the lower bound of constant, but the histogram is
> constructed using regular < operator, which sorts the entries by lower
> bound. I think the estimates you now get for those operators are quite
> bogus if there is a non-trivial amount of overlap between ranges. For
> example:
>
> postgres=# create table range_test as
> select int4range(-a, a) as r from generate_series(1,1000000) a; analyze
> range_test;
> SELECT 1000000
> ANALYZE
> postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <<
> int4range(200000, 200001);
> QUERY PLAN
>
> ------------------------------**------------------------------**
> --------------------
> ------------------------------**-----
> Seq Scan on range_test (cost=0.00..17906.00 rows=100 width=14) (actual
> time=0.
> 060..1340.147 rows=200000 loops=1)
> Filter: (r << '[200000,200001)'::int4range)
> Rows Removed by Filter: 800000
> Total runtime: 1371.865 ms
> (4 rows)
>
> It would be quite easy to provide reasonable estimates for those
> operators, if we had a separate histogram of upper bounds. I also note that
> the estimation of overlap selectivity could be implemented using separate
> histograms of lower bounds and upper bounds, without requiring a histogram
> of range lengths, because a && b == NOT (a <> b). I'm not sure if
> the estimates we'd get that way would be better or worse than your current
> method, but I think it would be easier to understand.
>
> I don't think the &< and &> operators could be implemented in terms of a
> lower and upper bound histogram, though, so you'd still need the current
> length histogram method for that.
>

Oh, actually I didn't touch those operators. Selectivity estimation
functions for them were already in the catalog, they didn't work previously
just because no statistics. Histogram of upper bounds would be both more
accurate and natural for some operators. However, it requires collecting
additional statistics while AFAICS it doesn't liberate us from having
histogram of range lengths.

> The code in that traverses the lower bound and length histograms in
> lockstep looks quite scary. Any ideas on how to simplify that? My first
> thought is that there should be helper function that gets a range length as
> argument, and returns the fraction of tuples with length >= argument. It
> would do the lookup in the length histogram to find the right histogram
> bin, and do the linear interpolation within the bin. You're assuming that
> length is independent of lower/upper bound, so you shouldn't need any other
> parameters than range length for that estimation.
>
> You could then loop through only the lower bounds, and call the helper
> function for each bin to get the fraction of ranges long enough in that
> bin, instead dealing with both histograms in the same loop. I think a
> helper function like that might simplify those scary loops significantly,
> but I wasn't sure if there's some more intelligence in the way you combine
> values from the length and lower bound histograms that you couldn't do with
> such a helper function.

Yes, I also thought about something like this. But, in order to save
current estimate accuracy, it should be more complicated in following
reasons:
1) In last version, I don't estimate just fraction of tuples with length >=
argument, but area under length histogram between two length bounds
(length_hist_summ).
2) In histogram ends up before reaching given length bound we also need to
return place where it happened. Now it is performed by hist_frac *= (length
- prev_dist) / (dist - prev_dist).
I'm going to try some simplification with taking care about both mentioned
aspects.

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-15 08:14:46
Message-ID:	502B5A76.7000205@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15.08.2012 10:38, Alexander Korotkov wrote:
> On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas<
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> It would be quite easy to provide reasonable estimates for those
>> operators, if we had a separate histogram of upper bounds. I also note that
>> the estimation of overlap selectivity could be implemented using separate
>> histograms of lower bounds and upper bounds, without requiring a histogram
>> of range lengths, because a&& b == NOT (a<> b). I'm not sure if
>> the estimates we'd get that way would be better or worse than your current
>> method, but I think it would be easier to understand.
>>
>> I don't think the&< and&> operators could be implemented in terms of a
>> lower and upper bound histogram, though, so you'd still need the current
>> length histogram method for that.
>
> Oh, actually I didn't touch those operators. Selectivity estimation
> functions for them were already in the catalog, they didn't work previously
> just because no statistics.

Yeah, without the histogram, the scalar selectivity estimator sort-of
works, in that it returns the estimate just based on the most common
values and a constant.

> Histogram of upper bounds would be both more
> accurate and natural for some operators. However, it requires collecting
> additional statistics while AFAICS it doesn't liberate us from having
> histogram of range lengths.

Hmm, if we collected a histogram of lower bounds and a histogram of
upper bounds, that would be roughly the same amount of data as for the
"standard" histogram with both bounds in the same histogram.

>> The code in that traverses the lower bound and length histograms in
>> lockstep looks quite scary. Any ideas on how to simplify that? My first
>> thought is that there should be helper function that gets a range length as
>> argument, and returns the fraction of tuples with length>= argument. It
>> would do the lookup in the length histogram to find the right histogram
>> bin, and do the linear interpolation within the bin. You're assuming that
>> length is independent of lower/upper bound, so you shouldn't need any other
>> parameters than range length for that estimation.
>>
>> You could then loop through only the lower bounds, and call the helper
>> function for each bin to get the fraction of ranges long enough in that
>> bin, instead dealing with both histograms in the same loop. I think a
>> helper function like that might simplify those scary loops significantly,
>> but I wasn't sure if there's some more intelligence in the way you combine
>> values from the length and lower bound histograms that you couldn't do with
>> such a helper function.
>
> Yes, I also thought about something like this. But, in order to save
> current estimate accuracy, it should be more complicated in following
> reasons:
> 1) In last version, I don't estimate just fraction of tuples with length>=
> argument, but area under length histogram between two length bounds
> (length_hist_summ).
> 2) In histogram ends up before reaching given length bound we also need to
> return place where it happened. Now it is performed by hist_frac *= (length
> - prev_dist) / (dist - prev_dist).
> I'm going to try some simplification with taking care about both mentioned
> aspects.

Thanks.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-15 08:34:07
Message-ID:	CAPpHfdsPCcCERxywysaYoPUjk3OE6dM9Mm_4Yr86OkAJsSfuHA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Aug 15, 2012 at 12:14 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Histogram of upper bounds would be both more
>> accurate and natural for some operators. However, it requires collecting
>> additional statistics while AFAICS it doesn't liberate us from having
>> histogram of range lengths.
>>
>
> Hmm, if we collected a histogram of lower bounds and a histogram of upper
> bounds, that would be roughly the same amount of data as for the "standard"
> histogram with both bounds in the same histogram.

Ok, we've to decide if we need "standard" histogram. In some cases it can
be used for more accurate estimation of < and > operators.
But I think it is not so important. So, we can replace "standard" histogram
with histograms of lower and upper bounds?

------
With best regards,
Alexander Korotkov.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-15 16:33:18
Message-ID:	7576.1345048398@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> Ok, we've to decide if we need "standard" histogram. In some cases it can
> be used for more accurate estimation of < and > operators.
> But I think it is not so important. So, we can replace "standard" histogram
> with histograms of lower and upper bounds?

You should assign a new pg_statistic "kind" value (see pg_statistic.h)
rather than mislabel this as being a standard histogram. However,
there's nothing wrong with a data-type-specific stats collection
function choosing to gather only this type of histogram and not the
standard one.

regards, tom lane

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-16 12:40:45
Message-ID:	502CEA4D.4060704@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15.08.2012 11:34, Alexander Korotkov wrote:
> On Wed, Aug 15, 2012 at 12:14 PM, Heikki Linnakangas<
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> Histogram of upper bounds would be both more
>>> accurate and natural for some operators. However, it requires collecting
>>> additional statistics while AFAICS it doesn't liberate us from having
>>> histogram of range lengths.
>>
>> Hmm, if we collected a histogram of lower bounds and a histogram of upper
>> bounds, that would be roughly the same amount of data as for the "standard"
>> histogram with both bounds in the same histogram.
>
> Ok, we've to decide if we need "standard" histogram. In some cases it can
> be used for more accurate estimation of< and> operators.
> But I think it is not so important. So, we can replace "standard" histogram
> with histograms of lower and upper bounds?

Yeah, I think that makes more sense. The lower bound histogram is still
useful for < and > operators, just not as accurate if there are lots of
values with the same lower bound but different upper bound.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-16 12:41:43
Message-ID:	502CEA87.3030003@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-19 21:31:36
Message-ID:	CAPpHfdta8WyDD1bA7-g6qDmoPOEC6dchiyOcysPfQMm+QfwafQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 15.08.2012 11:34, Alexander Korotkov wrote:
>
>> Ok, we've to decide if we need "standard" histogram. In some cases it can
>> be used for more accurate estimation of< and> operators.
>> But I think it is not so important. So, we can replace "standard"
>> histogram
>> with histograms of lower and upper bounds?
>>
>
> Yeah, I think that makes more sense. The lower bound histogram is still
> useful for < and > operators, just not as accurate if there are lots of
> values with the same lower bound but different upper bound.

New version of patch.
* Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower and
upper bounds histograms combined into single ranges array, instead
of STATISTIC_KIND_HISTOGRAM.
* Selectivity estimations for >, >=, <, <=, <<, >>, &<, &> using this
histogram.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.7.patch.gz	application/x-gzip	12.1 KB

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-21 13:25:02
Message-ID:	50338C2E.5060908@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 20.08.2012 00:31, Alexander Korotkov wrote:
> On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas<
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> On 15.08.2012 11:34, Alexander Korotkov wrote:
>>
>>> Ok, we've to decide if we need "standard" histogram. In some cases it can
>>> be used for more accurate estimation of< and> operators.
>>> But I think it is not so important. So, we can replace "standard"
>>> histogram
>>> with histograms of lower and upper bounds?
>>
>> Yeah, I think that makes more sense. The lower bound histogram is still
>> useful for< and> operators, just not as accurate if there are lots of
>> values with the same lower bound but different upper bound.
>
> New version of patch.
> * Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower and
> upper bounds histograms combined into single ranges array, instead
> of STATISTIC_KIND_HISTOGRAM.

Ah, that's an interesting approach. So essentially, the histogram looks
just like a normal STATISTIC_KIND_HISTOGRAM histogram, but the values
stored in it are not picked the usual way. The usual way would be to
pick N evenly-spaced values from the column, and store those. Instead,
you pick N evenly-spaced lower bounds, and N evenly-spaced upper bounds,
and construct N range values from those. Looking at a single value in
the histogram, its lower bound comes from a different row than its upper
bound.

That's pretty clever - the histogram has a shape and order that's
compatible with a histogram you'd get with the standard scalar
typanalyze function. In fact, I think you could just let the standard
scalar estimators for < and > to use that histogram as is. Perhaps we
should use STATISTIC_KIND_HISTOGRAM for this after all...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-24 15:51:20
Message-ID:	5037A2F8.4080808@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 20.08.2012 00:31, Alexander Korotkov wrote:
> On Thu, Aug 16, 2012 at 4:40 PM, Heikki Linnakangas<
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> On 15.08.2012 11:34, Alexander Korotkov wrote:
>>
>>> Ok, we've to decide if we need "standard" histogram. In some cases it can
>>> be used for more accurate estimation of< and> operators.
>>> But I think it is not so important. So, we can replace "standard"
>>> histogram
>>> with histograms of lower and upper bounds?
>>>
>>
>> Yeah, I think that makes more sense. The lower bound histogram is still
>> useful for< and> operators, just not as accurate if there are lots of
>> values with the same lower bound but different upper bound.
>
>
> New version of patch.
> * Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower and
> upper bounds histograms combined into single ranges array, instead
> of STATISTIC_KIND_HISTOGRAM.

One worry I have about that format for the histogram is that you
deserialize all the values in the histogram, before you do the binary
searches. That seems expensive if stats target is very high. I guess you
could deserialize them lazily to alleviate that, though.

> * Selectivity estimations for>,>=,<,<=,<<,>>,&<,&> using this
> histogram.

Thanks!

I'm going to do the same for this that I did for the sp-gist patch, and
punt on the more complicated parts for now, and review them separately.
Attached is a heavily edited version that doesn't include the length
histogram, and consequently doesn't do anything smart for the &< and &>
operators. && is estimated using the bounds histograms. There's now a
separate stakind for the empty range fraction, since it's not included
in the length-histogram.

I tested this on a dataset containing birth and death dates of persons
that have a wikipedia page, obtained from the dbpedia.org project. I can
send a copy if someone wants it. The estimates seem pretty accurate.

Please take a look, to see if I messed up something.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
range_stat-0.7-trimmed-and-edited.patch	text/x-diff	31.0 KB

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-27 13:00:55
Message-ID:	503B6F87.8080509@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 24.08.2012 18:51, Heikki Linnakangas wrote:
> On 20.08.2012 00:31, Alexander Korotkov wrote:
>> New version of patch.
>> * Collect new stakind STATISTIC_KIND_BOUNDS_HISTOGRAM, which is lower and
>> upper bounds histograms combined into single ranges array, instead
>> of STATISTIC_KIND_HISTOGRAM.
>
> One worry I have about that format for the histogram is that you
> deserialize all the values in the histogram, before you do the binary
> searches. That seems expensive if stats target is very high. I guess you
> could deserialize them lazily to alleviate that, though.
>
>> * Selectivity estimations for>,>=,<,<=,<<,>>,&<,&> using this
>> histogram.
>
> Thanks!
>
> I'm going to do the same for this that I did for the sp-gist patch, and
> punt on the more complicated parts for now, and review them separately.
> Attached is a heavily edited version that doesn't include the length
> histogram, and consequently doesn't do anything smart for the &< and &>
> operators. && is estimated using the bounds histograms. There's now a
> separate stakind for the empty range fraction, since it's not included
> in the length-histogram.
>
> I tested this on a dataset containing birth and death dates of persons
> that have a wikipedia page, obtained from the dbpedia.org project. I can
> send a copy if someone wants it. The estimates seem pretty accurate.
>
> Please take a look, to see if I messed up something.

Committed this with some further changes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-08-27 14:17:44
Message-ID:	CAPpHfdtg3ktcb30EuMKCDgYiHv=jmMXLXPsx9cU5ydf=XzsqyA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Aug 27, 2012 at 5:00 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 24.08.2012 18:51, Heikki Linnakangas wrote:
>
>> On 20.08.2012 00:31, Alexander Korotkov wrote:
>>
>>> New version of patch.
>>> * Collect new stakind STATISTIC_KIND_BOUNDS_**HISTOGRAM, which is lower
>>> and
>>> upper bounds histograms combined into single ranges array, instead
>>> of STATISTIC_KIND_HISTOGRAM.
>>>
>>
>> One worry I have about that format for the histogram is that you
>> deserialize all the values in the histogram, before you do the binary
>> searches. That seems expensive if stats target is very high. I guess you
>> could deserialize them lazily to alleviate that, though.
>>
>> * Selectivity estimations for>,>=,<,<=,<<,>>,&<,&> using this
>>> histogram.
>>>
>>
>> Thanks!
>>
>> I'm going to do the same for this that I did for the sp-gist patch, and
>> punt on the more complicated parts for now, and review them separately.
>> Attached is a heavily edited version that doesn't include the length
>> histogram, and consequently doesn't do anything smart for the &< and &>
>> operators. && is estimated using the bounds histograms. There's now a
>> separate stakind for the empty range fraction, since it's not included
>> in the length-histogram.
>>
>> I tested this on a dataset containing birth and death dates of persons
>> that have a wikipedia page, obtained from the dbpedia.org project. I can
>> send a copy if someone wants it. The estimates seem pretty accurate.
>>
>> Please take a look, to see if I messed up something.
>>
>
> Committed this with some further changes.

Thanks! Sorry for I didn't provide a feedback for previous message.
Commited patch looks nice for me. I'm going to provide additional patch
with length-histogram and more selectivity estimates.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-09-04 13:27:00
Message-ID:	CAPpHfds=cxF13Zg3WBzbqBy35djGSbKpTve_i73RMQKY_B-08g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Aug 27, 2012 at 5:00 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

Addon patch is attached. Actually, I don't get your intention of
introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to leave
it as empty frac in distinct stakind or replace this stakind
with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
with STATISTIC_KIND_LENGTH_HISTOGRAM.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-addon-0.1.patch.gz	application/x-gzip	6.9 KB

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-09-30 23:22:01
Message-ID:	1349047321.15580.17.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
> Addon patch is attached. Actually, I don't get your intention of
> introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
> leave it as empty frac in distinct stakind or replace this stakind
> with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
> patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
> with STATISTIC_KIND_LENGTH_HISTOGRAM.

Review comments:

1. In compute_range_stats, you need to guard against the case where
there is no subdiff function. Perhaps default to 1.0 or something?

2. I think it would be helpful to add comments regarding what happens
when lengths are identical, right now it's a little confusing. For
instance, the comment: "Generate a length histogram slot entry if there
are at least two length values" doesn't seem right, because the
condition still matches even if there is only one distinct value.

3. It looks like get_distance also needs to guard against a missing
subdiff.

4. There are 3 binary search functions, which seems a little excessive:
* rbound_bsearch: greatest i such that hist[i] < v; or -1
* rbound_bsearch_equal: greatest i such that:
hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
* length_hist_bsearch: least i such that hist[i] >= v;
or length of hist
(let me know if I misunderstood the definitions)
At a minimum, we need more consistent and informative names. Also, the
definition of rbound_bsearch_equal is a little confusing because it's
looking for the highest index among distinct values, but the lowest
index among identical values. Do you see a way to refactor these to be a
little easier to understand?

Regards,
Jeff Davis

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-10-09 20:08:03
Message-ID:	CAPpHfdvPR8whLzjX70zx42Ur2_kjC2uTQW=vkPYwiTEtS=SGgA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Oct 1, 2012 at 3:22 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
> > Addon patch is attached. Actually, I don't get your intention of
> > introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
> > leave it as empty frac in distinct stakind or replace this stakind
> > with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
> > patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
> > with STATISTIC_KIND_LENGTH_HISTOGRAM.
>
> Review comments:
>
> 1. In compute_range_stats, you need to guard against the case where
> there is no subdiff function. Perhaps default to 1.0 or something?
>

Let it be 1.0 without subdiff function. However, there is not so much use
of this method of estimation without subdiff. But, probably it's better
than nothing.

2. I think it would be helpful to add comments regarding what happens
> when lengths are identical, right now it's a little confusing. For
> instance, the comment: "Generate a length histogram slot entry if there
> are at least two length values" doesn't seem right, because the
> condition still matches even if there is only one distinct value.
>
I've rephrased comment. Not it's implicitly says that collected values are
not necessary distinct.

> 3. It looks like get_distance also needs to guard against a missing
> subdiff.
>

Same to compute_range_stats. Let default value be 1.0.

> 4. There are 3 binary search functions, which seems a little excessive:
> * rbound_bsearch: greatest i such that hist[i] < v; or -1
> * rbound_bsearch_equal: greatest i such that:
> hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
> * length_hist_bsearch: least i such that hist[i] >= v;
> or length of hist
> (let me know if I misunderstood the definitions)
> At a minimum, we need more consistent and informative names. Also, the
> definition of rbound_bsearch_equal is a little confusing because it's
> looking for the highest index among distinct values, but the lowest
> index among identical values. Do you see a way to refactor these to be a
> little easier to understand?
>

Actually, goal of rbound_bsearch_equal is to find histogram bin to start
interpolation from. I've renamed it to rbound_bsearch_bin and added
corresponding comment.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.8.patch.gz	application/x-gzip	7.1 KB

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-10-18 18:12:51
Message-ID:	20121018181251.GJ3763@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki, would you be able to give this patch a look and perhaps commit
it?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-11-05 14:12:38
Message-ID:	20121105141238.GC12444@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

What's going on with this patch? I haven't seen any activity in a
while. Should I just move this to the next commitfest?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-11-06 06:10:50
Message-ID:	1352182250.6292.31.camel@jdavis-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, 2012-11-05 at 11:12 -0300, Alvaro Herrera wrote:
> What's going on with this patch? I haven't seen any activity in a
> while. Should I just move this to the next commitfest?

Sorry, I dropped the ball here. I will still review it, whether it makes
this commitfest or not.

Regards,
Jeff Davis

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-12-08 15:08:38
Message-ID:	20121208150837.GA19028@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2012-11-05 22:10:50 -0800, Jeff Davis wrote:
> On Mon, 2012-11-05 at 11:12 -0300, Alvaro Herrera wrote:
> > What's going on with this patch? I haven't seen any activity in a
> > while. Should I just move this to the next commitfest?
>
> Sorry, I dropped the ball here. I will still review it, whether it makes
> this commitfest or not.

Sorry to nag, but it starts to look like it might fall of the end of the
next CF...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-12-10 19:21:44
Message-ID:	1355167304.3896.37.camel@jdavis
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

It looks like there are still some problems with this patch.

CREATE TABLE foo(ir int4range);
insert into foo select 'empty' from generate_series(1,10000);
insert into foo select int4range(NULL, g, '(]')
from generate_series(1,1000000) g;
insert into foo select int4range(g, NULL, '[)')
from generate_series(1,1000000) g;
insert into foo select int4range(g, ((g*1.01)+10)::int4, '[]')
from generate_series(1,1000000) g;
CREATE TABLE bar(ir) AS select * from foo order by random();
ANALYZE bar;

Now:
EXPLAIN ANALYZE SELECT * FROM bar
WHERE ir @> int4range(10000,20000);

The estimates are "-nan". Similar for many other queries.

And I have a few other questions/comments:

* Why is "summ" spelled with two "m"s? Is it short for "summation"? If
so, might be good to use "summation of" instead of "integrate" in the
comment.

* Why does get_length_hist_frac return 0.0 when i is the last value? Is
that a mistake?

* I am still confused by the distinction between rbound_bsearch and
rbound_bsearch_bin. What is the intuitive purpose of each?

* You use "constant value" in the comments in several places. Would
"query value" or "search key" be better?

Regards,
Jeff Davis

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-12-13 20:40:33
Message-ID:	CAPpHfdtHXh+xuJzKBbzK1NFjbtia5vzK35uJiXgiKE9oaqmtOg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi, Jeff!

Thanks a lot for review!

On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> It looks like there are still some problems with this patch.
>
> CREATE TABLE foo(ir int4range);
> insert into foo select 'empty' from generate_series(1,10000);
> insert into foo select int4range(NULL, g, '(]')
> from generate_series(1,1000000) g;
> insert into foo select int4range(g, NULL, '[)')
> from generate_series(1,1000000) g;
> insert into foo select int4range(g, ((g*1.01)+10)::int4, '[]')
> from generate_series(1,1000000) g;
> CREATE TABLE bar(ir) AS select * from foo order by random();
> ANALYZE bar;
>
> Now:
> EXPLAIN ANALYZE SELECT * FROM bar
> WHERE ir @> int4range(10000,20000);
>
> The estimates are "-nan". Similar for many other queries.
>

Oh, yeah! It appears that infinities require much more cautious work with
them than I supposed. That should be fixes in the attached version of
patch. However, it require significant rethinking of comments. Will update
comments and address your questions in a couple of days. Could you recheck
if attached patch really fixes problem you reported?

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.9.patch.gz	application/x-gzip	7.2 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-01-04 08:42:13
Message-ID:	CAPpHfduHHH2gdOc7+0dY0Qw4uuiA7bKSkR_7ipq8U_upEZGiwQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Dec 10, 2012 at 11:21 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> And I have a few other questions/comments:
>
> * Why is "summ" spelled with two "m"s? Is it short for "summation"? If
> so, might be good to use "summation of" instead of "integrate" in the
> comment.
>

Fixed.

> * Why does get_length_hist_frac return 0.0 when i is the last value? Is
> that a mistake?
>

Comment was wrong. Actually it return fraction fraction of ranges which
length is *greater*.

> * I am still confused by the distinction between rbound_bsearch and
> rbound_bsearch_bin. What is the intuitive purpose of each?
>

I've added corresponding comments. rbound_bsearch is for scalar operators
and for bin corresponding to upper bound. rbound_bsearch_bin is
now rbound_bsearch_bin_lower. It is for bin corresponding to lower bound.

* You use "constant value" in the comments in several places. Would
> "query value" or "search key" be better?
>

Yes. Fixed.

I also renamed get_length_hist_frac to get_length_hist_summ and rewrote
comments about it. Hope it becomes more understandable.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.10.patch.gz	application/x-gzip	8.9 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-02-13 13:28:36
Message-ID:	511B9504.4080004@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 04.01.2013 10:42, Alexander Korotkov wrote:
> /*
> * Calculate selectivity of "&&" operator using histograms of range lower bounds
> * and histogram of range lengths.
> */
> static double
> calc_hist_selectivity_overlap(TypeCacheEntry *typcache, RangeBound *lower,
> RangeBound *upper, RangeBound *hist_lower, int hist_nvalues,
> Datum *length_hist_values, int length_hist_nvalues)

We already have code to estimate &&, based on the lower and upper bound
histograms:

> case OID_RANGE_OVERLAP_OP:
> case OID_RANGE_CONTAINS_ELEM_OP:
> /*
> * A && B <=> NOT (A <> B).
> *
> * "range @> elem" is equivalent to "range && [elem,elem]". The
> * caller already constructed the singular range from the element
> * constant, so just treat it the same as &&.
> */
> hist_selec =
> calc_hist_selectivity_scalar(typcache, &const_lower, hist_upper,
> nhist, false);
> hist_selec +=
> (1.0 - calc_hist_selectivity_scalar(typcache, &const_upper, hist_lower,
> nhist, true));
> hist_selec = 1.0 - hist_selec;
> break;

I don't think the method based on lower bound and length histograms is
any better. In fact, my gut feeling is that it's less accurate. I'd
suggest dropping that part of the patch.

- Heikki

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-02-13 13:55:25
Message-ID:	CAPpHfdskfvxxJ7b=GFz1WQDHQnvNMEegVKm=SnqNQX9T3grnOg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Feb 13, 2013 at 5:28 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 04.01.2013 10:42, Alexander Korotkov wrote:
>
>> /*
>> * Calculate selectivity of "&&" operator using histograms of range lower
>> bounds
>> * and histogram of range lengths.
>> */
>> static double
>> calc_hist_selectivity_overlap(**TypeCacheEntry *typcache, RangeBound
>> *lower,
>> RangeBound *upper, RangeBound
>> *hist_lower, int hist_nvalues,
>> Datum
>> *length_hist_values, int length_hist_nvalues)
>>
>
> We already have code to estimate &&, based on the lower and upper bound
> histograms:
>
> case OID_RANGE_OVERLAP_OP:
>> case OID_RANGE_CONTAINS_ELEM_OP:
>> /*
>> * A && B <=> NOT (A <> B).
>> *
>> * "range @> elem" is equivalent to "range &&
>> [elem,elem]". The
>> * caller already constructed the singular range
>> from the element
>> * constant, so just treat it the same as &&.
>> */
>> hist_selec =
>> calc_hist_selectivity_scalar(**typcache,
>> &const_lower, hist_upper,
>>
>> nhist, false);
>> hist_selec +=
>> (1.0 - calc_hist_selectivity_scalar(**typcache,
>> &const_upper, hist_lower,
>>
>> nhist, true));
>> hist_selec = 1.0 - hist_selec;
>> break;
>>
>
> I don't think the method based on lower bound and length histograms is any
> better. In fact, my gut feeling is that it's less accurate. I'd suggest
> dropping that part of the patch.
>

Right. This estimation has an accuracy of histogram, while estimation based
on lower bound and length histograms rely on additional assumption about
independence of lower bound and length histogram. We can sum A <> B probabilities because they are mutually exclusive. It's pretty evident
but I would like to mention it in the comments, because typical assumption
about events in statistics calculation is their independence.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-03-01 14:22:18
Message-ID:	CAPpHfdt92Fh5Jbovn-EXM_7Yv4Mhr6CvaVo=37jE85h1G2RGkA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Feb 13, 2013 at 5:55 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Wed, Feb 13, 2013 at 5:28 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> On 04.01.2013 10:42, Alexander Korotkov wrote:
>>
>>> /*
>>> * Calculate selectivity of "&&" operator using histograms of range
>>> lower bounds
>>> * and histogram of range lengths.
>>> */
>>> static double
>>> calc_hist_selectivity_overlap(**TypeCacheEntry *typcache, RangeBound
>>> *lower,
>>> RangeBound *upper, RangeBound
>>> *hist_lower, int hist_nvalues,
>>> Datum
>>> *length_hist_values, int length_hist_nvalues)
>>>
>>
>> We already have code to estimate &&, based on the lower and upper bound
>> histograms:
>>
>> case OID_RANGE_OVERLAP_OP:
>>> case OID_RANGE_CONTAINS_ELEM_OP:
>>> /*
>>> * A && B <=> NOT (A <> B).
>>> *
>>> * "range @> elem" is equivalent to "range &&
>>> [elem,elem]". The
>>> * caller already constructed the singular range
>>> from the element
>>> * constant, so just treat it the same as &&.
>>> */
>>> hist_selec =
>>> calc_hist_selectivity_scalar(**typcache,
>>> &const_lower, hist_upper,
>>>
>>> nhist, false);
>>> hist_selec +=
>>> (1.0 - calc_hist_selectivity_scalar(**typcache,
>>> &const_upper, hist_lower,
>>>
>>> nhist, true));
>>> hist_selec = 1.0 - hist_selec;
>>> break;
>>>
>>
>> I don't think the method based on lower bound and length histograms is
>> any better. In fact, my gut feeling is that it's less accurate. I'd suggest
>> dropping that part of the patch.
>>
>
> Right. This estimation has an accuracy of histogram, while estimation
> based on lower bound and length histograms rely on additional assumption
> about independence of lower bound and length histogram. We can sum A < and A >> B probabilities because they are mutually exclusive. It's pretty
> evident but I would like to mention it in the comments, because typical
> assumption about events in statistics calculation is their independence.
>

These changes were made in attached patch.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.11.patch.gz	application/x-gzip	8.5 KB

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-03-12 11:39:42
Message-ID:	513F13FE.3020509@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.03.2013 16:22, Alexander Korotkov wrote:
> These changes were made in attached patch.

Thanks.

I've been staring at this code for a very long time now, trying to
understand how the math in calc_hist_selectivity_contained works. I
think I understand it now, but it probably needs a lot more comments and
perhaps some refactoring, so that the next reader won't need to spend
hours deciphering it.

I'll walk through an example of a calc_hist_selectivity_contained
invocation, to verify that my understanding is correct. This isn't 100%
identical to how the function works; I explain it as if it holds some
temporary bins in memory and modifies them in steps, but in reality, it
keeps the bin distances and some other information only for the
current/previous bin it's processing, in local variables.

Assume that the query is "col <@ int4range(15, 50)", and the lower
bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram
looks like this:

Boundary 10 20 40 100 120
-+------+------+------+------+-
Fraction 0.25 0.25 0.25 0.25

Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same
proportion, 25%, of all the tuples in the table. The function first
finds the bins containing the lower and upper bounds, 15 and 55. All the
bins outside those bounds are ignored, as there are no matching tuples
there. The fractions and the bounds of first and last bin, ie. those
containing the lower and upper bounds, are adjusted according to the
boundary values, using linear interpolation. The lower bound, 15, falls
in the middle of the bin 10-20, and the upper bound, 55, splits the
40-100 bin at ratio of 1/5. The adjusted bins look like this:

Boundary 15 20 40 55
-+------+------+------+
Fraction 0.125 0.25 0.05

Next, we need to calculate what proportion of tuples in each bin has a
small enough length to be contained withing the query range. For that,
the distance of each bin boundary to the upper bound is calculated:

Distance 40 35 15 0
-+------+------+------+
Fraction 0.125 0.25 0.05

The bins are walked starting from the highest bin, ie. starting from
distance 0, walking up in increasing order of distance. For each bin,
the proportion of tuples within that range that have a suitable length
is calculated. The calc_length_hist_frac function does that. That
calculation is more complicated than it sounds: for example, for the
middle bin above, calc_length_hist_frac is passed both distance
boundaries, 15 and 35. calc_length_hist frac calculates the average of
P(x), when x slides linearly from 15 to 35, where P(x) is the fraction
of tuples with length <= x.

Now, here's a question, on something I didn't quite understand:

> * Returns average fraction of histogram which is greater than given length.
> * While this length is increasing from length1 to *length2. If histogram
> * ends up before *length2 then set covered fraction of (length1, *length2)
> * interval to *fraction and set end of histogram to *length2.
> */
> static double
> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
> double length1, double *length2, double *fraction)
> {

Why the behavior explained in the last sentence in the above comment? It
seems that the abstraction provided by calc_length_hist_frac() is leaky;
the caller shouldn't need to know anything about the boundaries of the
length bins.

Ignoring that, I believe that calc_length_hist_frac can also be
explained like this:

> /*
> * Let P(x) be the fraction of tuples with length <= x.
> *
> * calc_length_hist_frac calculates the average of P(x), in the interval [A, B].
> *
> * This can be calculated by the formula:
> *
> * B
> * 1 /
> * ------- | P(x)dx
> * B - A /
> * A
> */
> static double
> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
> double A, double B)

Am I correct this far? The function doesn't use the above formula as is,
but it could..

I'll continue trying to understand this and add comments..

- Heikki

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-03-12 16:03:50
Message-ID:	513F51E6.6040004@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.03.2013 16:22, Alexander Korotkov wrote:
> I've been staring at this code for a very long time now, trying to
> understand how the math in calc_hist_selectivity_contained works. I
> think I understand it now, but it probably needs a lot more comments and
> perhaps some refactoring, so that the next reader won't need to spend
> hours deciphering it.
>
> I'll walk through an example of a calc_hist_selectivity_contained
> invocation, to verify that my understanding is correct. This isn't 100%
> identical to how the function works; I explain it as if it holds some
> temporary bins in memory and modifies them in steps, but in reality, it
> keeps the bin distances and some other information only for the
> current/previous bin it's processing, in local variables.
>
> Assume that the query is "col <@ int4range(15, 50)", and the lower
> bounds histogram is (10, 20, 40, 100, 120). Visually, the histogram
> looks like this:
>
>
> Boundary 10 20 40 100 120
> -+------+------+------+------+-
> Fraction 0.25 0.25 0.25 0.25
>
> Each bin, 10-20, 20-40, 40-100 and 100-120, contains the same
> proportion, 25%, of all the tuples in the table. The function first
> finds the bins containing the lower and upper bounds, 15 and 55. All the
> bins outside those bounds are ignored, as there are no matching tuples
> there. The fractions and the bounds of first and last bin, ie. those
> containing the lower and upper bounds, are adjusted according to the
> boundary values, using linear interpolation. The lower bound, 15, falls
> in the middle of the bin 10-20, and the upper bound, 55, splits the
> 40-100 bin at ratio of 1/5. The adjusted bins look like this:
>
> Boundary 15 20 40 55
> -+------+------+------+
> Fraction 0.125 0.25 0.05
>
> Next, we need to calculate what proportion of tuples in each bin has a
> small enough length to be contained withing the query range. For that,
> the distance of each bin boundary to the upper bound is calculated:
>
> Distance 40 35 15 0
> -+------+------+------+
> Fraction 0.125 0.25 0.05
>
> The bins are walked starting from the highest bin, ie. starting from
> distance 0, walking up in increasing order of distance. For each bin,
> the proportion of tuples within that range that have a suitable length
> is calculated. The calc_length_hist_frac function does that. That
> calculation is more complicated than it sounds: for example, for the
> middle bin above, calc_length_hist_frac is passed both distance
> boundaries, 15 and 35. calc_length_hist frac calculates the average of
> P(x), when x slides linearly from 15 to 35, where P(x) is the fraction
> of tuples with length <= x.
>
> Now, here's a question, on something I didn't quite understand:
>
>> * Returns average fraction of histogram which is greater than given
>> length.
>> * While this length is increasing from length1 to *length2. If histogram
>> * ends up before *length2 then set covered fraction of (length1,
>> *length2)
>> * interval to *fraction and set end of histogram to *length2.
>> */
>> static double
>> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
>> double length1, double *length2, double *fraction)
>> {
>
> Why the behavior explained in the last sentence in the above comment? It
> seems that the abstraction provided by calc_length_hist_frac() is leaky;
> the caller shouldn't need to know anything about the boundaries of the
> length bins.
>
> Ignoring that, I believe that calc_length_hist_frac can also be
> explained like this:
>
>> /*
>> * Let P(x) be the fraction of tuples with length <= x.
>> *
>> * calc_length_hist_frac calculates the average of P(x), in the
>> interval [A, B].
>> *
>> * This can be calculated by the formula:
>> *
>> * B
>> * 1 /
>> * ------- | P(x)dx
>> * B - A /
>> * A
>> */
>> static double
>> calc_length_hist_frac(Datum *length_hist_values, int length_hist_nvalues,
>> double A, double B)
>
> Am I correct this far? The function doesn't use the above formula as is,
> but it could..
>
> I'll continue trying to understand this and add comments..

So, after some hacking, I ended up with this version. I find it more
readable, I hope I didn't miss anything. This seems to produce results
that are close, but not identical, to the original patch. I'm not sure
where the discrepancy is coming from, or which patch is more correct in
that respect. I'll continue from this tomorrow, but if you have the
time, please take a look and let me know what you think.

- Heikki

Attachment	Content-Type	Size
rangestats-0.12-heikki.patch	text/x-diff	27.1 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-03-13 06:44:14
Message-ID:	CAPpHfduTp5NOx_gbwfAxRc_OeOtN2_Yhtqo3o0S=x70ovtziww@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

Thanks for your work on this patch!

On Tue, Mar 12, 2013 at 8:03 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> So, after some hacking, I ended up with this version. I find it more
> readable, I hope I didn't miss anything. This seems to produce results that
> are close, but not identical, to the original patch. I'm not sure where the
> discrepancy is coming from, or which patch is more correct in that respect.
> I'll continue from this tomorrow, but if you have the time, please take a
> look and let me know what you think.
>

I've read your explanation and version of patch. In general it seems
correct to me.
There is one point why I have breaked up abstraction in some functions is
infinities. For example, in calc_length_hist_frac one or both of length1
and length2 can be infinity. In the line
> frac = area / (length2 - length1);
you can get NaN result. I've especially adjusted the code to get more of
less correct result in this case.

Another minor note about this line
> bin_width *= get_position(typcache, lower, &hist_lower[i],
> &hist_lower[i + 1]);
ITSM it sould looks like
> bin_width -= 1.0 - get_position(typcache, lower, &hist_lower[i],
> &hist_lower[i + 1]);
Imagine lower and upper bounds fall into same histogram bin. In this case
we should subtract lengths of both parts which were cut in the left and in
the right.

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-03-13 19:10:04
Message-ID:	5140CF0C.2080506@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.03.2013 16:22, Alexander Korotkov wrote:
> On Tue, Mar 12, 2013 at 8:03 PM, Heikki Linnakangas<hlinnakangas(at)vmware(dot)com
>> wrote:
>
>> So, after some hacking, I ended up with this version. I find it more
>> readable, I hope I didn't miss anything. This seems to produce results that
>> are close, but not identical, to the original patch. I'm not sure where the
>> discrepancy is coming from, or which patch is more correct in that respect.
>> I'll continue from this tomorrow, but if you have the time, please take a
>> look and let me know what you think.
>
> I've read your explanation and version of patch. In general it seems
> correct to me.
> There is one point why I have breaked up abstraction in some functions is
> infinities. For example, in calc_length_hist_frac one or both of length1
> and length2 can be infinity. In the line
>> frac = area / (length2 - length1);
> you can get NaN result. I've especially adjusted the code to get more of
> less correct result in this case.

Hmm, good point. I think I managed to fix those cases in the attached
version. Is there any other corner case that I missed?

> Another minor note about this line
>> bin_width *= get_position(typcache, lower,&hist_lower[i],
>> &hist_lower[i + 1]);
> ITSM it sould looks like
>> bin_width -= 1.0 - get_position(typcache, lower,&hist_lower[i],
>> &hist_lower[i + 1]);
> Imagine lower and upper bounds fall into same histogram bin. In this case
> we should subtract lengths of both parts which were cut in the left and in
> the right.

Yes, true. There's one negation too many above, though; should be:

bin_width -= get_position(typcache, lower,&hist_lower[i],
&hist_lower[i + 1]);

Fixed that. Barring any more issues, I'll read through this once more
tomorrow and commit.

- Heikki

Attachment	Content-Type	Size
rangestats-0.13-heikki.patch	text/x-diff	28.4 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-03-13 20:31:18
Message-ID:	CAPpHfdt4hY9AoWAjoEmNKiHn_VA=E+hf6==ZFBhJbtnEu9cSsA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Mar 13, 2013 at 11:10 PM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:

> On 01.03.2013 16:22, Alexander Korotkov wrote:
>
>> On Tue, Mar 12, 2013 at 8:03 PM, Heikki Linnakangas<hlinnakangas(at)**
>> vmware.com <hlinnakangas(at)vmware(dot)com>
>>
>>> wrote:
>>>
>>
>> So, after some hacking, I ended up with this version. I find it more
>>> readable, I hope I didn't miss anything. This seems to produce results
>>> that
>>> are close, but not identical, to the original patch. I'm not sure where
>>> the
>>> discrepancy is coming from, or which patch is more correct in that
>>> respect.
>>> I'll continue from this tomorrow, but if you have the time, please take a
>>> look and let me know what you think.
>>>
>>
>> I've read your explanation and version of patch. In general it seems
>> correct to me.
>> There is one point why I have breaked up abstraction in some functions is
>> infinities. For example, in calc_length_hist_frac one or both of length1
>> and length2 can be infinity. In the line
>>
>>> frac = area / (length2 - length1);
>>>
>> you can get NaN result. I've especially adjusted the code to get more of
>> less correct result in this case.
>>
>
> Hmm, good point. I think I managed to fix those cases in the attached
> version. Is there any other corner case that I missed?
>

Did you try test case by Jeff Davis on this thread?
http://www.postgresql.org/message-id/1355167304.3896.37.camel@jdavis
I try it with attached version of patch and get NaN estimate.

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2013-03-14 13:44:19
Message-ID:	5141D433.6020709@vmware.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01.03.2013 16:22, Alexander Korotkov wrote:
> On Wed, Mar 13, 2013 at 11:10 PM, Heikki Linnakangas<
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> On 01.03.2013 16:22, Alexander Korotkov wrote:
>>
>>>> frac = area / (length2 - length1);
>>>>
>>> you can get NaN result. I've especially adjusted the code to get more of
>>> less correct result in this case.
>>
>> Hmm, good point. I think I managed to fix those cases in the attached
>> version. Is there any other corner case that I missed?
>
> Did you try test case by Jeff Davis on this thread?
> http://www.postgresql.org/message-id/1355167304.3896.37.camel@jdavis
> I try it with attached version of patch and get NaN estimate.

Thanks, fixed that too.

Committed with a little bit more clean up and fixes. Thank you for
bearing with this long process :-). And many thanks Jeff for the review,
and sorry that I forgot to credit you for that in the commit message.

- Heikki