Estimating geometric distributions

From: "Stephen Denne" <Stephen(dot)Denne(at)datamail(dot)co(dot)nz>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Estimating geometric distributions
Date: 2008-03-09 22:41:16
Message-ID: F0238EBA67824444BC1CB4700960CB4804D2F0EE@dmpeints002.isotach.com
Lists: pgsql-hackers

I wrote:
> I have a field whose distribution of frequencies of values is
> roughly geometric, rather than flat.
> Total rows = 36 million
> relpages=504864
> Distinct field values in use = 169
> 10 values account for 50% of the rows.
> 41 values account for 90% of the rows.
>
> After setting statistics target to 1000 for that field, and
> analyzing the table, the statistics row for that field had 75
> most frequent values and a histogram with 77 entries in it.
> Estimating 152 values in total.

"public";"mytable";"myfield";0;4;152;"{202,179,8,181,173,207,6,118,107,205,182,4,54,247,168,77,169,53,120,159,149,174,167,156,148,150,56,108,66,119,5,99,96,175,97,208,1,130,10,102,228,101,121,50,11,152,32,12,78,221,55,244,241,252,203,116,103,184,154,153,238,65,49,220,83,98,111,85,139,242,240,260,7,109,114}";"{0.0836433,0.0781667,0.0738367,0.0598533,0.04629,0.04447,0.0359833,0.0314267,0.0278333,0.0268,0.0251433,0.0244867,0.02438,0.0223433,0.0207567,0.0189667,0.0168833,0.01582,0.0150267,0.0141767,0.0130467,0.0128933,0.0125767,0.0123567,0.0116567,0.0114967,0.01048,0.01037,0.00994667,0.00987667,0.00977667,0.00965333,0.00916333,0.00828667,0.00732667,0.00712,0.00629,0.00624,0.00576667,0.00558667,0.00477667,0.00475333,0.00410333,0.00405667,0.00371667,0.00334667,0.00334,0.00312667,0.00312667,0.00302,0.00300333,0.00295,0.00287333,0.00271,0.00267,0.00240667,0.00224,0.00221333,0.00215333,0.0021,0.00205667,0.00202667,0.00197333,0.00197333,0.00168667,0.00166,0.00159333,0.00159,0.00154667,0.00150333,0.00149,0.00133333,0.00132,0.00112667,0.00104}";"{2,9,9,9,67,76,84,84,86,87,87,88,95,100,100,100,104,105,105,110,112,112,128,137,137,138,143,144,144,144,151,155,155,155,157,157,158,171,171,183,185,185,185,185,187,194,199,199,200,200,201,204,204,209,209,214,214,214,214,215,217,225,225,225,229,239,239,249,250,250,253,253,255,257,261,262,266}";0.449246

> My problem is frequent over-estimation of rows when restricting by
> this field with values not known at plan time.

Examples:
select * from mytable where myfield = ?;
select * from mytable where myfield in (subquery);

The arithmetic mean of my per-value frequencies is 214200; the geometric mean is 13444.

However, ANALYZE didn't find all of my values: it thinks there are only 152 of them, so it uses a mean of 238046 (36182976 / 152).
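For what it's worth, both means can be recomputed directly from the per-value row counts (a full-table sketch; ANALYZE of course only sees its sample):

select avg(n) as arithmetic_mean,           -- about 214200 here
       exp(avg(ln(n))) as geometric_mean    -- about 13444 here
from (select count(*) as n from mytable group by myfield) as per_value;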

When the subquery is estimated to return three myfield values, the query estimates 714138 rows (3 * 238046) and chooses a sequential scan over mytable (myfield is indexed).

explain select * from mytable where myfield in (values (1),(2),(3));

Hash IN Join  (cost=0.08..1009521.37 rows=714138 width=86)
  Hash Cond: (mytable.myfield = "*VALUES*".column1)
  ->  Seq Scan on mytable  (cost=0.00..866693.76 rows=36182976 width=86)
  ->  Hash  (cost=0.04..0.04 rows=3 width=4)
        ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4)

I think this query is much more likely to return around 40000 rows (three values times the geometric mean of roughly 13444), and a Bitmap Index Scan should be used.
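The actual row count for any given set of values can be checked against that estimate, e.g.:

explain analyze select * from mytable where myfield in (values (1),(2),(3));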

explain select * from mytable where myfield in (values (1),(2));

Nested Loop  (cost=4445.11..931383.93 rows=476092 width=86)
  ->  Unique  (cost=0.04..0.04 rows=2 width=4)
        ->  Sort  (cost=0.04..0.04 rows=2 width=4)
              Sort Key: "*VALUES*".column1
              ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4)
  ->  Bitmap Heap Scan on mytable  (cost=4445.08..462716.37 rows=238046 width=86)
        Recheck Cond: (mytable.myfield = "*VALUES*".column1)
        ->  Bitmap Index Scan on myindex  (cost=0.00..4385.56 rows=238046 width=0)
              Index Cond: (mytable.myfield = "*VALUES*".column1)

With three loops through the Bitmap Heap Scan, 3 * 462716.37 exceeds the sequential scan plan's 1009521.37, which is why the seq scan wins above, while with two loops it does not. But that per-loop cost estimate is far too high in the general case: adjusted for my expectation of 13444 rows per loop rather than 238046, it should be closer to 26000 per loop. At that cost the subquery would need to return close to 40 myfield values (1009521 / 26000) before a sequential scan became the cheaper choice.

Is there any facility already in PostgreSQL to help me here?

Hopefully an index type that I don't know about yet? (Geometric distributions are similar to those found in word count distributions).

If not... is there any merit in this idea:

During ANALYZE, calculate the geometric mean of the sampled per-value frequencies and, if it is significantly different from the arithmetic mean, store it in a new stats column. Then, when estimating the number of rows returned by queries of the form shown above, use the stored geometric mean instead of the arithmetic mean. (A rough sketch of the estimation change follows.)
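Purely to illustrate the estimation side (a sketch; the join by table name is an assumption, and n_distinct happens to be positive here):

-- current estimate for "myfield = <unknown value>": roughly reltuples / n_distinct
select c.reltuples / s.n_distinct as current_per_value_estimate   -- about 238046 here
from pg_class c
join pg_stats s on s.tablename = c.relname
where c.relname = 'mytable' and s.attname = 'myfield';

-- proposed: when a geometric mean is stored for the column, use it
-- (about 13444 here) in place of the arithmetic mean above.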

Regards,
Stephen Denne.

