Re: On Distributions In 7.2.1

From: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
To: pgsql-general(at)postgresql(dot)org
Subject: On Distributions In 7.2.1
Date: 2002-05-02 04:17:47
Message-ID: 1020313069.2325.19.camel@spikey.slithery.org
Lists: pgsql-general

Dear list,

I thought I would follow up my last posting on this topic by looking at 7.2.1 -
as there is a new estimator for distinct values (see src/backend/commands/analyze.c
~line 1010 onwards).
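
(For what it's worth - and I haven't checked that this is exactly what the 7.2.1
code does, so treat it as a guess worth verifying against the source - an estimator
of the Haas/Stokes flavour has the form

    n*d / (n - f1 + f1*n/N)

where n is the number of sampled rows, N the total number of rows in the table,
d the number of distinct values seen in the sample, and f1 the number of values
seen exactly once in the sample.)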

The specimen table is the same as was used in the 7.2 post...to recap briefly :

dbbench3=> \d fact0
        Table "fact0"
 Column |  Type   | Modifiers
--------+---------+-----------
 d0key  | integer |
 d1key  | integer |
 d2key  | integer |
 val    | integer |
 filler | text    |
Unique keys: fact0_pk

dbbench3=> select count(*) from fact0;
count
---------
3000000
(1 row)

dbbench3=> select count(distinct d0key) from fact0;
count
-------
3000
(1 row)

The data is generated with each value occurring the same number of
times - so the frequency of any particular value is :

dbbench3=> select 3000.0/3000000.0;
?column?
----------
0.001
(1 row)

The d0key values are uniformly distributed through the table, so it should be a
"good" test of the statistical sampling routines used by ANALYZE :

dbbench3=> alter table fact0 alter d0key set statistics 10;
ALTER
dbbench3=> analyze fact0;
ANALYZE
dbbench3=> select tablename,attname,n_distinct,most_common_freqs from pg_stats
where attname='d0key';

fact0 | d0key | 3128 |
{0.00266667,0.00233333,0.002,0.00166667,0.00166667,0.00166667,0.00166667,
0.00166667,0.00166667,0.00166667}
(1 row)

dbbench3=> alter table fact0 alter d0key set statistics 100;
ALTER
dbbench3=> analyze fact0;
ANALYZE
dbbench3=> select tablename,attname,n_distinct,most_common_freqs
from pg_stats where attname='d0key';

fact0 | d0key | 3000 |
{0.0007,0.0007,0.0007,0.0007,0.000666667,0.000666667,0.000666667,0.000666667,
0.000666667,0.000666667,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,
0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,0.000633333,
0.000633333,0.000633333,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,
0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.0006,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,0.000566667,
0.000566667,0.000566667,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,
0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333,0.000533333}
(1 row)

So the estimate of the number of distinct values is extremely accurate, even with 10
quantiles (in 7.2 it was overestimated by a factor of 10).

There is slightly odd behaviour with the frequencies decreasing with increasing numbers
of quantiles (same as in 7.2 .. same code here?).
The frequencies are estimated best at about 30 quantiles - all 30 are between 0.0009
and 0.001 in this case.

I am wondering if this is caused by my example not having any "real" most
common values (they are all as common as each other).

I am going to fiddle with my data generation script, skew the distribution and
see what effect that has.

best wishes

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2.1
Date: 2002-05-02 05:00:51
Message-ID: 4886.1020315651@sss.pgh.pa.us
Lists: pgsql-general

Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> There is slightly odd behaviour with the frequencies decreasing with
> increasing number of quantiles (same as 7.2 .. same code here ?).

That does seem curious. With the inevitable sampling error, you'd
expect that some values would be sampled at a bit more than their
true frequency, and others at a bit less. The oversampled ones would
be the ones to get into the MCV list. But what you've got here is
that even the most-commonly-sampled value showed up at a bit less
than its true frequency. Is this repeatable if you do ANALYZE over
and over? Maybe it was just a statistical fluke.

> I am wondering if this is caused by my example not having any "real" most
> common values (they are all as common as each other).
> I am going to fiddle with my data generation script, skew the
> distribution and see what effect that has.

Someone else reported some results that made it look like a logarithmic
frequency distribution was a difficult case for the stats gatherer:
http://archives.postgresql.org/pgsql-general/2002-03/msg01300.php
So please be sure to try that.

regards, tom lane


From: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2.1
Date: 2002-05-02 09:36:37
Message-ID: 1020332199.3822.16.camel@spikey.slithery.org
Lists: pgsql-general

Err - some of my puzzlement is explained by this (silly) error of mine:

there are 3000 distinct values, each with 1000 occurrences (*not* 3000),
so the frequency I am "chasing" is 0.00033... (not 0.001).
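(That is, the correct calculation is 1000.0/3000000.0 = 0.000333..., not the
3000.0/3000000.0 = 0.001 from the original post.)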

So for 10 quantiles we see ~0.0016 (too big),
for 30 we see ~0.001 (too big, but closer),
and for 100 we see ~0.00066 (still too big, but closer again).

So the estimate is actually "creeping" closer (rather than oscillating), which seems
a much healthier situation.

However Tom's observation is still valid (in spite of my math) - all the
frequencies are overestimated, rather than the expected "some bigger,
some smaller" sort of thing.

I will do some more ANALYZE runs and see what happens...

(then the log distribution could be fun)

regards

Mark

On Thu, 2002-05-02 at 17:00, Tom Lane wrote:
> Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> > There is slightly odd behaviour with the frequencies decreasing with
> > increasing number of quantiles (same as 7.2 .. same code here ?).
>
> That does seem curious. With the inevitable sampling error, you'd
> expect that some values would be sampled at a bit more than their
> true frequency, and others at a bit less. The oversampled ones would
> be the ones to get into the MCV list. But what you've got here is
> that even the most-commonly-sampled value showed up at a bit less
> than its true frequency. Is this repeatable if you do ANALYZE over
> and over? Maybe it was just a statistical fluke.
>
> > I am wondering if this is caused by my example not having any "real" most
> > common values (they are all as common as each other).
> > I am going to fiddle with my data generation script, skew the
> > distribution and see what effect that has.
>
> Someone else reported some results that made it look like a logarithmic
> frequency distribution was a difficult case for the stats gatherer:
> http://archives.postgresql.org/pgsql-general/2002-03/msg01300.php
> So please be sure to try that.
>
> regards, tom lane
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2.1
Date: 2002-05-02 14:11:50
Message-ID: 7233.1020348710@sss.pgh.pa.us
Lists: pgsql-general

Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> However Tom's observation is still valid (in spite of my math) - all the
> frequencies are overestimated, rather than the expected "some bigger,
> some smaller" sort of thing.

No, that makes sense. The values that get into the most-common-values
list are only going to be ones that are significantly more common (in
the sample) than the estimated average frequency. So if the thing makes
a good estimate of the average frequency, you'll only see upside
outliers in the MCV list. The relevant logic is in analyze.c:

/*
* Decide how many values are worth storing as most-common values.
* If we are able to generate a complete MCV list (all the values
* in the sample will fit, and we think these are all the ones in
* the table), then do so. Otherwise, store only those values
* that are significantly more common than the (estimated)
* average. We set the threshold rather arbitrarily at 25% more
* than average, with at least 2 instances in the sample. Also,
* we won't suppress values that have a frequency of at least 1/K
* where K is the intended number of histogram bins; such values
* might otherwise cause us to emit duplicate histogram bin
* boundaries.
*/
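
In other words, the cutoff boils down to roughly the following (a paraphrase for
illustration only, not the exact 7.2.1 source; the variable names and the input
numbers are just an example matching Mark's table with the default statistics
target):

#include <stdio.h>

int
main(void)
{
    /* assumed example inputs: a 3000-row sample (statistics target 10 * 300),
     * an estimated 3000 distinct values, and 10 histogram bins */
    double  samplerows = 3000.0;
    double  ndistinct = 3000.0;
    int     num_bins = 10;
    double  avgcount;
    double  mincount;
    double  maxmincount;

    /* average number of occurrences of a value in the sample */
    avgcount = samplerows / ndistinct;

    /* threshold: 25% more common than average, with at least 2 instances... */
    mincount = avgcount * 1.25;
    if (mincount < 2.0)
        mincount = 2.0;

    /* ...but never more than a 1/K sample frequency, so that values at
     * histogram-bin-boundary frequency are not suppressed */
    maxmincount = samplerows / num_bins;
    if (mincount > maxmincount)
        mincount = maxmincount;

    printf("a value must occur at least %g times in the sample to be kept as an MCV\n",
           mincount);
    return 0;
}

For these example numbers that works out to 2 occurrences, while the expected number
of occurrences of any one of the 3000 values in a 3000-row sample is only 1 - so the
only values that make the list are the ones that happened to be over-sampled, and
every MCV frequency looks like an overestimate.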

regards, tom lane


From: david blood <davidjblood(at)yahoo(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Tracking down Database growth
Date: 2002-05-02 15:41:41
Message-ID: 20020502154141.50439.qmail@web20603.mail.yahoo.com
Lists: pgsql-general

Anybody know of some scripts that will help me track down where my DB is
growing?

Thanks David Blood



From: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2.1
Date: 2002-05-04 04:49:24
Message-ID: 1020487765.1259.23.camel@spikey.slithery.org
Lists: pgsql-general

On Fri, 2002-05-03 at 02:11, Tom Lane wrote:
> Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> > However Tom's observation is still valid (in spite of my math) - all the
> > frequencies are overestimated, rather than the expected "some bigger,
> > some smaller" sort of thing.
>
> No, that makes sense. The values that get into the most-common-values
> list are only going to be ones that are significantly more common (in
> the sample) than the estimated average frequency. So if the thing makes
> a good estimate of the average frequency, you'll only see upside
> outliers in the MCV list. The relevant logic is in analyze.c:
>
>(snippage)

Oh, I see... I am thinking that a larger sample may reduce the likelihood
of sampled values *not* being included in my MCV list. A quick ALTER ...
SET STATISTICS 500 plus ANALYZE results in frequencies of around 0.0004
for each MCV - closer to the actual value of 0.00033.

So the frequency estimation algorithm is behaving well in this case, and
appears to be converging to the correct results (or within a sensible
neighbourhood of them...). So for this type of data (uniformly
distributed keys) one can ANALYZE with confidence...

I notice that for a reasonably large number of keys + big table
combination, using 100 quantiles gives much more accurate frequencies -
I wonder if it's worth a mention in the docs to the effect:

"ANALYZE with the default (10 or so) is ok for most cases, but for big
tables consider using ALTER ... SET STATISTICS 100 for commonly JOINed
or WHEREed columns"

(Next ...the logarithmic guy...)

regards

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark kirkwood <markir(at)slingshot(dot)co(dot)nz>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: On Distributions In 7.2.1
Date: 2002-05-04 15:08:07
Message-ID: 27010.1020524887@sss.pgh.pa.us
Lists: pgsql-general

Mark kirkwood <markir(at)slingshot(dot)co(dot)nz> writes:
> I wonder if it's worth a mention in the docs to the effect:

> "ANALYZE with the default (10 or so) is ok for most cases, but for big
> tables consider using ALTER ... SET STATISTICS 100 for commonly JOINed
> or WHEREed columns"

Could be. So far we have very little experience with setting the
statistics target --- the default of 10 was just picked out of the
air, and might not be the best general-purpose target. Or it might
be that 10 is fine, but the multiplier that's used to go from that
to the number of rows to sample isn't large enough. The current
multiplier is 300, ie, we sample 3000 rows by default:

/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 1 million rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 1 billion, a 300*k sample gives <= 0.59
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
stats->minrows = 300 * attr->attstattarget;

f = 0.5 isn't a very ambitious target, but because of the 1/f^2
dependency, pushing it down to say 0.1 would be expensive --- the
multiplier would become 7650 or so instead of 300.
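
(Spelling out the arithmetic from the comment above: with gamma = 0.01 and
n = 1 million, ln(2*n/gamma) = ln(2e8) is about 19.1, so
r = 4 * 19.1 * k / 0.5^2, i.e. roughly 306 * k, at f = 0.5, but
r = 4 * 19.1 * k / 0.1^2, i.e. roughly 7640 * k, at f = 0.1.)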

It could be that we need two knobs here not just one --- perhaps people
want to control the accuracy of the stats without necessarily making
the number of values stored larger. Or it could be that we don't
need to do anything. The values you were quoting look plenty close
enough to reality for the planner's purposes. (But this is a
particularly simple case; harder cases may show worse error...)

regards, tom lane