Re: WIP: multivariate statistics / proof of concept

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2016-01-19 08:34:21
Message-ID: 569DF50D.9060906@archidevsys.co.nz
Lists: pgsql-hackers

On 12/12/14 05:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
>
> Great! Really glad to see you working on this.
>
>> + * FIXME This sample sizing is mostly OK when computing stats for
>> + * individual columns, but when computing multi-variate stats
>> + * for multivariate stats (histograms, mcv, ...) it's rather
>> + * insufficient. For small number of dimensions it works, but
>> + * for complex stats it'd be nice use sample proportional to
>> + * the table (say, 0.5% - 1%) instead of a fixed size.
>
> I don't think a fraction of the table is appropriate. As long as the
> sample is random, the accuracy of a sample doesn't depend much on the
> size of the population. For example, if you sample 1,000 rows from a
> table with 100,000 rows, or 1000 rows from a table with 100,000,000
> rows, the accuracy is pretty much the same. That doesn't change when
> you go from a single variable to multiple variables.
>
> You do need a bigger sample with multiple variables, however. My gut
> feeling is that if you sample N rows for a single variable, with two
> variables you need to sample N^2 rows to get the same accuracy. But
> it's not proportional to the table size. (I have no proof for that,
> but I'm sure there is literature on this.)
[...]

I did stage III statistics at University many moons ago...

The accuracy of the sample depends only on the value of N, not on the total
size of the population, with the obvious constraint that N <= the population
size.
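
Just to make that concrete, here's a throwaway Python sketch (nothing to do
with the patch itself, and the 30% matching fraction and table sizes are made
up for illustration): it estimates the fraction of "matching" rows from a
1000-row random sample, once against a 100k-row table and once against a
100M-row table. The spread of the estimates comes out essentially the same
for both.

    import random

    def estimate_fraction(table_size, true_fraction=0.3, sample_size=1000):
        # A row "matches" if its position falls in the first true_fraction
        # slice of the table; no need to materialise the table itself.
        cutoff = int(table_size * true_fraction)
        hits = sum(1 for _ in range(sample_size)
                   if random.randrange(table_size) < cutoff)
        return hits / sample_size

    random.seed(42)
    for size in (100_000, 100_000_000):
        estimates = [estimate_fraction(size) for _ in range(200)]
        mean = sum(estimates) / len(estimates)
        sd = (sum((e - mean) ** 2 for e in estimates) / len(estimates)) ** 0.5
        print(f"table rows={size:>11,}  mean estimate={mean:.3f}  spread={sd:.4f}")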

The standard error of a random sample is inversely proportional to the square
root of N. So using N = 100 gives a standard error of roughly 10%, and to
reduce that to 5% you would need N = 400.
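
Back-of-the-envelope, in Python (the exact constant depends on what you are
estimating - here I assume a worst-case proportion of 0.5 - but the scaling is
the point: quadrupling N halves the error):

    from math import sqrt

    p = 0.5                                   # assumed worst-case proportion
    for n in (100, 400, 1600):
        # standard error of an estimated proportion: sqrt(p*(1-p)/N)
        print(f"N={n:5d}  approx std err={sqrt(p * (1 - p) / n):.4f}")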

For multiple variables, the accuracy will also be a function of N - I don't
recall precisely how, but I suspect the required sample might be something
like M * N, where M is the number of variables (though I'm not certain).
Something like N^M might be needed if you want all the possible correlations
between sets of variables to be reasonably significant - but I'm mostly just
guessing here.
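
A hand-wavy illustration of the direction (not a proof of any particular
exponent): assume two independent columns, each with 10 equally likely
values. A single value has frequency 0.10, but a particular pair of values
has frequency 0.01, so at the same N its relative error is much worse.

    from math import sqrt

    n = 1000                                  # assumed sample size
    for label, freq in (("one column, single value", 0.10),
                        ("two columns, single pair", 0.01)):
        se = sqrt(freq * (1 - freq) / n)
        print(f"{label}: true freq={freq:.2f}  std err={se:.4f}  "
              f"relative error={se / freq:.0%}")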

So, looking at the above, using a % of table size is somewhat silly. However,
if you want to detect frequencies that occur at the 1% level, then you will
need to sample 1% of the table or more. So which approach is 'best' depends
on what you are trying to determine; a fixed sample size is more useful when
you need to decide between two different hypotheses.
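
For a rough feel of what it takes to even see a value occurring in 1% of rows
(assuming a simple random sample; the 1% figure is just for illustration):

    freq = 0.01                               # assumed frequency of the value
    for n in (100, 300, 1000, 3000):
        p_seen = 1 - (1 - freq) ** n          # chance it appears at least once
        print(f"N={n:5d}  P(seen at least once)={p_seen:.3f}")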

The sampling methodology is far more important than the ratio of N to
population size - consider the bias introduced by surveying via random
telephone numbers, even before the advent of mobile phones!

Cheers,
Gavin
