Perl Standard Deviation function is wrong !

Lists: pgsql-hackers
From: Andreas Zeugswetter <andreas(dot)zeugswetter(at)telecom(dot)at>
To: "'colink(at)latticesemi(dot)com'" <colink(at)latticesemi(dot)com>, "'jason(at)wagner(dot)com'" <jason(at)wagner(dot)com>
Cc: "'dg(at)illustra(dot)com'" <dg(at)illustra(dot)com>, "'hackers(at)postgresql(dot)org'" <hackers(at)postgresql(dot)org>
Subject: Perl Standard Deviation function is wrong !
Date: 1998-06-05 10:05:18
Message-ID: 01BD907A.6B7A9F10@zeugswettera.user.lan.at

Hi,

First of all I would like to thank you for your work on the Statistics Module.
Unfortunately a lot of books differ in their formula for variance and stdev.
In Europe, the corrected definition below, in which stdev is not simply the
square root of the variance, seems to be more popular.
For large populations (>400) the calculation will be almost the same,
but for small populations (like 5) the below calculation will be different.
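
To put rough numbers on that, here is a minimal sketch. The helper name
divisor_ratio is made up for illustration and is not part of the module.
It assumes both formulas share the same sum of squared deviations, so the
two results differ only by the factor sqrt(n / (n - 1)):

```perl
use strict;
use warnings;

# Ratio of the (n - 1)-divisor standard deviation to the n-divisor one.
# The sum of squared deviations cancels, so the ratio depends only on n.
# (divisor_ratio is an illustrative name, not a Statistics::Descriptive method.)
sub divisor_ratio {
    my ($n) = @_;
    die "need at least two data points\n" if $n < 2;
    return sqrt($n / ($n - 1));
}

printf "n = %3d  ratio = %.4f\n", $_, divisor_ratio($_) for (5, 400);
# n =   5  ratio = 1.1180
# n = 400  ratio = 1.0013
```

So the two conventions differ by about 12% for 5 values, but only by about
0.13% for 400.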

[Hackers] Please disregard my last mail on this subject. It was wrong.
Thanks,
Andreas Zeugswetter

David Gould wrote:
>The Perl Module "Statistics/Descriptive" has on the fly variance calculation.
>
>sub add_data {
>    my $self = shift; ##Myself
>    my $oldmean;
>    my ($min,$mindex,$max,$maxdex);
>
>    ##Take care of appending to an existing data set
>    $min = (defined ($self->{min}) ? $self->{min} : $_[0]);
>    $max = (defined ($self->{max}) ? $self->{max} : $_[0]);
>    $maxdex = $self->{maxdex} || 0;
>    $mindex = $self->{mindex} || 0;
>
>    ##Calculate new mean, pseudo-variance, min and max;
>    foreach (@_) {
>        $oldmean = $self->{mean};
>        $self->{sum} += $_;
>        $self->{count}++;
>        if ($_ >= $max) {
>            $max = $_;
>            $maxdex = $self->{count}-1;
>        }
>        if ($_ <= $min) {
>            $min = $_;
>            $mindex = $self->{count}-1;
>        }
>        $self->{mean} += ($_ - $oldmean) / $self->{count};
>        $self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
>    }
>
>    $self->{min} = $min;
>    $self->{mindex} = $mindex;
>    $self->{max} = $max;
>    $self->{maxdex} = $maxdex;
>    $self->{sample_range} = $self->{max} - $self->{min};
>    if ($self->{count} > 1) {
>        $self->{variance} = $self->{pseudo_variance} / ($self->{count} -1);
>        $self->{standard_deviation} = sqrt( $self->{variance});

Most books state:
$self->{variance} = $self->{pseudo_variance} / $self->{count};
$self->{standard_deviation} = sqrt( $self->{pseudo_variance} / ( $self->{count} - 1 ))

>    }
>    return 1;
>}


From: Brook Milligan <brook(at)trillium(dot)NMSU(dot)Edu>
To: andreas(dot)zeugswetter(at)telecom(dot)at
Cc: colink(at)latticesemi(dot)com, jason(at)wagner(dot)com, dg(at)illustra(dot)com, hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Perl Standard Deviation function is wrong !
Date: 1998-06-05 15:16:27
Message-ID: 199806051516.JAA18629@trillium.nmsu.edu

>Variance is just square of std. dev, no?

No ! Stdev is divided by count, Variance by (count - 1)

I think the difference really has to do with what you are calculating.
If you want the std. dev./var. of the data THEMSELVES, divide by the
count. If you want an estimate about the properties of the POPULATION
from which the data were sampled, divide by count-1. People have
needs for both in different circumstances.
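
As a rough illustration of the two uses, here is a small sketch on a
made-up data set. The sum_sq_dev helper and the variable names are mine,
not the module's, and it uses a plain two-pass calculation rather than the
module's on-the-fly update:

```perl
use strict;
use warnings;

# Sum of squared deviations from the mean (two-pass, for clarity).
# Illustrative helper only; not part of Statistics::Descriptive.
sub sum_sq_dev {
    my @data = @_;
    my $mean = 0;
    $mean += $_ / @data for @data;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @data;
    return $ss;
}

my @data = (2, 4, 4, 4, 5, 5, 7, 9);   # made-up sample, mean 5
my $n    = @data;
my $ss   = sum_sq_dev(@data);          # 32

my $var_data = $ss / $n;               # describes the data themselves
my $var_est  = $ss / ($n - 1);         # estimates the sampled population

printf "divide by n:     variance = %.4f  stdev = %.4f\n",
    $var_data, sqrt($var_data);
printf "divide by n - 1: variance = %.4f  stdev = %.4f\n",
    $var_est, sqrt($var_est);
# divide by n:     variance = 4.0000  stdev = 2.0000
# divide by n - 1: variance = 4.5714  stdev = 2.1381
```

In this sketch the stdev is the square root of the variance in both cases;
only the divisor changes.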

Perhaps there needs to be two versions, or a function argument, to
distinguish the two uses, both of which are legitimate.

Cheers,
Brook


From: Colin Kuskie <ckuskie(at)teleport(dot)com>
To: Brook Milligan <brook(at)trillium(dot)NMSU(dot)Edu>, andreas(dot)zeugswetter(at)telecom(dot)at, jason(at)wagner(dot)com, dg(at)illustra(dot)com, hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Perl Standard Deviation function is wrong !
Date: 1998-06-06 01:41:07
Message-ID: 35789E33.F9E1475A@teleport.com

Brook Milligan wrote:
>
> >Variance is just square of std. dev, no?
>
> No ! Stdev is divided by count, Variance by (count - 1)
>
> I think the difference really has to do with what you are calculating.
> If you want the std. dev./var. of the data THEMSELVES, divide by the
> count. If you want an estimate about the properties of the POPULATION
> from which the data were sampled, divide by count-1. People have
> needs for both in different circumstances.
>
> Perhaps there needs to be two versions, or a function argument, to
> distinguish the two uses, both of which are legitimate.

Gentlemen,
First let me apologize if this conversation has been taking place in
the Perl newsgroups. You've caught me at a time when I'm sans news
reader. (I could use Netscape, but .... <shudder> and I'd be ignored
by most of the gurus in the group).

Back to the topic at hand. The module states its references for the
statistical formulae as well as its methods of calculation so you
should always know what you're getting.

I haven't done intensive statistics for a long time. I inherited the
module from Jason Kastner to add more methods to it and to see if I
could make some changes to the interface. Since then, I've released
several fixes for bugs introduced by those changes. If the public demands
more statistics, then I'll make it so.

I'm a little leery of making changes without having some hard
references. If any of you would like to send me some (I'll be tracking
them down, too!) I'd appreciate it.

Once I have that warm fuzzy that I'm not just inventing mathematics,
then I'll change the methods for standard deviation and variance to
accept a single argument that causes them to give the DATA statistics
instead of the population statistics. I can't see overhauling the
default behavior and forcing people to rewrite scripts already in place.
It made them angry enough when I changed the OO interface...
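
Roughly what I have in mind, as an untested sketch and not the released
interface: the function names, the flag name $of_data, and the plain hash
ref standing in for the object are all assumptions for illustration. The
default stays at count - 1 so existing scripts keep their results:

```perl
use strict;
use warnings;

# Hypothetical sketch only; not the actual Statistics::Descriptive API.
# $stats is a hash ref with {count} and {pseudo_variance}, i.e. the
# running values that add_data maintains.
sub variance {
    my ($stats, $of_data) = @_;
    my $n = $stats->{count} || 0;
    # Default: divide by count - 1. With a true flag: divide by count.
    my $divisor = $of_data ? $n : $n - 1;
    return undef if $divisor < 1;      # not enough data for this divisor
    return $stats->{pseudo_variance} / $divisor;
}

sub standard_deviation {
    my ($stats, $of_data) = @_;
    my $var = variance($stats, $of_data);
    return defined $var ? sqrt($var) : undef;
}

my %stats = (count => 8, pseudo_variance => 32);   # made-up running values
printf "default:   %.4f\n", standard_deviation(\%stats);      # 2.1381
printf "DATA flag: %.4f\n", standard_deviation(\%stats, 1);   # 2.0000
```

Callers that pass no argument would see no change.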

I look forward to hearing from you, or having results to share with
you, soon!

Colin Kuskie

p.s. I recently changed jobs. My new email address is:
ckuskie(at)cadence(dot)com. A new release will give me the excuse to change
the module's documentation to reflect that.