Re: [HACKERS] Implementing STDDEV and VARIANCE

Lists: pgsql-hackers
From: Jeroen van Vianen <jeroen(at)design(dot)nl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Implementing STDDEV and VARIANCE
Date: 2000-01-23 16:56:12
Message-ID: 4.2.0.58.20000123170505.00954560@mail.design.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'd like to implement stddev and variance aggregates in Postgres. This is a
long standing TODO item.

There already has been some discussion on implementing this, see
http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html

There are two definitions for standard deviation and variance:
_
population variance = sigma^2 = SUM(X - X)^2 / N

population stddev = sqrt(population variance)
_
sample variance = s^2 = SUM(X - X)^2 / (N-1)

sample stddev = sqrt(sample variance)

These statistics can be calculated in one pass when three variables are
kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
two variables are kept.

E.g. avg() is calculated as follows:

sx = 0
n = 0
for every row {
sx = sx + value in row // transition function 1
n = n+1 // transition function 2
}
avg = sum(x) / n

stddev / variance might be calculated as follows:

sx = 0
n = 0
sx2 = 0
for every row {
sx = sx + value in row // transition function 1
n = n+1 // transition function 2
sx2 = sx2 + value in row^2 // transition function 3
}
var = (1/n) * (sx2 - (1/n) * sx^2) // Population

or

var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample

and

stddev = sqrt(var)

I've looked through the code and the following things need to be implemented:

1. Add three columns to pg_aggregate for the additional third transition
function.

Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
>All that you need to implement this is room to keep two running
>sums instead of one. I haven't looked at pgsql's aggregate functions,
>but I'd hope that the working state can be a struct not just a
>single number.

I saw no other way than adding another transition function and logic, as
this might break user-defined aggregates (are there any around?).

2. Add logic to nodeAgg.c to execute the third transition function and
finalize function with three rather than two parameters
3. Add functions f(a,b) that returns a + b^2 for selected types
4. Add four finalize functions to calculate the variance / stddev
5. Update the code for create aggregate, to include the definition of the
third transition function
6. Update the documentation

My questions are:
1. Is this the correct way to continue? What am I missing? Any errors in my
reasoning?
2. I am proposing the names stddev(x) and variance(x) for population and
samplestddev(x) and
samplevariance(x) for sample statistics. Any comments?
3. I'm planning to implement this for types float4, float8 and numeric. Any
other types also? int[2,4,8] don't seem logical, as these would introduce
serious rounding errors.

Let me know what you think,

Jeroen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-23 18:27:01
Message-ID: 6270.948652021@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeroen van Vianen <jeroen(at)design(dot)nl> writes:
> I'd like to implement stddev and variance aggregates in Postgres. This is a
> long standing TODO item.

> 1. Add three columns to pg_aggregate for the additional third transition
> function.

> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
>> All that you need to implement this is room to keep two running
>> sums instead of one. I haven't looked at pgsql's aggregate functions,
>> but I'd hope that the working state can be a struct not just a
>> single number.

> I saw no other way than adding another transition function and logic, as
> this might break user-defined aggregates (are there any around?).

Yes, there are some, and no you do not need a third transition
function. What you do need is a datatype holding two values that
you can use as the transition datatype, plus appropriate functions
for the transition functions.

The reason there are two transition functions at all is that it allows
some of the standard aggregate functions to be built using arithmetic
functions that exist anyway --- for example, float8 AVG is built from
float8 addition, float8 increment, and float8 divide, with only float8
increment being a function you wouldn't have anyway. However, the
whole thing is really a kluge; nodeAvg.c has all sorts of weird little
hacks that are necessary to make AVG have the right behavior in boundary
conditions such as no-tuples-in. (A blind application of float8 divide
would lead to a divide-by-zero exception in that case.) These hacks
limit the ability of user-defined aggregates to control their behavior
at the boundary conditions. Nor can an aggregate control its response
to NULL data values; that's hardwired into nodeAvg.c as well.

A cleaner solution would have just one transition function and one
transition data value, plus an optional finalization function that takes
only the one data value. For AVG the transition data type would be a
two-field struct and the transition function would update both fields.
This would halve the function-call overhead per tuple. We'd have to
provide specialized transition and finalization functions for AVG and
probably a couple of the other standard aggregates, but that would allow
us to rely on those functions to do the right things at the boundary
conditions; nodeAvg.c could stop foreclosing the choices.

I have been thinking about proposing such a change along with the
function manager rewrite that is now planned for 7.1. That would be
a good time because user-defined aggregates would need to be revisited
anyway. Also, if the transition functions are to determine the behavior
for no-tuples and for NULL data values, they had better be able to pass
and return NULLs cleanly; which depends on the function manager rewrite.

In short, I'd suggest thinking about implementing STDDEV with a
single transition function and transition data value. You'll need
specialized functions for it anyway, so I don't see that you're saving
any work by proposing a third transition function. What you will need
instead is a pg_type entry for the transition data type, but since that
data type needn't have any operators, there's not much work needed.

> 3. I'm planning to implement this for types float4, float8 and numeric. Any
> other types also? int[2,4,8] don't seem logical, as these would introduce
> serious rounding errors.

I'd suggest just two basic implementations, with float8 and numeric
internal calculations respectively. Data of other numeric types can
be type-coerced to one of these (that might even happen automatically),
but the output would always be either float8 or numeric. (I don't think
float4 has enough precision to generate reliable stddev numbers except
in very narrow conditions...)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-23 18:55:52
Message-ID: 6353.948653752@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> whole thing is really a kluge; nodeAvg.c has all sorts of weird little

For "nodeAvg.c" read "nodeAgg.c"; sorry for the momentary brain fade...

regards, tom lane


From: Jeroen van Vianen <jeroen(at)design(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-23 20:41:35
Message-ID: 388B677F.381B10E4@design.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Jeroen van Vianen <jeroen(at)design(dot)nl> writes:
> > I'd like to implement stddev and variance aggregates in Postgres. This is a
> > long standing TODO item.
>
> > 1. Add three columns to pg_aggregate for the additional third transition
> > function.
>
> > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >> All that you need to implement this is room to keep two running
> >> sums instead of one. I haven't looked at pgsql's aggregate functions,
> >> but I'd hope that the working state can be a struct not just a
> >> single number.
>
> > I saw no other way than adding another transition function and logic, as
> > this might break user-defined aggregates (are there any around?).
>
> Yes, there are some, and no you do not need a third transition
> function. What you do need is a datatype holding two values that
> you can use as the transition datatype, plus appropriate functions
> for the transition functions.

So it might be better to have this type hold three values (n, sum(x) and
sum(x^2)) and only use one transition function to update all three
values at once and have the finalization function do the necessary
calculations.

> The reason there are two transition functions at all is that it allows
> some of the standard aggregate functions to be built using arithmetic
> functions that exist anyway --- for example, float8 AVG is built from
> float8 addition, float8 increment, and float8 divide, with only float8
> increment being a function you wouldn't have anyway. However, the
> whole thing is really a kluge; nodeAvg.c has all sorts of weird little
> hacks that are necessary to make AVG have the right behavior in boundary
> conditions such as no-tuples-in. (A blind application of float8 divide
> would lead to a divide-by-zero exception in that case.) These hacks
> limit the ability of user-defined aggregates to control their behavior
> at the boundary conditions. Nor can an aggregate control its response
> to NULL data values; that's hardwired into nodeAvg.c as well.

Yes, I saw these little hacks. And there are boundary conditions with
stddev and variance with no rows and one row (for sample stddev and
sample variance).

> A cleaner solution would have just one transition function and one
> transition data value, plus an optional finalization function that takes
> only the one data value. For AVG the transition data type would be a
> two-field struct and the transition function would update both fields.
> This would halve the function-call overhead per tuple. We'd have to
> provide specialized transition and finalization functions for AVG and
> probably a couple of the other standard aggregates, but that would allow
> us to rely on those functions to do the right things at the boundary
> conditions; nodeAvg.c could stop foreclosing the choices.

So you suggest changing all transition functions for 7.1 to keep all the
state they need?

> I have been thinking about proposing such a change along with the
> function manager rewrite that is now planned for 7.1. That would be
> a good time because user-defined aggregates would need to be revisited
> anyway. Also, if the transition functions are to determine the behavior
> for no-tuples and for NULL data values, they had better be able to pass
> and return NULLs cleanly; which depends on the function manager rewrite.

Are you suggesting also to change the lay-out of pg_attribute in 7.1 to
something like this and do updates for all built-in types and
aggregates?

aggname
aggowner
aggtype
aggtranstype [ n, sx, sx2 ]
agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
aggtransfunction function that does ( n = n + 1, sx = sx + x,
sx2 = sx2 + x * x )
aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) /
n

Might it be better for me to wait for 7.1 before implementing stddev and
variance?

> In short, I'd suggest thinking about implementing STDDEV with a
> single transition function and transition data value. You'll need
> specialized functions for it anyway, so I don't see that you're saving
> any work by proposing a third transition function. What you will need
> instead is a pg_type entry for the transition data type, but since that
> data type needn't have any operators, there's not much work needed.

OK, clear.

> > 3. I'm planning to implement this for types float4, float8 and numeric. Any
> > other types also? int[2,4,8] don't seem logical, as these would introduce
> > serious rounding errors.
>
> I'd suggest just two basic implementations, with float8 and numeric
> internal calculations respectively. Data of other numeric types can
> be type-coerced to one of these (that might even happen automatically),
> but the output would always be either float8 or numeric. (I don't think
> float4 has enough precision to generate reliable stddev numbers except
> in very narrow conditions...)

OK, only float8 and numeric.

Jeroen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-24 04:27:22
Message-ID: 14683.948688042@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeroen van Vianen <jeroen(at)design(dot)nl> writes:
> aggname
> aggowner
> aggtype
> aggtranstype [ n, sx, sx2 ]
> agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
> aggtransfunction function that does ( n = n + 1, sx = sx + x,
> sx2 = sx2 + x * x )
> aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) /
> n

Right, that's pretty much what I'm visualizing. One minor detail: there
is not an "agginitfunction", there is an "agginitvalue". So your
special datatype to hold n/sx/sx2 must have at least a typinput function
that can convert the text string held in pg_aggregate into the desired
internal form of the initial state. (At least, that's how it's done
now. Do you want to argue to change it? As long as we're opening up
the AGG design for reconsideration, we could revisit that choice too.)

> Might it be better for me to wait for 7.1 before implementing stddev and
> variance?

Well, you will need to be pretty fast on the draw if you want to get it
into 7.0 release, since we will be freezing features for beta in a week.

But I see no reason that you couldn't implement STDDEV within the
existing framework; just ignore transfn2 and do it as above. You might
have some problems with getting the desired response for zero or one
tuples, but there isn't any way to fix that within the current
framework :-(. We have to do the function manager rewrite before you
can have control over when to return a NULL. As long as you are willing
to live with that, you can have useful functionality now.

regards, tom lane


From: Chris Bitmead <chris(at)bitmead(dot)com>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Happy column dropping
Date: 2000-01-24 04:39:38
Message-ID: 388BD78A.BEEA8502@bitmead.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing wrote:

> I'm sure many people would appreciate it even without "preserving oid-s"
> as OID's are declared deprecated for (AFAIK) >2 years,

OIDs should not be deprecated (never heard that one before). They will
become more important if and when postgres moves more toward being an
odbms. For reasons see any book on object theory and identity.


From: The Hermit Hacker <scrappy(at)hub(dot)org>
To: Chris Bitmead <chris(at)bitmead(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: ORDBMS<->OID (Re: [HACKERS] Happy column dropping)
Date: 2000-01-24 04:54:25
Message-ID: Pine.BSF.4.21.0001240052020.79710-100000@thelab.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 24 Jan 2000, Chris Bitmead wrote:

> Hannu Krosing wrote:
>
> > I'm sure many people would appreciate it even without "preserving oid-s"
> > as OID's are declared deprecated for (AFAIK) >2 years,
>
> OIDs should not be deprecated (never heard that one before). They will
> become more important if and when postgres moves more toward being an
> odbms. For reasons see any book on object theory and identity.

You are the second person that mentioned the ODBMS<->OID tie in ... when
you say ODBMS, is that the same as ORDBMS, or we talking a different
issue?

Want to elaborate? Like, since we already have the 'core', I believe,
what does it take to continue that trend?

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy(at)hub(dot)org secondary: scrappy(at){freebsd|postgresql}.org


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Chris Bitmead <chris(at)bitmead(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Happy column dropping
Date: 2000-01-24 04:55:17
Message-ID: 200001240455.XAA24088@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Hannu Krosing wrote:
>
> > I'm sure many people would appreciate it even without "preserving oid-s"
> > as OID's are declared deprecated for (AFAIK) >2 years,
>
> OIDs should not be deprecated (never heard that one before). They will
> become more important if and when postgres moves more toward being an
> odbms. For reasons see any book on object theory and identity.

Agreed.

--
Bruce Momjian | http://www.op.net/~candle
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Don Baccus <dhogaza(at)pacifier(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-24 05:02:13
Message-ID: 3.0.1.32.20000123210213.010588a0@mail.pacifier.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 11:27 PM 1/23/00 -0500, Tom Lane wrote:

>Right, that's pretty much what I'm visualizing. One minor detail: there
>is not an "agginitfunction", there is an "agginitvalue". So your
>special datatype to hold n/sx/sx2 must have at least a typinput function
>that can convert the text string held in pg_aggregate into the desired
>internal form of the initial state. (At least, that's how it's done
>now. Do you want to argue to change it? As long as we're opening up
>the AGG design for reconsideration, we could revisit that choice too.)

At the moment I have a hard time visualizing an aggregate function where
a constant initializer wouldn't serve, but ... what would be the cost of
the generalization? It would only be called once per query or subquery
containing the aggregate, right?

If generalizing it can be done over a latte or mocha, perhaps it's worth
doing. If it takes as long as drinking a pint of Guiness, perhaps not :)

(I mean, let's get into meaningful metrics here!)

- Don Baccus, Portland OR <dhogaza(at)pacifier(dot)com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.


From: Chris Bitmead <chris(at)bitmead(dot)com>
To: The Hermit Hacker <scrappy(at)hub(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ORDBMS<->OID (Re: [HACKERS] Happy column dropping)
Date: 2000-01-24 05:56:37
Message-ID: 388BE995.B12BF335@bitmead.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> You are the second person that mentioned the ODBMS<->OID tie in ... when
> you say ODBMS, is that the same as ORDBMS, or we talking a different
> issue?
>
> Want to elaborate? Like, since we already have the 'core', I believe,
> what does it take to continue that trend?

Hi, yes I am very interested in this issue, and ORDBMS and ODBMS are
rather different.

A long time ago, I wrote up a web page about this issue here...
http://www.tech.com.au/postgres/

which should tell you all my thoughts on this issue. Basicly I can
contribute a LOT of work to the postgres project, but as far as hacking
the back end to achieve it - the learning curve is steep compared to my
available time. But hacking a client side stuff, I could handle.

Enjoy,
Chris Bitmead.


From: Chris Bitmead <chris(at)bitmead(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: TOAST suggestion...
Date: 2000-01-24 07:26:17
Message-ID: 388BFE98.E15320B5@bitmead.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Can I make a suggestion for toast? I don't know how much this is thought
about yet, but it seems like it would be a good idea if large object
support be re-implemented in terms of TOAST (when it is available).
Because current
large objects suck so much.

Can I suggest that the TOAST system should be implemented to allow for
random access?
So for example, to retrieve a whole data member, of course it looks
normal...

SELECT xxx from yyy;

But to retrieve a particular chunk....

SELECT PORTION(xxx, 65535, 131071) from yyy;

to retrieve the 2nd 64k chunk.

This could be useful for many sorts of apps.


From: Jeroen van Vianen <jeroen(at)design(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Don Baccus <dhogaza(at)pacifier(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-24 10:25:38
Message-ID: 4.2.2.20000124111157.00aa2530@mail.design.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 11:27 PM 1/23/00 -0500, Tom Lane wrote:
>Jeroen van Vianen <jeroen(at)design(dot)nl> writes:
> > aggname
> > aggowner
> > aggtype
> > aggtranstype [ n, sx, sx2 ]
> > agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
> > aggtransfunction function that does ( n = n + 1, sx = sx + x,
> > sx2 = sx2 + x * x )
> > aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) /
> > n
>
>Right, that's pretty much what I'm visualizing. One minor detail: there
>is not an "agginitfunction", there is an "agginitvalue". So your
>special datatype to hold n/sx/sx2 must have at least a typinput function
>that can convert the text string held in pg_aggregate into the desired
>internal form of the initial state. (At least, that's how it's done
>now. Do you want to argue to change it? As long as we're opening up
>the AGG design for reconsideration, we could revisit that choice too.)

I would suggest supplying an initfunction that initializes the datatype
that holds n/sx/sx2 so you're able to set individual members to NULL if so
desired. I also won't need to implement typinput for all required aggregate
types, one small headache less ;-)

count --> int4
min/max --> basetype
sum --> basetype
avg --> basetype, n
stddev, variance --> n, basetype, basetype

> > Might it be better for me to wait for 7.1 before implementing stddev and
> > variance?
>
>Well, you will need to be pretty fast on the draw if you want to get it
>into 7.0 release, since we will be freezing features for beta in a week.

True.

>But I see no reason that you couldn't implement STDDEV within the
>existing framework; just ignore transfn2 and do it as above. You might
>have some problems with getting the desired response for zero or one
>tuples, but there isn't any way to fix that within the current
>framework :-(. We have to do the function manager rewrite before you
>can have control over when to return a NULL. As long as you are willing
>to live with that, you can have useful functionality now.

The problem with zero or one rows is pretty important IMO if you want to
implement stddev and variance for both population and sample. You won't be
able to explain the difference in outcome if you don't do it right.

Let me wait for you to overhaul the fmgr code and do all the aggregate
stuff right in one sweep. Only thing is: how do we deal with current
user-defined aggregates?

At 09:02 PM 1/23/00 -0800, Don Baccus wrote:
>At 11:27 PM 1/23/00 -0500, Tom Lane wrote:
> >Right, that's pretty much what I'm visualizing. One minor detail: there
> >is not an "agginitfunction", there is an "agginitvalue". So your
> >special datatype to hold n/sx/sx2 must have at least a typinput function
> >that can convert the text string held in pg_aggregate into the desired
> >internal form of the initial state. (At least, that's how it's done
> >now. Do you want to argue to change it? As long as we're opening up
> >the AGG design for reconsideration, we could revisit that choice too.)
>
>At the moment I have a hard time visualizing an aggregate function where
>a constant initializer wouldn't serve, but ... what would be the cost of
>the generalization? It would only be called once per query or subquery
>containing the aggregate, right?

Initializer functions for count need to return 0, for min, max, avg, sum,
stddev and variance they need to set individual members to NULL (at least
that's how I see it now). A function returning this (with the new fmgr
code) would be very easy to implement (I hope ;-) ).

I'll hold my breath until the dust settles and we're starting 7.1.

Jeroen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: Don Baccus <dhogaza(at)pacifier(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Implementing STDDEV and VARIANCE
Date: 2000-01-24 15:29:58
Message-ID: 25171.948727798@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeroen van Vianen <jeroen(at)design(dot)nl> writes:
>> Right, that's pretty much what I'm visualizing. One minor detail: there
>> is not an "agginitfunction", there is an "agginitvalue". So your
>> special datatype to hold n/sx/sx2 must have at least a typinput function
>> that can convert the text string held in pg_aggregate into the desired
>> internal form of the initial state. (At least, that's how it's done
>> now. Do you want to argue to change it? As long as we're opening up
>> the AGG design for reconsideration, we could revisit that choice too.)

> I would suggest supplying an initfunction that initializes the datatype
> that holds n/sx/sx2 so you're able to set individual members to NULL if so
> desired. I also won't need to implement typinput for all required aggregate
> types, one small headache less ;-)

I've been thinking about this and have come to the conclusion that using
an initial value string and a typinput function is still the right design,
compared to using a parameterless initializer function. See, that way
you have a shot at reusing the same transition type (and typinput code)
for several different aggregates with slightly different initial
condition requirements, whereas with the initializer-function approach,
you have no choice but to write a separate initializer function for each
aggregate initial condition you need.

I have been thinking that we could save some effort (at a trivial cost
in memory) by defining a datatype or two that is specifically intended
to be an aggregate transition state datatype. For example, a struct
containing three or four float8 fields and as many bool fields would
serve nicely for AVG, STDDEV, and probably some other aggregates; some
of them wouldn't use all the fields, but so what? The only support code
this datatype would need would be a typinput function to convert a
string from pg_aggregate into initial struct contents. By doing it that
way, the same datatype can support several aggregates with different
initial condition requirements, without having to write a separate
initializer function for each one.

I think the point about initializing struct contents to NULL is a red
herring. For basic C types like float8, there isn't such a thing as
NULL really; you have to have a separate flag field or count field to
tell you that you've seen no input yet. That can be initialized equally
well by a typinput function or by an initializer function. But the
typinput function gets to have a parameter taken from pg_aggregate;
an initializer function would not.

> Let me wait for you to overhaul the fmgr code and do all the aggregate
> stuff right in one sweep. Only thing is: how do we deal with current
> user-defined aggregates?

Well, a user aggregate that only used transfn1, or only transfn2, would
convert directly. This scheme would break user aggregates that used
both, which is why I'm running it up the flagpole early --- to see if
anyone complains.

regards, tom lane


From: "Nguyen, Thuan X" <thuann(at)internic(dot)net>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: unsubscribe
Date: 2000-01-24 15:44:59
Message-ID: Pine.SV4.4.10.10001241044350.26319-100000@opsmail.internic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


unsubscribe


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Implementing STDDEV and VARIANCE
Date: 2000-06-08 02:57:14
Message-ID: 200006080257.WAA09717@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Has this gone anywhere?

> I'd like to implement stddev and variance aggregates in Postgres. This is a
> long standing TODO item.
>
> There already has been some discussion on implementing this, see
> http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
>
> There are two definitions for standard deviation and variance:
> _
> population variance = sigma^2 = SUM(X - X)^2 / N
>
> population stddev = sqrt(population variance)
> _
> sample variance = s^2 = SUM(X - X)^2 / (N-1)
>
> sample stddev = sqrt(sample variance)
>
> These statistics can be calculated in one pass when three variables are
> kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
> two variables are kept.
>
> E.g. avg() is calculated as follows:
>
> sx = 0
> n = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> }
> avg = sum(x) / n
>
> stddev / variance might be calculated as follows:
>
> sx = 0
> n = 0
> sx2 = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> sx2 = sx2 + value in row^2 // transition function 3
> }
> var = (1/n) * (sx2 - (1/n) * sx^2) // Population
>
> or
>
> var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
>
> and
>
> stddev = sqrt(var)
>
> I've looked through the code and the following things need to be implemented:
>
> 1. Add three columns to pg_aggregate for the additional third transition
> function.
>
> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >All that you need to implement this is room to keep two running
> >sums instead of one. I haven't looked at pgsql's aggregate functions,
> >but I'd hope that the working state can be a struct not just a
> >single number.
>
> I saw no other way than adding another transition function and logic, as
> this might break user-defined aggregates (are there any around?).
>
> 2. Add logic to nodeAgg.c to execute the third transition function and
> finalize function with three rather than two parameters
> 3. Add functions f(a,b) that returns a + b^2 for selected types
> 4. Add four finalize functions to calculate the variance / stddev
> 5. Update the code for create aggregate, to include the definition of the
> third transition function
> 6. Update the documentation
>
> My questions are:
> 1. Is this the correct way to continue? What am I missing? Any errors in my
> reasoning?
> 2. I am proposing the names stddev(x) and variance(x) for population and
> samplestddev(x) and
> samplevariance(x) for sample statistics. Any comments?
> 3. I'm planning to implement this for types float4, float8 and numeric. Any
> other types also? int[2,4,8] don't seem logical, as these would introduce
> serious rounding errors.
>
> Let me know what you think,
>
>
> Jeroen
>
> ************
>

--
Bruce Momjian | http://www.op.net/~candle
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: JanWieck(at)t-online(dot)de (Jan Wieck)
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Jeroen van Vianen <jeroen(at)design(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Implementing STDDEV and VARIANCE
Date: 2000-06-09 10:39:51
Message-ID: 200006091039.MAA04186@hot.jw.home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:

I created all that sometimes back. Dunno why never added it
to contrib. Will post it another day.

Jan

> Has this gone anywhere?
>
> > I'd like to implement stddev and variance aggregates in Postgres. This is a
> > long standing TODO item.
> >
> > There already has been some discussion on implementing this, see
> > http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
> >
> > There are two definitions for standard deviation and variance:
> > _
> > population variance = sigma^2 = SUM(X - X)^2 / N
> >
> > population stddev = sqrt(population variance)
> > _
> > sample variance = s^2 = SUM(X - X)^2 / (N-1)
> >
> > sample stddev = sqrt(sample variance)
> >
> > These statistics can be calculated in one pass when three variables are
> > kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
> > two variables are kept.
> >
> > E.g. avg() is calculated as follows:
> >
> > sx = 0
> > n = 0
> > for every row {
> > sx = sx + value in row // transition function 1
> > n = n+1 // transition function 2
> > }
> > avg = sum(x) / n
> >
> > stddev / variance might be calculated as follows:
> >
> > sx = 0
> > n = 0
> > sx2 = 0
> > for every row {
> > sx = sx + value in row // transition function 1
> > n = n+1 // transition function 2
> > sx2 = sx2 + value in row^2 // transition function 3
> > }
> > var = (1/n) * (sx2 - (1/n) * sx^2) // Population
> >
> > or
> >
> > var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
> >
> > and
> >
> > stddev = sqrt(var)
> >
> > I've looked through the code and the following things need to be implemented:
> >
> > 1. Add three columns to pg_aggregate for the additional third transition
> > function.
> >
> > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> > >All that you need to implement this is room to keep two running
> > >sums instead of one. I haven't looked at pgsql's aggregate functions,
> > >but I'd hope that the working state can be a struct not just a
> > >single number.
> >
> > I saw no other way than adding another transition function and logic, as
> > this might break user-defined aggregates (are there any around?).
> >
> > 2. Add logic to nodeAgg.c to execute the third transition function and
> > finalize function with three rather than two parameters
> > 3. Add functions f(a,b) that returns a + b^2 for selected types
> > 4. Add four finalize functions to calculate the variance / stddev
> > 5. Update the code for create aggregate, to include the definition of the
> > third transition function
> > 6. Update the documentation
> >
> > My questions are:
> > 1. Is this the correct way to continue? What am I missing? Any errors in my
> > reasoning?
> > 2. I am proposing the names stddev(x) and variance(x) for population and
> > samplestddev(x) and
> > samplevariance(x) for sample statistics. Any comments?
> > 3. I'm planning to implement this for types float4, float8 and numeric. Any
> > other types also? int[2,4,8] don't seem logical, as these would introduce
> > serious rounding errors.
> >
> > Let me know what you think,
> >
> >
> > Jeroen
> >
> > ************
> >
>
>
> --
> Bruce Momjian | http://www.op.net/~candle
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
> + If your life is a hard drive, | 830 Blythe Avenue
> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
>

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Jeroen van Vianen <jeroen(at)design(dot)nl>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Implementing STDDEV and VARIANCE
Date: 2000-09-30 02:27:04
Message-ID: 200009300227.WAA02615@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeroen, not sure if you were involved in this, but standard deviation is
in the current development tree, and will be released in 7.1 in a few
months.

> I'd like to implement stddev and variance aggregates in Postgres. This is a
> long standing TODO item.
>
> There already has been some discussion on implementing this, see
> http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
>
> There are two definitions for standard deviation and variance:
> _
> population variance = sigma^2 = SUM(X - X)^2 / N
>
> population stddev = sqrt(population variance)
> _
> sample variance = s^2 = SUM(X - X)^2 / (N-1)
>
> sample stddev = sqrt(sample variance)
>
> These statistics can be calculated in one pass when three variables are
> kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
> two variables are kept.
>
> E.g. avg() is calculated as follows:
>
> sx = 0
> n = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> }
> avg = sum(x) / n
>
> stddev / variance might be calculated as follows:
>
> sx = 0
> n = 0
> sx2 = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> sx2 = sx2 + value in row^2 // transition function 3
> }
> var = (1/n) * (sx2 - (1/n) * sx^2) // Population
>
> or
>
> var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
>
> and
>
> stddev = sqrt(var)
>
> I've looked through the code and the following things need to be implemented:
>
> 1. Add three columns to pg_aggregate for the additional third transition
> function.
>
> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >All that you need to implement this is room to keep two running
> >sums instead of one. I haven't looked at pgsql's aggregate functions,
> >but I'd hope that the working state can be a struct not just a
> >single number.
>
> I saw no other way than adding another transition function and logic, as
> this might break user-defined aggregates (are there any around?).
>
> 2. Add logic to nodeAgg.c to execute the third transition function and
> finalize function with three rather than two parameters
> 3. Add functions f(a,b) that returns a + b^2 for selected types
> 4. Add four finalize functions to calculate the variance / stddev
> 5. Update the code for create aggregate, to include the definition of the
> third transition function
> 6. Update the documentation
>
> My questions are:
> 1. Is this the correct way to continue? What am I missing? Any errors in my
> reasoning?
> 2. I am proposing the names stddev(x) and variance(x) for population and
> samplestddev(x) and
> samplevariance(x) for sample statistics. Any comments?
> 3. I'm planning to implement this for types float4, float8 and numeric. Any
> other types also? int[2,4,8] don't seem logical, as these would introduce
> serious rounding errors.
>
> Let me know what you think,
>
>
> Jeroen
>
> ************
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Jeroen van Vianen <jeroen(dot)van(dot)vianen(at)satama(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Implementing STDDEV and VARIANCE
Date: 2000-10-02 07:57:18
Message-ID: 5.0.0.19.2.20001002095355.00aedeb0@ams010.satama.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 22:27 9/29/00 -0400, you wrote:
>Jeroen, not sure if you were involved in this, but standard deviation is
>in the current development tree, and will be released in 7.1 in a few
>months.

Yes, I noticed. And I'm happy to see it has been implemented without my
help, as I'm desperately running out of time to do any coding on Postgres.

Thanks,

Jeroen