Re: Multi-parameter aggregates.

Lists: pgsql-general
From: Berend Tober <btober(at)seaworthysys(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Multi-parameter aggregates.
Date: 2005-11-21 16:03:22
Message-ID: 4381EFCA.2030708@seaworthysys.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I'm interested in defining a covariance aggregate function. (As a
refresher, remember that covariance is a little bit like variance, but
is between two variables:

cov(X,Y) = <XY> - <X><Y>,

where the angular brackets in this case denote taking the averag.
Variance is a special case when X and Y are the same.)

But the whole user-defined aggregate thing is tough to get a handle on.
I'm not even sure if the direction I'm heading in below will actually
work, but as far as I got, I'm stuck on not knowing how to define a
aggregate that takes more that one variable as its argument, so its use
in SQL would look like, e.g.,

SELECT company, COVAR(year, sales) FROM annual_sales GROUP BY company;

Here is what I tried, and I'm wondering if the team here can help me
make this work (or tell me that the whole approach makes no sense,
etc.). All the DDL executes without syntactical errors until the last
function definition, and the problem is with the " BASETYPE=numeric"
line, i.e., "ERROR: AggregateCreate: function
covariance_accum(numeric[], numeric) does not exist"

CREATE TYPE public._covariance AS
(n integer, x numeric, y numeric, xy numeric);

CREATE OR REPLACE FUNCTION public.covariance_accum(_covariance, numeric,
numeric)
RETURNS _covariance AS '
BEGIN
_covariance.n := _covariance.n+1;
_covariance.x := _covariance.x+$2;
_covariance.y := _covariance.x+$3;
_covariance.xy:= _covariance.xy+($1*$2);
END;
'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.covariance_accum(_covariance, numeric,
numeric) TO public;
COMMENT ON FUNCTION public.covariance_accum(_covariance, numeric,
numeric) IS 'covariance aggregate transition function';

-- Need to include a check for N equal zero data points

CREATE OR REPLACE FUNCTION public.numeric_covariance(_covariance)
RETURNS numeric AS '
BEGIN
(_covariance.xy/_covariance.n) -
(_covariance.x/_covariance.n)*(_covariance.y/_covariance.n);
END;
'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.numeric_covariance(_covariance) TO public;
COMMENT ON FUNCTION public.numeric_covariance(_covariance) IS
'covariance aggregate final function';

CREATE AGGREGATE public.covariance(
BASETYPE=numeric,
SFUNC=covariance_accum,
STYPE=numeric[],
FINALFUNC=numeric_covariance,
INITCOND='{0,0,0,0}'
);

/*
--I also tried this:

CREATE AGGREGATE covariance(
BASETYPE='numeric, numeric',
SFUNC=covariance_accum,
STYPE=numeric[],
FINALFUNC=numeric_covariance,
INITCOND='{0,0,0,0}'
);

-- to no avail.
*/

Regards,
Berend

Attachment Content-Type Size
btober.vcf text/x-vcard 233 bytes

From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Berend Tober <btober(at)seaworthysys(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Multi-parameter aggregates.
Date: 2005-11-21 17:06:27
Message-ID: 20051121170626.GA91161@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, Nov 21, 2005 at 11:03:22AM -0500, Berend Tober wrote:
> I'm interested in defining a covariance aggregate function. (As a
> refresher, remember that covariance is a little bit like variance, but
> is between two variables:
>
> cov(X,Y) = <XY> - <X><Y>,
>
> where the angular brackets in this case denote taking the averag.
> Variance is a special case when X and Y are the same.)
>
> But the whole user-defined aggregate thing is tough to get a handle on.
> I'm not even sure if the direction I'm heading in below will actually
> work, but as far as I got, I'm stuck on not knowing how to define a
> aggregate that takes more that one variable as its argument, so its use
> in SQL would look like, e.g.,
>
> SELECT company, COVAR(year, sales) FROM annual_sales GROUP BY company;

I think aggregates must take a single value, so the above won't
work as written. However, in PostgreSQL 8.0 or later you could
define the aggregate's base type to be a composite type and do
something like

SELECT company, COVAR(ROW(year, sales)) FROM annual_sales GROUP BY company;

You'd create the aggregate like this:

CREATE TYPE covar_state AS (...);
CREATE TYPE xypair AS (x numeric, y numeric);

CREATE FUNCTION covar_accum(covar_state, xypair) RETURNS covar_state AS ...
CREATE FUNCTION covar_final(covar_state) RETURNS numeric AS ...

CREATE AGGREGATE covar (
BASETYPE = xypair,
SFUNC = covar_accum,
FINALFUNC = covar_final,
STYPE = covar_state,
INITCOND = '(...)'
);

--
Michael Fuhr


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Berend Tober <btober(at)seaworthysys(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Multi-parameter aggregates.
Date: 2005-11-21 17:06:40
Message-ID: 2584.1132592800@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Berend Tober <btober(at)seaworthysys(dot)com> writes:
> I'm stuck on not knowing how to define a
> aggregate that takes more that one variable as its argument,

That's because there isn't any way to do that. It's on the TODO list
I believe. In the meantime, you could possibly kluge it up by
defining a composite type to be the aggregate's argument, and calling
it like
SELECT covariance(row(...)) FROM ...

regards, tom lane


From: Berend Tober <btober(at)seaworthysys(dot)com>
To: Michael Fuhr <mike(at)fuhr(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Multi-parameter aggregates.
Date: 2005-11-21 17:33:15
Message-ID: 438204DB.7080406@seaworthysys.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Michael Fuhr wrote:

>On Mon, Nov 21, 2005 at 11:03:22AM -0500, Berend Tober wrote:
>
>
>>I'm interested in defining a covariance aggregate function.
>>
>>
>
>I think aggregates must take a single value, so the above won't
>work as written. However, in PostgreSQL 8.0 or later you could
>define the aggregate's base type to be a composite type
>

Thanks. I briefly explored something like that, using the existing POINT
data type, but didn't press it too far, pending mailling list advice.

Regards,
Berend

Attachment Content-Type Size
btober.vcf text/x-vcard 233 bytes

From: Joe Conway <mail(at)joeconway(dot)com>
To: Berend Tober <btober(at)seaworthysys(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Multi-parameter aggregates.
Date: 2005-11-21 19:18:05
Message-ID: 43821D6D.8010903@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Berend Tober wrote:
> I'm stuck on not knowing how to define a
> aggregate that takes more that one variable as its argument

This is currently unsupported.

Joe


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Berend Tober <btober(at)seaworthysys(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Multi-parameter aggregates.
Date: 2005-11-21 19:50:24
Message-ID: 43822500.6090205@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Joe Conway wrote:
> Berend Tober wrote:
>
>> I'm stuck on not knowing how to define a aggregate that takes more
>> that one variable as its argument
But I guess it _could_ take an array as argument, maybe even
a record (postgresql pseudonym for what's called a structure in C).

You'd use it with the following syntax (array):
select covariance(array[x, y]) from t ;

or, for a "record"-parameter (I didn't test it - I just
guess that it should work...)
select covariance((x, y)) from t ;

greetings, Florian Pflug


From: Berend Tober <btober(at)seaworthysys(dot)com>
To: Michael Fuhr <mike(at)fuhr(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Multi-parameter aggregates.
Date: 2005-11-25 10:13:34
Message-ID: 4386E3CE.6090604@seaworthysys.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Michael Fuhr wrote:

>On Mon, Nov 21, 2005 at 11:03:22AM -0500, Berend Tober wrote:
>
>
>>I'm interested in defining a covariance aggregate function. (As a
>>
>>
>...aggregates must take a single value, so the above won't
>work as written. However, in PostgreSQL 8.0 or later you could
>define the aggregate's base type to be a composite type ...
>
>

Thanks for you help. Here is what I came up with, using the existing
POINT data type, which when you think about it makes a lot of sense, and
even thoug having to explicitly cast a pair of columns as a point type
is a little cludgy, I'm get the analytical tool I need, regardless.

CREATE TYPE public._regression AS (
n integer,
x double precision,
y double precision,
xy double precision
);

CREATE OR REPLACE FUNCTION public.weighted_average(double precision,
double precision, double precision)
RETURNS double precision AS '
BEGIN
IF $1>1.0 THEN
RAISE EXCEPTION ''Weighted average % coefficient exceeds
unity.'', $1;
ELSIF $1<0 THEN
RAISE EXCEPTION ''Weighted average % coefficient less than
zero.'', $1;
END IF;
RETURN $1*$2 + (1.-$1)*$3;
END;
'LANGUAGE 'plpgsql' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION public.regression_accum(_regression, point)
RETURNS _regression AS '
SELECT
$1.n+1,
weighted_average($1.n/($1.n+1.), $1.x, $2[0]),
weighted_average($1.n/($1.n+1.), $1.y, $2[1]),
weighted_average($1.n/($1.n+1.), $1.xy, $2[0]*$2[1]);
'LANGUAGE 'sql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.regression_accum(_regression, point) TO
public;
COMMENT ON FUNCTION public.regression_accum(_regression, point)
IS 'regression aggregate transition function';

CREATE OR REPLACE FUNCTION public.covariance(_regression)
RETURNS double precision AS '
BEGIN
IF $1.n = 0 THEN
RAISE EXCEPTION ''No covariance data'';
END IF;
RETURN $1.xy - $1.x*$1.y;
END;
'LANGUAGE 'plpgsql' IMMUTABLE STRICT;
GRANT EXECUTE ON FUNCTION public.covariance(_regression) TO public;
COMMENT ON FUNCTION public.covariance(_regression)
IS 'cov aggregate final function';

CREATE AGGREGATE public.cov(
BASETYPE=point,
SFUNC=regression_accum,
STYPE=_regression,
FINALFUNC=covariance,
INITCOND='(0,0,0,0)'
);

-- Examples

CREATE LOCAL TEMPORARY TABLE test (
example integer,
x numeric,
y numeric
) without oids;

-- Hand calculated: cov(x,y)=0.25
INSERT INTO test VALUES (1, 0.0, 0.0);
INSERT INTO test VALUES (1, 1.0, 1.0);

-- Hand calculated: cov(x,x)=0.389, cov(y,y)=1.556, cov(x,y)=0.778
INSERT INTO test VALUES (2, 0.0, 0.0);
INSERT INTO test VALUES (2, 1.0, 2.0);
INSERT INTO test VALUES (2, 1.5, 3.0);

-- Hand calculated: cov(x,x)=0.264, cov(y,y)=0.192, cov(x,y)=0.214
INSERT INTO test VALUES (3, 0.25, 0.125);
INSERT INTO test VALUES (3, 1.0, 1.0);
INSERT INTO test VALUES (3, 1.5, 1.1);

--
http://www.quantlet.com/mdstat/scripts/mva/htmlbook/mvahtmlnode22.html:
cov(x,y) = -80.02
INSERT INTO test VALUES (4, 230, 125);
INSERT INTO test VALUES (4, 181, 99);
INSERT INTO test VALUES (4, 165, 97);
INSERT INTO test VALUES (4, 150, 115);
INSERT INTO test VALUES (4, 97, 120);
INSERT INTO test VALUES (4, 192, 100);
INSERT INTO test VALUES (4, 181, 80);
INSERT INTO test VALUES (4, 189, 90);
INSERT INTO test VALUES (4, 172, 95);
INSERT INTO test VALUES (4, 170, 125);

/*
Note that the value COV(X,X) and COV(Y,Y) do not equal output from
the native Postgresql aggregates VARIANCE(x) and VARIANCE(y),
respectively. I cannot explain why they are different, but COV()
is consistent with direct calculation.
*/

SELECT
'Example '||example AS title,
COUNT(*) AS N,
COV(POINT(x,x)) AS covXX,
COV(POINT(y,y)) AS covYY,
COV(POINT(x,y)) AS covXY,
VARIANCE(x),
VARIANCE(y)
FROM test
GROUP BY 1
ORDER BY 1;