Re: Measure Theoretic Data Types in Postgresql

Lists: pgsql-hackers
From: Aaron Sheldon <aaron(dot)sheldon(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Measure Theoretic Data Types in Postgresql
Date: 2012-10-11 04:37:02
Message-ID: CADyahXZBmBS4ceqjPYidpFN8pA7GzE6PaBeCBd3SWM47ZFRbMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Release 9.2 introduced the range data types, which is a much needed
extension for a large number applications, particularly in statistics, and
measurement. The natural generalization of a range is a finite partition of
the underlying abstract data space into disjoint covering ranges; for
example two partitions of the real line could be

{(-\infty, -1], (-1, 0.5], (0.5, 0.75), [0.75, 25], (25, \infty)}

and

{(-\infty, -1), [-1, 1), [1, 1], (1, 2), [2, \infty)}

To be of use we have to assign a value to each range in the partition, this
defines a simple measurable function (see Billingsley "Probability and
Measure" or Rudin "Principles of Mathematical Analysis" chapter 11, for
introductory material on Measure Theory). For the two example partitions we
could have two functions mapping to a character datatype:

f : {(-\infty, -1] -> NULL, (-1, 0.5] -> 'A', (0.5, 0.75) -> 'B', [0.75,
25] -> 'A', (25, \infty) -> 'B'}

and

g : {(-\infty, -1) -> 'A', [-1, 1) -> 'B', [1, 1] -> 'C', (1, 2) -> NULL,
[2, \infty) -> 'A'}

The simple functions can be stored using a composite datatype containing
three arrays: an n element array of ascending boundary points, an n element
array of boundary topologies either left open ')[' or right open '](', and
an n+1 element array of values assigned to each range in the partition.
With the functions stored we can define point-wise operators for summation,
product, variance, etc... and more interestingly array and string
concatenation, and count distinct; the last example would result in the new
function:

h = count_distinct(f, g) : {(-\infty, -1] -> 1, (-1, 0.5] -> 2, (0.5, 0.75)
-> 1, [0.75, 1] -> 2, (1, 25] -> 1, (25, \infty) -> 2}

Which can be generalized to point-wise aggregates over finite lists of
measurable functions. While the point-wise aggregate functions are powerful
in and of themselves, the true power would be realised when dis-aggregating
the data type into the original ranges that form the partition. This would
allow for a succinct syntax to do calculations such as finding the daily
unique patient count given the intervals of their attendance in particular
programs; a computation I encounter routinely as a statistician for a
health services provider.

I have begun sketching these ideas in off the shelf pgSQL using composite
types and constructor functions; but am far from tackling anything like
building external C extensions to handle the data types. I can set-up a
GitHub repository if anyone is interested in seeing where I am taking this.
So far I have defined the composite types (lots of them) and the
constructor functions 'indicator(lower topology, lower point, upper point,
upper topology)' which creates an indicator (0/1) function of a range, and
'measure(lower topology, lower point, upper point, upper topology, value)'
which creates a measure function of the range; after this I will tackle the
dis-aggregation functions, the operators, and finally the aggregation
functions.

I realize similar work has been done in this vain using the set of sets
idea, but this work is meant to be an implementation of formal measure
theoretic concepts, and thus the algorithmic behaviour can be analysed and
proven using the tool kit of measure theory.

Aaron Sheldon

aaron(dot)sheldon(at)gmail(dot)com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Measure Theoretic Data Types in Postgresql
Date: 2012-10-11 22:17:32
Message-ID: 5077457C.1040709@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/10/12 9:37 PM, Aaron Sheldon wrote:
> I have begun sketching these ideas in off the shelf pgSQL using composite
> types and constructor functions; but am far from tackling anything like
> building external C extensions to handle the data types. I can set-up a
> GitHub repository if anyone is interested in seeing where I am taking this.

Please do. I can't even picture it right now, and I think I ought to be
in the target audience for this feature.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Aaron Sheldon <aaron(dot)sheldon(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Measure Theoretic Data Types in Postgresql
Date: 2012-10-12 07:48:23
Message-ID: 5077CB47.90605@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.10.2012 07:37, Aaron Sheldon wrote:
> This would allow for a succinct syntax to do calculations such as
> finding the daily unique patient count given the intervals of their
> attendance in particular programs; a computation I encounter
> routinely as a statistician for a health services provider.

Hmm. It's easy to get the count of unique patients on a particular date
with something like:

select count(distinct patient) from attendance where interval &&
'2012-10-12'::date

I guess what you're after is to get that count for a range of days, in
one query, so that the result looks something like this:

date | patients
-----------+------------
2012-10-05 | 20
2012-10-06 | 24
2012-10-07 | 30
2012-10-08 | 29

The way I think of that problem is that you need to join the dates
you're interested in with the attendance table.

select date, count (distinct patientid)
from attendance
inner join (
select '2012-10-04'::date + a AS date from generate_series(1,20) a
) dates on interval @> date
group by date;
date | count
------------+-------
2012-10-05 | 11
2012-10-06 | 27
2012-10-07 | 47
2012-10-08 | 63
2012-10-09 | 83
2012-10-10 | 95
2012-10-11 | 80
2012-10-12 | 60
2012-10-13 | 35
2012-10-14 | 13
(10 rows)

I created the test table for that with:

create table attendance (patientid int4 , interval daterange)
insert into attendance select id, daterange('2012-10-05'::date +
(random()*5)::int4, '2012-10-10'::date + (random()*5)::int4) from
generate_series(1,100) id;

So, I think the current range types already cover that use case pretty
well. I can't imagine how the proposed measure theoretic concepts would
make that simpler. Can you give some more complicated problem, perhaps,
that the proposed measure theoretic concepts would make simpler than the
current tools?

- Heikki


From: Aaron Sheldon <aaron(dot)sheldon(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Measure Theoretic Data Types in Postgresql
Date: 2012-10-12 16:43:21
Message-ID: CADyahXbNVZkYv7+r3z-76Fce-xY8Yjeucrc1v_4bZ_=dOfigSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

So the key algorithmic inefficient is the inner join on the generated
series. Worst case scenario this compares every range to every date in the
series, which for m ranges and n dates yields O(m*n) operations. The
analysts in my shop currently write queries like this for billions of
records against thousands of dates and then go and take 8 hour coffee
breaks.

However, by realizing that the bounds on the ranges have a linear ordering
one can speed this up to 0(m) using windowing functions on common table
expressions.

So what I am proposing is formalizing this optimization into a class of
data types, that will hide the implementation details.

On Fri, Oct 12, 2012 at 1:48 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 11.10.2012 07:37, Aaron Sheldon wrote:
>
>> This would allow for a succinct syntax to do calculations such as
>> finding the daily unique patient count given the intervals of their
>> attendance in particular programs; a computation I encounter
>> routinely as a statistician for a health services provider.
>>
>
> Hmm. It's easy to get the count of unique patients on a particular date
> with something like:
>
> select count(distinct patient) from attendance where interval &&
> '2012-10-12'::date
>
> I guess what you're after is to get that count for a range of days, in one
> query, so that the result looks something like this:
>
> date | patients
> -----------+------------
> 2012-10-05 | 20
> 2012-10-06 | 24
> 2012-10-07 | 30
> 2012-10-08 | 29
>
> The way I think of that problem is that you need to join the dates you're
> interested in with the attendance table.
>
> select date, count (distinct patientid)
> from attendance
> inner join (
> select '2012-10-04'::date + a AS date from generate_series(1,20) a
> ) dates on interval @> date
> group by date;
> date | count
> ------------+-------
> 2012-10-05 | 11
> 2012-10-06 | 27
> 2012-10-07 | 47
> 2012-10-08 | 63
> 2012-10-09 | 83
> 2012-10-10 | 95
> 2012-10-11 | 80
> 2012-10-12 | 60
> 2012-10-13 | 35
> 2012-10-14 | 13
> (10 rows)
>
> I created the test table for that with:
>
> create table attendance (patientid int4 , interval daterange)
> insert into attendance select id, daterange('2012-10-05'::date +
> (random()*5)::int4, '2012-10-10'::date + (random()*5)::int4) from
> generate_series(1,100) id;
>
>
> So, I think the current range types already cover that use case pretty
> well. I can't imagine how the proposed measure theoretic concepts would
> make that simpler. Can you give some more complicated problem, perhaps,
> that the proposed measure theoretic concepts would make simpler than the
> current tools?
>
> - Heikki
>

--
Aaron Sheldon

#67 Westedge Townhouses
5019 46 Ave, SW
Calgary AB, T3E 6R1

(h) 1.403.453.6316
(c) 1.403.710.9357
aaron(dot)sheldon(at)gmail(dot)com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Measure Theoretic Data Types in Postgresql
Date: 2012-10-12 17:06:41
Message-ID: 50784E21.2030104@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/12/12 12:48 AM, Heikki Linnakangas wrote:
>
> So, I think the current range types already cover that use case pretty
> well. I can't imagine how the proposed measure theoretic concepts would
> make that simpler. Can you give some more complicated problem, perhaps,
> that the proposed measure theoretic concepts would make simpler than the
> current tools?

Well, the nice thing about EXTENSIONs is that, if he builds it, people
can just try it out and see if it's useful. I suspect that the use
cases are rarified enough that this would always be an EXTENSION and not
core.

One thing I could use, for example, would be a time-keyed array, in the
form:

metrics_series ( '2012-10-17 45:22:10',10,
'1 second',15.0,15.1,16.2,NULL,15.8, 14.9, 15.1,14.2, 13.9, NULL )

WHich would allow me to do:

SELECT metric WHERE ts = '2012-10-17 45:22:14'

Without storing a timestamp with each element.

This is not a set of functionality I would expect to be generally useful
and belong in Core. But for a certain set of analytics applications, it
would be indispensible.

I expect that theoretic data types are the same way.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Nathan Boley <npboley(at)gmail(dot)com>
To: Aaron Sheldon <aaron(dot)sheldon(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Measure Theoretic Data Types in Postgresql
Date: 2012-10-12 17:41:58
Message-ID: CAHetpQRFHFBA4A2u-iG1FmNtSLcHDWOgSc8Jbs7fFbfcxd4j6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> However, by realizing that the bounds on the ranges have a linear ordering
> one can speed this up to 0(m) using windowing functions on common table
> expressions.
>
> So what I am proposing is formalizing this optimization into a class of data
> types, that will hide the implementation details.

Could this not also be handled by extending merge join to work with an
overlap operator?