Multi-Dimensional Histograms

Lists: pgsql-hackers
From: David Fetter <david(at)fetter(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Multi-Dimensional Histograms
Date: 2009-06-29 16:52:00
Message-ID: 20090629165200.GC21081@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Folks,

For things like PostGIS, which will want to index in 4 dimensions
(x, y, z, t), we might want to have multi-dimensional selectivity
histograms and some way to use same.

Anybody here qualified to check out this paper on the subject, please
speak up :)

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.8475

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Nathan Boley <npboley(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-29 20:28:01
Message-ID: 6fa3b6e20906291328jc6f33b9me6f911845ea70bfa@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> For things like PostGIS, which will want to index in 4 dimensions
> (x, y, z, t), we might want to have multi-dimensional selectivity
> histograms and some way to use same.
>

Another use case is cross column statistics.

> Anybody here qualified to check out this paper on the subject, please
> speak up :)
>

Well, I don't know if I'm qualified per se, but here goes...

It seems seems like a very reasonable approach to multidimensional histograms.

For those who haven't had time to read the paper, it describes SGRID,
a greedy algorithm for partitioning an n-dimensional space into
non-overlapping n-dimensional boxes and 'outliers' ( points that dont
fit into any boxes, but dont warrant their own boxes ). Then, they
compare their method to a regression technique, where the space is
fit into a fixed grid histogram and then the grid densities are
estimated via regression. They also compare a plane splitting
algorithm, called MHIST, where the planes are constrained to be
orthogonal to the naive dimension vectors. They dismiss singular value
decomposition and the discrete wavelet transform as being too
parametric ( which is silly, IMHO ) and briefly mention a couple
standard clustering techniques ( which are probably not appropriate
for us, given their complexity ). Unsurprisingly, it does well on the
two test sets that they consider.

It think the general idea is fine, but it would certainly need some
modification to work for postgres.

In particular,

Using the naive dimension vectors ( ie, north and east for geo data,
or column a and column b for cross-column stats ) in combination with
the box constraint will probably lead to problems. Consider, for
example, a river that goes in a straight line north-east, and a table
that stores camp sites which are mostly along the river. Because SGRID
can only create boxes with north east edges, you would end up with a
bunch of tiny box on the river where one, long north-east pointing box
would do.

We would probably want to replace the error term that SGRID uses to
determine when to create a new box by a statistical test ( maybe
homogeneity of means? ) ( also, the same holds for the outliers ).

Finally, this creates the partition but ( AFAICT ) it doesn't describe
a method for locating the histogram estimate given a point ( although
that doesn't seem too difficult ).

-Nathan


From: David Fetter <david(at)fetter(dot)org>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-29 20:44:30
Message-ID: 20090629204430.GE21081@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
> > For things like PostGIS, which will want to index in 4 dimensions
> > (x, y, z, t), we might want to have multi-dimensional selectivity
> > histograms and some way to use same.
>
> Another use case is cross column statistics.

Good to see there's more than one use case. I was hoping people would
chime in with more use cases, and here one is, fast! :)

> > Anybody here qualified to check out this paper on the subject, please
> > speak up :)
>
> Well, I don't know if I'm qualified per se, but here goes...
>
> It seems seems like a very reasonable approach to multidimensional
> histograms.
>
> For those who haven't had time to read the paper, it describes
> SGRID, a greedy algorithm for partitioning an n-dimensional space
> into non-overlapping n-dimensional boxes and 'outliers' ( points
> that dont fit into any boxes, but dont warrant their own boxes ).
> Then, they compare their method to a regression technique, where
> the space is fit into a fixed grid histogram and then the grid
> densities are estimated via regression. They also compare a plane
> splitting algorithm, called MHIST, where the planes are constrained
> to be orthogonal to the naive dimension vectors. They dismiss
> singular value decomposition and the discrete wavelet transform as
> being too parametric ( which is silly, IMHO )

Should we have a separate discussion about eigenvalues? Wavelets?

> and briefly mention a couple standard clustering techniques ( which
> are probably not appropriate for us, given their complexity ).

Good to know.

> Unsurprisingly, it does well on the two test sets that they
> consider.

Wait. You mean they might have carefully chosen data to make their
point?!?

;)

> It think the general idea is fine, but it would certainly need some
> modification to work for postgres.
>
> In particular,
>
> Using the naive dimension vectors ( ie, north and east for geo data,
> or column a and column b for cross-column stats ) in combination with
> the box constraint will probably lead to problems. Consider, for
> example, a river that goes in a straight line north-east, and a table
> that stores camp sites which are mostly along the river. Because SGRID
> can only create boxes with north east edges, you would end up with a
> bunch of tiny box on the river where one, long north-east pointing box
> would do.
>
> We would probably want to replace the error term that SGRID uses to
> determine when to create a new box by a statistical test ( maybe
> homogeneity of means? ) ( also, the same holds for the outliers ).
>
> Finally, this creates the partition but ( AFAICT ) it doesn't describe
> a method for locating the histogram estimate given a point ( although
> that doesn't seem too difficult ).

Is that "not difficult," in terms of the math that needs doing, or
"not difficult," in terms of how well PostgreSQL is already set up to
implement, or...?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Fetter <david(at)fetter(dot)org>
Cc: Nathan Boley <npboley(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-29 22:43:35
Message-ID: 1239.1246315415@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>> ... They dismiss
>> singular value decomposition and the discrete wavelet transform as
>> being too parametric ( which is silly, IMHO )

> Should we have a separate discussion about eigenvalues? Wavelets?

I think it'd be a short discussion: what will you do with non-numeric
datatypes? We probably don't really want to assume anything stronger
than that the datatype has a total ordering.

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Nathan Boley <npboley(at)gmail(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-29 22:56:44
Message-ID: 20090629225644.GJ21081@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 06:43:35PM -0400, Tom Lane wrote:
> David Fetter <david(at)fetter(dot)org> writes:
> > On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
> >> ... They dismiss singular value decomposition and the discrete
> >> wavelet transform as being too parametric ( which is silly, IMHO
> >> )
>
> > Should we have a separate discussion about eigenvalues? Wavelets?
>
> I think it'd be a short discussion: what will you do with
> non-numeric datatypes? We probably don't really want to assume
> anything stronger than that the datatype has a total ordering.

That sounds about like the discussion we needed unless we later decide
we need to do something special for approximations of R^n :)

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Nathan Boley <npboley(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-29 23:49:50
Message-ID: 6fa3b6e20906291649h5f78776dx77474ea615959982@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>> Finally, this creates the partition but ( AFAICT ) it doesn't describe
>> a method for locating the histogram estimate given a point ( although
>> that doesn't seem too difficult ).

> Is that "not difficult," in terms of the math that needs doing, or
> "not difficult," in terms of how well PostgreSQL is already set up to
> implement, or...?
>

I only meant that any implementation would need to address this, but I
can think of simple ways to do it ( for instance, use the fixed width
grid method, and then store a reference to all of the intersecting
boxes ). But I am sure that there are much better ways ( nested
containment lists perhaps? ).

-Nathan


From: Nathan Boley <npboley(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-30 00:17:00
Message-ID: 6fa3b6e20906291717n1596ecc4qe7165d16018f4dfe@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> David Fetter <david(at)fetter(dot)org> writes:
>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>> ... They dismiss
>>> singular value decomposition and the discrete wavelet transform as
>>> being too parametric ( which is silly, IMHO )
>
>> Should we have a separate discussion about eigenvalues?  Wavelets?
>
> I think it'd be a short discussion: what will you do with non-numeric
> datatypes? We probably don't really want to assume anything stronger
> than that the datatype has a total ordering.

Well, in the general case, we could use their ranks.

At the end of the day, we cant do any dimension reduction unless the
ordering encodes some sort of useful information, and the data type
being in R^n is certainly no guarantee. Consider, for instance, the
cross correlation of zip-codes and area codes - you would really want
to order those by some geographic relation. I think that is why
cross-column stats is so hard in the general case.

That being said, for geographic data in particular, PCA or similar
could work well.

-Nathan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-30 02:22:15
Message-ID: 603c8f070906291922s34f786c4k683860696354a456@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 8:17 PM, Nathan Boley<npboley(at)gmail(dot)com> wrote:
> On Mon, Jun 29, 2009 at 3:43 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> David Fetter <david(at)fetter(dot)org> writes:
>>> On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote:
>>>> ... They dismiss
>>>> singular value decomposition and the discrete wavelet transform as
>>>> being too parametric ( which is silly, IMHO )
>>
>>> Should we have a separate discussion about eigenvalues?  Wavelets?
>>
>> I think it'd be a short discussion: what will you do with non-numeric
>> datatypes? We probably don't really want to assume anything stronger
>> than that the datatype has a total ordering.
>
> Well, in the general case, we could use their ranks.
>
> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.
>
> That being said, for geographic data in particular, PCA or similar
> could work well.

I'm finding myself unable to follow all the terminology on this thead.
What's dimension reduction? What's PCA?

Based on my last few months of answering questions on -performance,
and my own experience, it seems like a lot of the cases that arise in
practice are those where there is a WHERE clause of the form:

colA = constA and colB op constB

...and it sometimes turns out that the subset of the data where colA =
constA has a very different distribution for colB than the data as a
whole, leading to bad plans. In many cases, it seems like colA is
storing some discrete type of thing, like a customer ID, so the
distribution of colB where colA = constA tells you nothing about the
distribution of colB where colA = constA + someSmallDeltaA. It feels
like what you might need is statistics for colB (MCVs and/or a
histogram) for certain particular values of colA. Unfortunately, in
the general case the set of values of colA for which you need these
statistics might be inconveniently large.

...Robert


From: Joshua Tolley <eggyknap(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nathan Boley <npboley(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-30 04:34:22
Message-ID: 20090630043422.GI25336@eddie
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 10:22:15PM -0400, Robert Haas wrote:
> I'm finding myself unable to follow all the terminology on this thead.
> What's dimension reduction?

For instance, ask a bunch of people a bunch of survey questions, in hopes of
predicting some value (for instance, whether or not these people will
die of a heart attack in the next three years). After waiting three years,
discover that some of the questions didn't really help you predict the
outcome. Remove them from the dataset. That's dimension reduction.

> What's PCA?

Principal Component Analysis,
http://en.wikipedia.org/wiki/Principal_components_analysis

--
Josh / eggyknap
End Point, Corp.
http://www.endpoint.com


From: Nathan Boley <npboley(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-06-30 19:39:48
Message-ID: 6fa3b6e20906301239w1bf3b1cer88545747d041c3c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I'm finding myself unable to follow all the terminology on this thead.
>  What's dimension reduction?  What's PCA?

( Sorry for the jargon - thanks Josh )

> It feels like what you might need is statistics for colB (MCVs and/or a
> histogram) for certain particular values of colA.

Certainly - this is exactly what a multi-dimensional histogram would store.

> Unfortunately, in
> the general case the set of values of colA for which you need these
> statistics might be inconveniently large.
>

Which is why, in the general case, we need some sort of dimension reduction.

If we could say that, for instance, the distribution of colB is
roughly the same for all values of colA less than 100, and it is
roughly the same for all values of colA >= 100, then we would have
effectively reduced the dimension from ndistinct(colB)*ndistinct(colA)
to 2*ndistinct(colB).

The multidimensional histogram in the above paper is somewhat akin to
what you suggested and I just described - it attempts to select
contiguous regions that are "the same". PCA is a much different
approach, but it is still, broadly, a dimension reduction technique.

> At the end of the day, we cant do any dimension reduction unless the
> ordering encodes some sort of useful information, and the data type
> being in R^n is certainly no guarantee. Consider, for instance, the
> cross correlation of zip-codes and area codes - you would really want
> to order those by some geographic relation. I think that is why
> cross-column stats is so hard in the general case.

To clarify my response to Tom, directly above, if as in Robert's
example, all of the values in colA < 100 were different than all of
the values > 100 with respect to colB, then it is easy to represent
the structure in something manageable. Ie, we store two histograms for
colB: 1 for colA > 100, one for colA < 100. However, if there are the
same two classes in colB ( lets say C1 and C2 ) but rather than C1
being associated with values in colA < 100, it is associated with 100
completely random values in colA ( ie 1, 22, 47, 89, 91, 92, 107, ...
) there is not a whole lot we can do besides storing a list of all of
those values. We could maybe use the ratio of classes to improve the
average plan choice, but you would still get a lot of bad plans.

Could anyone provide a concrete example ( or better yet a data set )
where this occurs?

-Nathan


From: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nathan Boley <npboley(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-Dimensional Histograms
Date: 2009-07-05 08:15:22
Message-ID: e692861c0907050115l6123f2b4n6ee918c64cbfc7e0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 29, 2009 at 10:22 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
> I'm finding myself unable to follow all the terminology on this thead.
>  What's dimension reduction?  What's PCA?
[snip]

Imagine you have a dataset with two variables, say height in inches
and age in years. For tue purpose of discussion lets pretend for a
moment that all the people in your sample have height the same as
their age.

You could create a 2d histogram of your data:

|0000000200000000
|0000006000000000
a|0000030000000000
g|0000400000000000
e|0003000000000000
|0010000000000000
|0100000000000000
|----------------
height

You could store this 2d histogram as is and use it for all the things
you'd use histograms for.... or you could make an observation of the
structure and apply a rotation and flattening of the data and convert
it to a 1d histogram

[0113426200...] which is far more compact.

Often data has significant correlation, so it's often possible to
reduce the dimensionality without reducing the selectivity of the
histogram greatly.

This becomes tremendously important as the number of dimensions goes
up because the volume of a N dimensional space increases incredibly
fast as the number of dimensions increase.

PCA is used as one method of dimensionality reduction. In PCA you find
a linear transformation (scaling, rotation) of the data that aligns
the data so that the axis lines cut through the data-space in the
orientations with the greatest variance.

I have no clue how you would apply PCA to postgresql histograms, since
to build the PCA transform you need to do some non-trivial operations
with the data. Perhaps PCA could be done on a random sample of a
table, then that transformation could be stored and used to compute
the histograms. I'm sure there has been a lot of research on this.