Re: Cross-column statistics revisited

From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Cross-column statistics revisited
Date: 2008-10-15 10:53:10
Message-ID: e7e0a2570810150353h49826e8bh9dde3676109a7574@mail.gmail.com
Lists: pgsql-hackers

I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1] suggests the following concerns:

1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?

The major concern in #1 seemed to be that the most suitable form for
keeping most common value lists, histograms, etc. is in an array, and
at the time of the posts I read, arrays of composite types weren't
possible. This seems much less of a concern now -- perhaps in greatest
part because a test I just did against a recent 8.4devel sure makes it
look like stats on composite type columns aren't even kept. The most
straightforward approach would be to keep a simple multi-dimensional
histogram, but that leads to a discussion of #2.

#2 is an interesting problem, and possibly the most difficult from a
theoretical perspective. Provided a fair expectation that the results
would be useful, I'd like to play with various forms of statistical
regressions to compress large cross-column histograms. But I'm not
sure I want to spend the time if there are other serious issues with
the whole idea of cross-column stats. Another possibility involves not
storing a histogram at all, but instead storing an array of quantiles
for each column. In other words, if S represents the statistics target
of a column, we'd have an array A of S+1 entries of the type in
question, where the beginning and end of the array are the minimum and
maximum values in the column, and the range between any two
consecutive array elements A[n] and A[n+1] contains 1/S of the total
entries in the column. For a uniformly distributed column, these array
elements would be evenly spaced across the total range of the column;
the difference between consecutive array elements would indicate
deviations from this distribution. Unfortunately this only works if
there's a valid distance metric for the data type in question.
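
To make that concrete, here is a rough Python sketch of the quantile-array
idea (illustration only, not proposed code; real ANALYZE sampling is more
involved than this):

    # Illustrative sketch only -- not the ANALYZE implementation.
    # Given a sampled column and a statistics target S, build an array A of
    # S+1 boundary values so that roughly 1/S of the sample falls between
    # consecutive entries A[n] and A[n+1].
    def quantile_boundaries(sample, S):
        vals = sorted(sample)
        n = len(vals)
        # A[0] is the minimum, A[S] is the maximum; interior entries split
        # the sorted sample into S roughly equal-sized ranges.
        return [vals[min(round(i * (n - 1) / S), n - 1)] for i in range(S + 1)]

    # A uniformly distributed column gives evenly spaced boundaries; skew
    # shows up as unevenly spaced ones.
    print(quantile_boundaries(range(1000), 10))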

In the archives, #3 was raised frequently as a concern -- obviously
tracking stats on all permutations of the columns in a database is a
non-starter. However there seemed to be little argument over sensible
defaults, namely that columns involved in a multi-column index would
have cross-column statistics kept automatically, and the user would be
allowed to instruct the server to keep stats on other arbitrary groups
of columns. There was some discussion about the fact that the user
would have to be smart about how (s)he used this feature, but there
are enough other potential foot guns in the system of similar
magnitude that worrying too much about that seems of little use.

I bring all this up because I'm interested in looking at how this
might be done, and perhaps doing it if 1) time permits, and 2) someone
else doesn't beat me to it. Comments, anyone? Am I missing something
obvious/serious/etc.? Thanks in advance.

- Josh / eggyknap

[1] "Allow accurate statistics to be collected on indexes with more
than one column or expression indexes, perhaps using per-index
statistics", http://wiki.postgresql.org/wiki/Todo


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-15 13:51:37
Message-ID: 871vyiarva.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Joshua Tolley" <eggyknap(at)gmail(dot)com> writes:

> I've been interested in what it would take to start tracking
> cross-column statistics. A review of the mailing lists as linked from
> the TODO item on the subject [1] suggests the following concerns:
>
> 1) What information exactly would be tracked?
> 2) How would it be kept from exploding in size?
> 3) For which combinations of columns would statistics be kept?

I think then you have

4) How would we form estimates from these stats

> The major concern in #1 seemed to be that the most suitable form for
> keeping most common value lists, histograms, etc. is in an array, and
> at the time of the posts I read, arrays of composite types weren't
> possible. This seems much less of a concern now -- perhaps in greatest
> part because a test I just did against a recent 8.4devel sure makes it
> look like stats on composite type columns aren't even kept. The most
> straightforward is that we'd keep a simple multi-dimensional
> histogram, but that leads to a discussion of #2.

"multi-dimensional histogram" isn't such a simple concept, at least not to me.

Histograms aren't a bar chart of equal widths and various heights like I was
taught in school. They're actually bars of various widths arranged such that
they're all of the same height.

It's not clear how to extend that concept into two dimensions. I imagine
there's research on this though. What do the GiST statistics functions store?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-15 16:13:34
Message-ID: e7e0a2570810150913x16c4fa4eic06a17829def5968@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Joshua Tolley" <eggyknap(at)gmail(dot)com> writes:
>
>> I've been interested in what it would take to start tracking
>> cross-column statistics. A review of the mailing lists as linked from
>> the TODO item on the subject [1] suggests the following concerns:
>>
>> 1) What information exactly would be tracked?
>> 2) How would it be kept from exploding in size?
>> 3) For which combinations of columns would statistics be kept?
>
> I think then you have
>
> 4) How would we form estimates from these stats

That too, but see below.

>> The major concern in #1 seemed to be that the most suitable form for
>> keeping most common value lists, histograms, etc. is in an array, and
>> at the time of the posts I read, arrays of composite types weren't
>> possible. This seems much less of a concern now -- perhaps in greatest
>> part because a test I just did against a recent 8.4devel sure makes it
>> look like stats on composite type columns aren't even kept. The most
>> straightforward is that we'd keep a simple multi-dimensional
>> histogram, but that leads to a discussion of #2.
>
> "multi-dimensional histogram" isn't such a simple concept, at least not to me.
>
> Histograms aren't a bar chart of equal widths and various heights like I was
> taught in school. They're actually bars of various widths arranged such that
> they're all of the same height.

That depends on whose definition you've got in mind. Wikipedia's
version (for whatever that's worth) defines it as variable width
columns of not-necessarily-uniform heights, where the area of each bar
is the frequency. The default behavior of the hist() function in the R
statistics package is uniform bar widths, which generates complaints
from purists. Semantics aside, it looks like pgsql's stats stuff
(which I realize I'll need to get to know lots better to undertake
something like this) uses a list of quantiles, which still only makes
sense, as I see it, when a distance measurement is available for the
data type. The multidimensional case might do better with a frequency
count, where we'd store a matrix/cube/etc. with uniformly-sized units,
probably compressed via some regression function. That said, a
multidimensional quantile list would also be possible, compressed
similarly.

Provided a frequency chart or quantile set, it seems estimates can be
made just as they are in one dimension, by finding the right value in
the quantile or frequency matrix.

> It's not clear how to extend that concept into two dimensions. I imagine
> there's research on this though. What do the GIST statistics functions store?

No idea.

- Josh / eggyknap


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 17:11:26
Message-ID: 20081016171126.GB19967@svana.org
Lists: pgsql-hackers

On Wed, Oct 15, 2008 at 04:53:10AM -0600, Joshua Tolley wrote:
> I've been interested in what it would take to start tracking
> cross-column statistics. A review of the mailing lists as linked from
> the TODO item on the subject [1] suggests the following concerns:
>
> 1) What information exactly would be tracked?
> 2) How would it be kept from exploding in size?
> 3) For which combinations of columns would statistics be kept?

I think you need to take a step back: how are you going to use this data?
Whatever structure you choose, the eventual goal is to take a description
of the columns (a,b) and a clause like 'a < 5' and be able to
generate an estimate of the distribution of b.

Secondly, people aren't going to ask for multi-column stats on columns
that aren't correlated in some way. So you need to work out what kinds
of correlation people are interested in and see how you can store them.

One potential use case is the (startdate,enddate) pair of columns. Here what
you want is to somehow detect that the distribution of (enddate-startdate)
is constant.

I think the real question is: what other kinds of correlation might
people be interested in representing?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 17:20:30
Message-ID: 17221.1224177630@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> I think you need to go a step back: how are you going to use this data?

The fundamental issue as the planner sees it is not having to assume
independence of WHERE clauses. For instance, given

WHERE a < 5 AND b > 10

our current approach is to estimate the fraction of rows with a < 5
(using stats for a), likewise estimate the fraction with b > 10
(using stats for b), and then multiply these fractions together.
This is correct if a and b are independent, but can be very bad if
they aren't. So if we had joint statistics on a and b, we'd want to
somehow match that up to clauses for a and b and properly derive
the joint probability.
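
For illustration only (invented selectivities, in Python, not planner code):

    # Toy illustration of the current independence assumption.
    rows = 1_000_000
    sel_a = 0.10    # estimated fraction with a < 5, from stats on a
    sel_b = 0.20    # estimated fraction with b > 10, from stats on b
    print(rows * sel_a * sel_b)   # 20000 rows -- only right if a, b independent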

(I'm not certain of how to do that efficiently, even if we had the
right stats :-()

regards, tom lane


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Joshua Tolley <eggyknap(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 17:31:48
Message-ID: B71B9E9E-3F8D-48B2-9D99-A342AB043322@enterprisedb.com
Lists: pgsql-hackers

[sorry for top posting - damn phone]

It's pretty straightforward to do a chi-squared test on all the pairs.
But that tells you that the product is more likely to be wrong. It
doesn't tell you whether it's going to be too high or too low...
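
Roughly, in Python, with invented counts just to illustrate the test:

    # Rough sketch: Pearson chi-squared statistic for a sampled two-column
    # contingency table (counts invented for illustration).
    def chi_squared(observed):
        row_totals = [sum(r) for r in observed]
        col_totals = [sum(c) for c in zip(*observed)]
        total = sum(row_totals)
        stat = 0.0
        for i, r in enumerate(row_totals):
            for j, c in enumerate(col_totals):
                expected = r * c / total   # expected count under independence
                stat += (observed[i][j] - expected) ** 2 / expected
        return stat   # compare to a chi-squared critical value, (R-1)(C-1) dof

    print(chi_squared([[400, 50], [60, 490]]))   # large value: looks dependent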

greg

On 16 Oct 2008, at 07:20 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>> I think you need to go a step back: how are you going to use this
>> data?
>
> The fundamental issue as the planner sees it is not having to assume
> independence of WHERE clauses. For instance, given
>
> WHERE a < 5 AND b > 10
>
> our current approach is to estimate the fraction of rows with a < 5
> (using stats for a), likewise estimate the fraction with b > 10
> (using stats for b), and then multiply these fractions together.
> This is correct if a and b are independent, but can be very bad if
> they aren't. So if we had joint statistics on a and b, we'd want to
> somehow match that up to clauses for a and b and properly derive
> the joint probability.
>
> (I'm not certain of how to do that efficiently, even if we had the
> right stats :-()
>
> regards, tom lane
>


From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 17:34:59
Message-ID: 603c8f070810161034o8333bf3ka08a3230578022f6@mail.gmail.com
Lists: pgsql-hackers

> I think the real question is: what other kinds of correlation might
> people be interested in representing?

Yes, or to phrase that another way: What kinds of queries are being
poorly optimized now and why?

I suspect that a lot of the correlations people care about are
extreme. For example, it's fairly common for me to have a table where
column B is only used at all for certain values of column A. Like,
atm_machine_id is usually or always NULL unless transaction_type is
ATM, or something. So a clause of the form transaction_type = 'ATM'
and atm_machine_id < 10000 looks more selective than it really is
(because the first half is redundant).

The other half of this is that bad selectivity estimates only matter
if they're bad enough to change the plan, and I'm not sure whether
cases like this are actually a problem in practice.

...Robert


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 17:50:49
Message-ID: 20081016175049.GC19967@svana.org
Lists: pgsql-hackers

On Thu, Oct 16, 2008 at 01:34:59PM -0400, Robert Haas wrote:
> I suspect that a lot of the correlations people care about are
> extreme. For example, it's fairly common for me to have a table where
> column B is only used at all for certain values of column A. Like,
> atm_machine_id is usually or always NULL unless transaction_type is
> ATM, or something. So a clause of the form transaction_type = 'ATM'
> and atm_machine_id < 10000 looks more selective than it really is
> (because the first half is redundant).

That case is easily done by simply considering the indexed values of a
column with a partial index. This should be fairly easy to do, I think.

It might be worthwhile for someone to trawl through the archives looking
for examples where we estimate the correlation wrong.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Joshua Tolley <eggyknap(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 20:35:25
Message-ID: 48F7A58D.2090303@cheapcomplexdevices.com
Lists: pgsql-hackers

Robert Haas wrote:
>> I think the real question is: what other kinds of correlation might
>> people be interested in representing?
>
> Yes, or to phrase that another way: What kinds of queries are being
> poorly optimized now and why?

The kind that affects our largest tables is where we
have an address (or other geo-data) clustered by zip, but
with other columns (city, county, state, school-zone, police
beat, etc.) used in queries.

Postgres considers those unclustered (correlation 0 in the stats),
despite all rows for a given value residing on the same few pages.

I could imagine that this could be handled by either some cross-column
correlation (each zip has only 1-2 cities); or by an enhanced
single-column statistic (even though cities aren't sorted alphabetically,
all rows on a page tend to refer to the same city).


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Joshua Tolley <eggyknap(at)gmail(dot)com>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 20:54:57
Message-ID: 200810161354.58098.josh@agliodbs.com
Lists: pgsql-hackers

Tom,

> (I'm not certain of how to do that efficiently, even if we had the
> right stats :-()

I was actually talking to someone about this at pgWest. Apparently there's
a fair number of academic algorithms devoted to this topic. Josh, do you
remember who was talking about this?

--
--Josh

Josh Berkus
PostgreSQL
San Francisco


From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 21:34:26
Message-ID: e7e0a2570810161434t64a6345ftd7bc3de120738c0a@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 16, 2008 at 2:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Tom,
>
>> (I'm not certain of how to do that efficiently, even if we had the
>> right stats :-()
>
> I was actually talking to someone about this at pgWest. Apparently there's
> a fair amount of academic algorithms devoted to this topic. Josh, do you
> remember who was talking about this?

Actually, it was me :) My biggest concern at the time was finding ways
to compress the correlation data, on the apparently fairly tenuous
assumption that we'd somehow be able to make use of it. As it turns
out, the particular set of algorithms I had in mind don't compress
anything, but there are other methods that do.

Most of the comments on this thread have centered around the questions
of "what we'd store" and "how we'd use it", which might be better
phrased as, "The database assumes columns are independent, but we know
that's not always true. Does this cause enough problems to make it
worth fixing? How might we fix it?" I have to admit an inability to
show that it causes problems, though Neil Conway pointed me to some
literature[1] on the subject I've not yet had time to go through. My
basic idea is that if we have a cross-column frequency count, it will
help us get better plans, but I don't know the internals enough to
have details further than that.

- Josh / eggyknap

[1] http://www.cs.umd.edu/~amol/papers/paper-dep.pdf


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 22:12:18
Message-ID: 200810161512.18869.josh@agliodbs.com
Lists: pgsql-hackers


> Yes, or to phrase that another way: What kinds of queries are being
> poorly optimized now and why?

Well, we have two different correlation problems. One is the problem of
dependent correlation, the near-1.0 correlation of ZIP and CITY fields
being a common example. This could in fact be fixed, I believe, via a linear
math calculation based on the sampled level of correlation, assuming we
have enough samples. And it's really only an issue if the correlation is
greater than 0.5.

The second type of correlation issue we have is correlating values in a
parent table with *rows* in a child table (i.e. FK joins). Currently, the
planner assumes that all rows in the child table are evenly distributed
against keys in the parent table. But many real-world databases have this
kind of problem:

A            B
1            10000 rows
2            10000 rows
3            1000 rows
4 .. 1000    0 to 1 rows

For queries which cover values between 4..1000 on A, the misestimate won't
be much of a real execution problem. But for values 1,2,3, the query will
bomb.

> The other half of this is that bad selectivity estimates only matter
> if they're bad enough to change the plan, and I'm not sure whether
> cases like this are actually a problem in practice.

My experience is that any estimate which is more than 5x wrong (i.e. < .2
or > 5.0) usually causes problems, and 3x sometimes causes problems.

--
--Josh

Josh Berkus
PostgreSQL
San Francisco


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 22:20:58
Message-ID: FD242D4C-AAB8-4F62-8D61-3DB904A348F1@enterprisedb.com
Lists: pgsql-hackers

Correlation is the wrong tool. In fact zip codes and city names have nearly
zero correlation. Zip codes near 00000 are no more likely to be in
cities starting with A than with Z.

Even if you use an appropriate tool I'm not clear what to do with the
information. Consider the case of WHERE city='boston' and zip='02139'
and another query with WHERE city='boston' and zip='90210'. One will
produce many more records than the separate histograms would predict
and the other would produce zero. How do you determine which category
a given pair of constants falls into?

Separately you mention cross-table stats - but that's a whole other
kettle of worms. I'm not sure which is easier but let's do one at a
time?

greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

>
>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems. One is the
> problem of
> dependant correlation, such as the 1.0 correlation of ZIP and CITY
> fields
> as a common problem. This could in fact be fixed, I believe, via a
> linear
> math calculation based on the sampled level of correlation, assuming
> we
> have enough samples. And it's really only an issue if the
> correlation is
>> 0.5.
>
> The second type of correlation issue we have is correlating values
> in a
> parent table with *rows* in child table (i.e. FK joins). Currently,
> the
> planner assumes that all rows in the child table are evenly
> distributed
> against keys in the parent table. But many real-world databases
> have this
> kind of problem:
>
> A B
> 1 10000 rows
> 2 10000 rows
> 3 1000 rows
> 4 .. 1000 0 to 1 rows
>
> For queries which cover values between 4..1000 on A, the misestimate
> won't
> be much of a real execution problem. But for values 1,2,3, the
> query will
> bomb.
>
>> The other half of this is that bad selectivity estimates only matter
>> if they're bad enough to change the plan, and I'm not sure whether
>> cases like this are actually a problem in practice.
>
> My experience is that any estimate which is more than 5x wrong (i.e.
> < .2
> or > 5.0) usually causes problems, and 3x sometimes causes problems.
>
> --
> --Josh
>
> Josh Berkus
> PostgreSQL
> San Francisco
>


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-16 23:29:27
Message-ID: 48F7CE57.10004@cheapcomplexdevices.com
Lists: pgsql-hackers

Josh Berkus wrote:
>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems. One is the problem of
> dependant correlation, such as the 1.0 correlation of ZIP and CITY fields
> as a common problem. This could in fact be fixed, I believe, via a linear
> math calculation based on the sampled level of correlation, assuming we
> have enough samples. And it's really only an issue if the correlation is
>> 0.5.

I'd note that this can be an issue even without 2 columns involved.

I've seen a number of tables where the data is loaded in batches,
so similar values from a batch tend to be packed into relatively few pages.

Think of a database for a retailer that nightly aggregates data from
each of many stores. Each incoming batch inserts the store's data
into tightly packed disk pages where almost all rows on the page are for
that store. But those pages are interspersed with pages from other
stores.

I think I like the ideas Greg Stark had a couple years ago:
http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php
"...sort the sampled values by value
and count up the average number of distinct blocks per value.... Or
perhaps we need a second histogram where the quantities are of
distinct pages rather than total records.... We might also need a
separate "average number of n-block spans per value"
since those seem to me to lead more directly to values like "blocks
that need to be read".


From: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
To: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 00:00:20
Message-ID: BBCB1CB3-976C-4572-B1D8-E0F4D0E3345C@enterprisedb.com
Lists: pgsql-hackers

This is yet another issue entirely. This is about estimating how much
I/O will be random I/O if we do an index order scan. Correlation is a
passable tool for this but we might be able to do better.

But it has nothing to do with the cross-column stats problem.

greg

On 17 Oct 2008, at 01:29 AM, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
wrote:

> Josh Berkus wrote:
>>> Yes, or to phrase that another way: What kinds of queries are being
>>> poorly optimized now and why?
>> Well, we have two different correlation problems. One is the
>> problem of dependant correlation, such as the 1.0 correlation of
>> ZIP and CITY fields as a common problem. This could in fact be
>> fixed, I believe, via a linear math calculation based on the
>> sampled level of correlation, assuming we have enough samples. And
>> it's really only an issue if the correlation is
>>> 0.5.
>
> I'd note that this can be an issue even without 2 columns involved.
>
> I've seen a number of tables where the data is loaded in batches
> so similar-values from a batch tend to be packed into relatively few
> pages.
>
> Thinks a database for a retailer that nightly aggregates data from
> each of many stores. Each incoming batch inserts the store's data
> into tightly packed disk pages where most all rows on the page are for
> that store. But those pages are interspersed with pages from other
> stores.
>
> I think I like the ideas Greg Stark had a couple years ago:
> http://archives.postgresql.org/pgsql-hackers/2006-09/msg01040.php
> "...sort the sampled values by value
> and count up the average number of distinct blocks per value.... Or
> perhaps we need a second histogram where the quantities are of
> distinct pages rather than total records.... We might also need a
> separate "average number of n-block spans per value"
> since those seem to me to lead more directly to values like "blocks
> that need to be read".
>
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 00:32:38
Message-ID: 4073.1224203558@sss.pgh.pa.us
Lists: pgsql-hackers

"Joshua Tolley" <eggyknap(at)gmail(dot)com> writes:
> Most of the comments on this thread have centered around the questions
> of "what we'd store" and "how we'd use it", which might be better
> phrased as, "The database assumes columns are independent, but we know
> that's not always true. Does this cause enough problems to make it
> worth fixing? How might we fix it?" I have to admit an inability to
> show that it causes problems,

Any small amount of trolling in our archives will turn up plenty of
examples.

It appears to me that a lot of people in this thread are confusing
correlation in the sense of statistical correlation between two
variables with correlation in the sense of how well physically-ordered
a column is. (The latter is actually the same kind of animal, but
always taking one of the two variables to be physical position.)
A bad estimate for physical-position correlation has only limited
impact, as Josh B said upthread; but the other case leads to very
bad rowcount estimates which have *huge* impact on plan choices.

regards, tom lane


From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 01:30:43
Message-ID: e7e0a2570810161830y6e7939dby24328e7ae0808f73@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 16, 2008 at 6:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> It appears to me that a lot of people in this thread are confusing
> correlation in the sense of statistical correlation between two
> variables with correlation in the sense of how well physically-ordered
> a column is.

For what it's worth, neither version of correlation was what I had in
mind. Statistical correlation between two variables is a single
number, is fairly easy to calculate, and probably wouldn't help query
plans much at all. I'm more interested in gathering more complex
data. The data model I have in mind (which I note I have *not*
proven to actually help a large number of query plans -- that's
obviously an important part of what I'd need to do in all this)
involves instead a matrix of frequency counts. Right now our
"histogram" values are really quantiles; the statistics_target T for a
column determines a number of quantiles we'll keep track of, and we
grab values from the column into an ordered list L so that approximately 1/T of
the entries in that column fall between values L[n] and L[n+1]. I'm
thinking that multicolumn statistics would instead divide the range of
each column up into T equally sized segments, to form in the
two-column case a matrix, where the values of the matrix are frequency
counts -- the number of rows whose values for each column fall within
the particular segments of their respective ranges represented by the
boundaries of that cell in the matrix. I just realized while writing
this that this might not extend to situations where the two columns
are from different tables and don't necessarily have the same row
count, but I'll have to think about that.
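
To make the shape of that concrete, here is a toy Python sketch of such a
two-column frequency matrix with T equal-width segments per column
(illustration only; it assumes a numeric distance, per the caveats above):

    # Toy sketch: T x T frequency matrix over equal-width segments of two
    # sampled numeric columns.
    def freq_matrix(rows, T):
        xs = [x for x, _ in rows]
        ys = [y for _, y in rows]
        xlo, xhi, ylo, yhi = min(xs), max(xs), min(ys), max(ys)
        def seg(v, lo, hi):
            # map a value to one of T equal-width segments of [lo, hi]
            return min(int((v - lo) / (hi - lo) * T), T - 1) if hi > lo else 0
        m = [[0] * T for _ in range(T)]
        for x, y in rows:
            m[seg(x, xlo, xhi)][seg(y, ylo, yhi)] += 1
        return m

    # A clause on both columns could then be estimated by summing the cells
    # its ranges cover and dividing by the number of sampled rows.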

Anyway, the size of such a matrix grows exponentially with the number of
columns involved (T cells per dimension), and cross-column statistics
involving just a few columns could easily involve millions of values, for
fairly normal statistics_targets.
That's where the compression ideas come in to play. This would
obviously need a fair bit of testing, but it's certainly conceivable
that modern regression techniques could reduce that frequency matrix
to a set of functions with a small number of parameters. Whether that
would result in values the planner can look up for a given set of
columns without spending more time than it's worth is another question
that will need exploring.
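
Purely to illustrate what lossy compression of such a matrix could look like
(this is a stand-in, a truncated SVD via numpy, not the specific regression
technique I have in mind):

    # Illustration only: compress a T x T frequency matrix to rank-k factors,
    # storing O(k*T) numbers instead of T^2, and reconstruct cells on demand.
    import numpy as np

    def compress(freq, k):
        u, s, vt = np.linalg.svd(np.asarray(freq, dtype=float),
                                 full_matrices=False)
        return u[:, :k], s[:k], vt[:k, :]

    def approx_cell(u, s, vt, i, j):
        # approximate frequency of cell (i, j) from the compressed factors
        return float(u[i, :] @ (s * vt[:, j]))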

I started this thread knowing that past discussions have posed the
following questions:

1. What sorts of cross-column data can we really use?
2. Can we collect that information?
3. How do we know what columns to track?

For what it's worth, my original question was whether anyone had
concerns beyond these, and I think that has been fairly well answered
in this thread.

- Josh / eggyknap


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 02:38:44
Message-ID: 6877.1224211124@sss.pgh.pa.us
Lists: pgsql-hackers

"Joshua Tolley" <eggyknap(at)gmail(dot)com> writes:
> For what it's worth, neither version of correlation was what I had in
> mind. Statistical correlation between two variables is a single
> number, is fairly easy to calculate, and probably wouldn't help query
> plans much at all. I'm more interested in a more complex data
> gathering. The data model I have in mind (which I note I have *not*
> proven to actually help a large number of query plans -- that's
> obviously an important part of what I'd need to do in all this)
> involves instead a matrix of frequency counts.

Oh, I see the distinction you're trying to draw. Agreed on both points:
a simple correlation number is pretty useless to us, and we don't have
hard evidence that a histogram-matrix will solve the problem. However,
we do know that one-dimensional histograms of the sort currently
collected by ANALYZE work pretty well (if they're detailed enough).
It seems reasonable to me to extrapolate that same concept to two or
more dimensions. The issue then becomes that a "detailed enough"
matrix might be impracticably bulky, so you need some kind of lossy
compression, and we don't have hard evidence about how well that will
work. Nonetheless the road map seems clear enough to me.

> Right now our
> "histogram" values are really quantiles; the statistics_target T for a
> column determines a number of quantiles we'll keep track of, and we
> grab values from into an ordered list L so that approximately 1/T of
> the entries in that column fall between values L[n] and L[n+1]. I'm
> thinking that multicolumn statistics would instead divide the range of
> each column up into T equally sized segments,

Why would you not use the same histogram bin bounds derived for the
scalar stats (along each axis of the matrix, of course)? This seems to
me to be arbitrarily replacing something proven to work with something
not proven. Also, the above forces you to invent a concept of "equally
sized" ranges, which is going to be pretty bogus for a lot of datatypes.

regards, tom lane


From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 03:17:03
Message-ID: e7e0a2570810162017q5d778b87n600c21874162cdd4@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 16, 2008 at 8:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Joshua Tolley" <eggyknap(at)gmail(dot)com> writes:
>> For what it's worth, neither version of correlation was what I had in
>> mind. Statistical correlation between two variables is a single
>> number, is fairly easy to calculate, and probably wouldn't help query
>> plans much at all. I'm more interested in a more complex data
>> gathering. The data model I have in mind (which I note I have *not*
>> proven to actually help a large number of query plans -- that's
>> obviously an important part of what I'd need to do in all this)
>> involves instead a matrix of frequency counts.
>
> Oh, I see the distinction you're trying to draw. Agreed on both points:
> a simple correlation number is pretty useless to us, and we don't have
> hard evidence that a histogram-matrix will solve the problem. However,
> we do know that one-dimensional histograms of the sort currently
> collected by ANALYZE work pretty well (if they're detailed enough).
> It seems reasonable to me to extrapolate that same concept to two or
> more dimensions. The issue then becomes that a "detailed enough"
> matrix might be impracticably bulky, so you need some kind of lossy
> compression, and we don't have hard evidence about how well that will
> work. Nonetheless the road map seems clear enough to me.

Oh goodie -- I feel so validated :)

>
>> Right now our
>> "histogram" values are really quantiles; the statistics_target T for a
>> column determines a number of quantiles we'll keep track of, and we
>> grab values from into an ordered list L so that approximately 1/T of
>> the entries in that column fall between values L[n] and L[n+1]. I'm
>> thinking that multicolumn statistics would instead divide the range of
>> each column up into T equally sized segments,
>
> Why would you not use the same histogram bin bounds derived for the
> scalar stats (along each axis of the matrix, of course)? This seems to
> me to be arbitrarily replacing something proven to work with something
> not proven. Also, the above forces you to invent a concept of "equally
> sized" ranges, which is going to be pretty bogus for a lot of datatypes.

Because I'm trying to picture geometrically how this might work for
the two-column case, and hoping to extend that to more dimensions, and
am finding that picturing a quantile-based system like the one we have
now in multiple dimensions is difficult. I believe those are the same
difficulties Gregory Stark mentioned having in his first post in this
thread. But of course that's an excellent point, that what we do now
is proven. I'm not sure which problem will be harder to solve -- the
weird geometry or the "equally sized ranges" for data types where that
makes no sense.

One option that came to mind recently would be to have statistics for
a subset of the rows in a single column A, where that subset is
defined by conditions on some other column B with a defined
relationship to A (for instance, that they're in the same table).
There are all kinds of complexities possible with that idea, but one
bright spot is that the values in the catalog and the way the planner
makes use of them would stay essentially the same as they are now.
That one's interesting to think about, but probably too complex for
real use. Anyway, I'll keep pondering.

- Josh / eggyknap


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 06:24:21
Message-ID: 20081017062421.GA1443@svana.org
Lists: pgsql-hackers

On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote:
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult.

Just a note: using a multidimensional histogram will work well for
cases like (startdate,enddate) where the histogram will show a
clustering of values along the diagonal. But it will fail for the case
(zipcode,state) where one implies the other. Histogram-wise you're not
going to see any correlation at all, but what you want to know is:

count(distinct zipcode,state) = count(distinct zipcode)

So you might need to think about storing/searching for different kinds
of correlation.
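
A minimal sketch of that check over a sample, in Python (column names as in
the example above; not actual ANALYZE code):

    # Does zipcode functionally determine state in the sampled rows?
    def zip_implies_state(rows):
        # rows: sampled (zipcode, state) pairs
        distinct_pairs = {(z, s) for z, s in rows}
        distinct_zips = {z for z, _ in rows}
        return len(distinct_pairs) == len(distinct_zips)   # each zip -> one state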

Secondly, my feeling about multidimensional histograms is that you're
not going to need the matrix to have 100 bins along each axis, but that
it'll be enough to have 1000 bins total. The cases where we get it
wrong enough for people to notice will probably be the same cases where
the histogram will have noticeable variation even for a small number of
bins.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 06:41:40
Message-ID: 20081017064140.GB1443@svana.org
Lists: pgsql-hackers

On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
> Correlation is the wrong tool. In fact zip codes and city have nearly
> zero correlation. Zip codes near 00000 are no more likely to be in
> cities starting with A than Z.

I think we need to define our terms better. In terms of linear
correlation you are correct. However, you can define invertible mappings
from zip codes and cities onto the integers which will then have an
almost perfect correlation.

According to a paper I found this is related to the "principle of
maximum entropy". The fact that you can't determine such functions
easily in practice doesn't change the fact that zip codes and city
names are highly correlated.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 08:00:42
Message-ID: 8763nrhcr9.fsf@oxford.xeocode.com
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:

> On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
>> Correlation is the wrong tool. In fact zip codes and city have nearly
>> zero correlation. Zip codes near 00000 are no more likely to be in
>> cities starting with A than Z.
>
> I think we need to define our terms better. In terms of linear
> correlation you are correct. However, you can define invertable mappings
> from zip codes and cities onto the integers which will then have an
> almost perfect correlation.
>
> According to a paper I found this is related to the "principle of
> maximum entropy". The fact that you can't determine such functions
> easily in practice doesn't change the fact that zip codes and city
> names are highly correlated.

They're certainly very much not independent variables. There are lots of ways
of measuring how much dependence there is between them. I don't know enough
about the math to know if your maps are equivalent to any of them.

In any case as I described it's not enough information to know that the two
data sets are heavily dependent. You need to know for which pairs (or ntuples)
that dependency results in a higher density and for which it results in lower
density and how much higher or lower. That seems like a lot of information to
encode (and a lot to find in the sample).

Perhaps just knowing that there's a dependence between two data sets
might be somewhat useful if the planner kept a confidence value for all its
estimates. It would know to have a lower confidence value for estimates coming
from highly dependent clauses? It wouldn't be very easy for the planner to
distinguish between "safe" plans for low-confidence estimates and "risky" plans
which might blow up if the estimates are wrong though. And of course that's a lot
less interesting than just getting better estimates :)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Richard Huxton <dev(at)archonet(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 11:17:49
Message-ID: 48F8745D.1050205@archonet.com
Lists: pgsql-hackers

Gregory Stark wrote:
> They're certainly very much not independent variables. There are lots of ways
> of measuring how much dependence there is between them. I don't know enough
> about the math to know if your maps are equivalent to any of them.

I think "dependency" captures the way I think about it rather than
correlation (although I can see there must be a function that could map
that dependency onto how we think of correlations).

> In any case as I described it's not enough information to know that the two
> data sets are heavily dependent. You need to know for which pairs (or ntuples)
> that dependency results in a higher density and for which it results in lower
> density and how much higher or lower. That seems like a lot of information to
> encode (and a lot to find in the sample).

Like Josh Berkus mentioned a few points back, it's the handful of
plan-changing values you're looking for.

So, it seems like we've got:
1. Implied dependencies: zip-code=>city
2. Implied+constraint: start-date < end-date and the difference between
the two is usually less than a week
3. "Top-heavy" foreign-key stats.

#1 and #2 obviously need new infrastructure.

From a non-dev point of view it looks like #3 could use the existing
stats on each side of the join. I'm not sure whether you could do
anything meaningful for joins that don't explicitly specify one side of
the join though.

> Perhaps just knowing whether that there's a dependence between two data sets
> might be somewhat useful if the planner kept a confidence value for all its
> estimates. It would know to have a lower confidence value for estimates coming
> from highly dependent clauses? It wouldn't be very easy for the planner to
> distinguish "safe" plans for low confidence estimates and "risky" plans which
> might blow up if the estimates are wrong though. And of course that's a lot
> less interesting than just getting better estimates :)

If we could abort a plan and restart, then we could just try the
quick-but-risky plan and, if we reach 50 rows rather than the expected 10,
try a different approach. That way we'd not need to gather stats, just
react to the situation in individual queries.

--
Richard Huxton
Archonet Ltd


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 12:46:11
Message-ID: 14436.1224247571@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> Just a note: using a multidimensional histograms will work well for the
> cases like (startdate,enddate) where the histogram will show a
> clustering of values along the diagonal. But it will fail for the case
> (zipcode,state) where one implies the other. Histogram-wise you're not
> going to see any correlation at all

Huh? Sure you are. What the histogram will show is that there is only
one state value per zipcode, and only a limited subset of zipcodes per
state. The nonempty cells won't cluster along the "diagonal" but we
don't particularly care about that.

What we really want from this is to not think that
WHERE zip = '80210' AND state = 'CA'
is significantly more selective than just
WHERE zip = '80210'
A histogram is certainly capable of telling us that. Whether it's the
most compact representation is another question of course --- in an
example like this, only about 1/50th of the cells would contain nonzero
counts ...

regards, tom lane


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 17:28:10
Message-ID: 48F8CB2A.6010805@cheapcomplexdevices.com
Lists: pgsql-hackers

Tom Lane wrote:
> A bad estimate for physical-position correlation has only limited
> impact,

Ah! This seems very true with 8.3 but much less true with 8.0.

On a legacy 8.0 system I have a hard time avoiding cases where
a query like
"select * from addresses where add_state_or_province = 'CA';"
does a 2-second full-table scan instead of a 300ms index scan
thanks to a poor physical order guess.

I just sucked the table into 8.3 and am pleased to say that
it picks a 200ms bitmap scan even with the misleading correlation.

Thanks for bitmap scans guys!

I'll shut up about this physical ordering stuff now
and try to do better upgrading before posting.


From: "Nathan Boley" <npboley(at)gmail(dot)com>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-17 21:47:31
Message-ID: 6fa3b6e20810171447o43c5d28ar3bb98e2cf5b47e5a@mail.gmail.com
Lists: pgsql-hackers

>>> Right now our
>>> "histogram" values are really quantiles; the statistics_target T for a
>>> column determines a number of quantiles we'll keep track of, and we
>>> grab values from into an ordered list L so that approximately 1/T of
>>> the entries in that column fall between values L[n] and L[n+1]. I'm
>>> thinking that multicolumn statistics would instead divide the range of
>>> each column up into T equally sized segments,
>>
>> Why would you not use the same histogram bin bounds derived for the
>> scalar stats (along each axis of the matrix, of course)? This seems to
>> me to be arbitrarily replacing something proven to work with something
>> not proven. Also, the above forces you to invent a concept of "equally
>> sized" ranges, which is going to be pretty bogus for a lot of datatypes.
>
> Because I'm trying to picture geometrically how this might work for
> the two-column case, and hoping to extend that to more dimensions, and
> am finding that picturing a quantile-based system like the one we have
> now in multiple dimensions is difficult. I believe those are the same
> difficulties Gregory Stark mentioned having in his first post in this
> thread. But of course that's an excellent point, that what we do now
> is proven. I'm not sure which problem will be harder to solve -- the
> weird geometry or the "equally sized ranges" for data types where that
> makes no sense.
>

Look at copulas. They are a completely general method of describing
the dependence between two marginal distributions. It seems silly to
rewrite the stats table in terms of joint distributions when we'll
still need the marginals anyway. Also, it might be easier to think of
the dimension reduction problem in that form.


From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Nathan Boley" <npboley(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-18 00:47:38
Message-ID: e7e0a2570810171747y1dcfd9fck534320ca4ad61cd0@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 17, 2008 at 3:47 PM, Nathan Boley <npboley(at)gmail(dot)com> wrote:
>>>> Right now our
>>>> "histogram" values are really quantiles; the statistics_target T for a
>>>> column determines a number of quantiles we'll keep track of, and we
>>>> grab values from into an ordered list L so that approximately 1/T of
>>>> the entries in that column fall between values L[n] and L[n+1]. I'm
>>>> thinking that multicolumn statistics would instead divide the range of
>>>> each column up into T equally sized segments,
>>>
>>> Why would you not use the same histogram bin bounds derived for the
>>> scalar stats (along each axis of the matrix, of course)? This seems to
>>> me to be arbitrarily replacing something proven to work with something
>>> not proven. Also, the above forces you to invent a concept of "equally
>>> sized" ranges, which is going to be pretty bogus for a lot of datatypes.
>>
>> Because I'm trying to picture geometrically how this might work for
>> the two-column case, and hoping to extend that to more dimensions, and
>> am finding that picturing a quantile-based system like the one we have
>> now in multiple dimensions is difficult. I believe those are the same
>> difficulties Gregory Stark mentioned having in his first post in this
>> thread. But of course that's an excellent point, that what we do now
>> is proven. I'm not sure which problem will be harder to solve -- the
>> weird geometry or the "equally sized ranges" for data types where that
>> makes no sense.
>>
>
> Look at copulas. They are a completely general method of describing
> the dependence between two marginal distributions. It seems silly to
> rewrite the stats table in terms of joint distributions when we'll
> still need the marginals anyways. Also, It might be easier to think of
> the dimension reduction problem in that form.
>

I'm still working my way around the math, but copulas sound better
than anything else I've been playing with. What's more, there are
plenty of existing implementations to refer to, provided it's done in
a licensing-friendly way. A multidimensional extension of our existing
stuff, at least in the ways I've been thinking of it, quickly becomes
a recursive problem -- perhaps some dynamic programming solution would
solve it, but copulas seem the more common solution for similar
problems in other fields. Thanks.

- Josh / eggyknap


From: "Nathan Boley" <npboley(at)gmail(dot)com>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-18 01:54:39
Message-ID: 6fa3b6e20810171854p4bd1efe5o8e16653a52176727@mail.gmail.com
Lists: pgsql-hackers

> I'm still working my way around the math, but copulas sound better
> than anything else I've been playing with.

I think the easiest way to think of them is that, in 2-D finite spaces,
they are just a plot of the order statistics against one another. Feel
free to mail me off list if you have any math questions.
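
For what it's worth, a toy Python sketch of an empirical copula grid built
from ranks (just to illustrate the idea; ties and sampling details glossed
over):

    # Toy sketch: empirical copula as a grid of rank-pair frequencies. The
    # per-column (marginal) stats stay where they are; this stores only the
    # dependence structure.
    def empirical_copula(rows, bins):
        def ranks(vals):
            order = sorted(range(len(vals)), key=lambda i: vals[i])
            r = [0.0] * len(vals)
            for rank, i in enumerate(order):
                r[i] = rank / max(len(vals) - 1, 1)   # scale ranks to [0, 1]
            return r
        rx = ranks([a for a, _ in rows])
        ry = ranks([b for _, b in rows])
        grid = [[0] * bins for _ in range(bins)]
        for u, v in zip(rx, ry):
            grid[min(int(u * bins), bins - 1)][min(int(v * bins), bins - 1)] += 1
        return grid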

I've previously thought that, at least in the 2D case, we could use
image compression algorithms to compress the copula, but recently I've
realized that this is a change point problem. In terms of compression,
we want to decompose the copula into regions that are as homogeneous as
possible. I'm not familiar with change point problems in multiple
dimensions, but I'll try and ask someone that is, probably late next
week. If you decide to go the copula route, I'd be happy to write the
decomposition algorithm - or at least work on the theory.
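
For flavor, the crudest version of that decomposition might look like
the sketch below (Python, untested; a greedy quadtree split on a count
grid rather than a real change point method, and the tolerance is a
placeholder). The idea is to store one (bounds, average) tuple per
region instead of every cell.

    def decompose(grid, r0, r1, c0, c1, tol, out):
        cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        mean = sum(cells) / len(cells)
        # stop when the region is roughly homogeneous or a single cell
        if max(cells) - min(cells) <= tol or (r1 - r0 <= 1 and c1 - c0 <= 1):
            out.append((r0, r1, c0, c1, mean))
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        for rs, re in ((r0, rm), (rm, r1)):
            for cs, ce in ((c0, cm), (cm, c1)):
                if re > rs and ce > cs:
                    decompose(grid, rs, re, cs, ce, tol, out)

    def compress_copula(grid, tol=2):
        out = []
        decompose(grid, 0, len(grid), 0, len(grid[0]), tol, out)
        return out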

Finally, a couple of points that I hadn't seen mentioned earlier and
that should probably be considered:

1) NULLs need to be treated specially - I suspect the assumption of
NULL independence is worse than other independence assumptions. Maybe
dealing with NULL dependence could be a half step towards full
dependence calculations?

2) Do we want to fold the MCVs into the dependence histogram? That
will cause problems in our copula approach, but I'd hate to have to
keep an N^d histogram dependence relation in addition to the copula.

3) For equality selectivity estimates, I believe the assumption that
the ndistinct value distribution is uniform within the histogram will
become worse as the dimension increases. I earlier proposed keeping
track of ndistinct per histogram bucket in the marginal case, partially
motivated by this exact scenario. Does that proposal make more sense
in this case? If so, we'd need to store two distributions - one of the
counts and one of ndistinct.

4) How will this approach deal with histogram buckets that have
scaling count sizes (i.e., -0.4)?


From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Nathan Boley" <npboley(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-18 20:33:06
Message-ID: e7e0a2570810181333y5d0432f8ld6e0dea5fa8cf285@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 17, 2008 at 7:54 PM, Nathan Boley <npboley(at)gmail(dot)com> wrote:
>> I'm still working my way around the math, but copulas sound better
>> than anything else I've been playing with.
>
> I think the easiest way to think of them is that, in 2-D finite
> spaces, they are just a plot of the order statistics against one
> another. Feel free to mail me off list if you have any math questions.

I'm still working out what a copula is, and now I've got to figure out
the stuff below, too! :) /me has reading to do... The treatment of
copulae in one of my stats texts is much better than Wikipedia's, I
think, so I'm progressing. The more I figure out what they are, the
more they sound like an excellent potential solution, for whatever my
opinion is worth, but making sure they work decently for all data
types, including those where there's no real notion of distance, may
be interesting. I still need to go through backend/utils/adt/selfuncs.c
to figure out exactly how we use the one-dimensional values.
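
My current mental model, which I still need to check against the real
code (rough Python, untested; not selfuncs.c, and it ignores MCVs and
nulls): for "col < const" with an equi-depth histogram, each whole
bucket below the constant contributes 1/B, and the bucket the constant
lands in gets linearly interpolated -- which is exactly where a
distance metric sneaks in.

    from bisect import bisect_right

    def lt_selectivity(bounds, const):
        # bounds is the ordered list of B+1 histogram boundaries
        b = len(bounds) - 1
        if const <= bounds[0]:
            return 0.0
        if const >= bounds[-1]:
            return 1.0
        i = bisect_right(bounds, const) - 1   # bucket containing const
        lo, hi = bounds[i], bounds[i + 1]
        frac = (const - lo) / (hi - lo) if hi > lo else 0.5
        return (i + frac) / b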

> I've previously thought that, at least in the 2D case, we could use
> image compression algorithms to compress the copula, but recently I've
> realized that this is a change point problem. In terms of compression,
> we want to decompose the copula into regions that are as homogeneous as
> possible. I'm not familiar with change point problems in multiple
> dimensions, but I'll try to ask someone who is, probably late next
> week. If you decide to go the copula route, I'd be happy to write the
> decomposition algorithm - or at least work on the theory.
>
> Finally, a couple of points that I hadn't seen mentioned earlier and
> that should probably be considered:
>
> 1) NULLs need to be treated specially - I suspect the assumption of
> NULL independence is worse than other independence assumptions. Maybe
> dealing with NULL dependence could be a half step towards full
> dependence calculations?

Agreed, though how to treat them I have no idea.

>
> 2) Do we want to fold the MCVs into the dependence histogram? That
> will cause problems in our copula approach, but I'd hate to have to
> keep an N^d histogram dependence relation in addition to the copula.

Yeah, if we're already trying to figure out how to compress copulae,
having also to compress MCV matrices seems painful and error-prone.
But I'm not sure why it would cause problems to keep them in the
copula -- is that just because we are most interested in the copula
modeling the parts of the distribution that are most sparsely
populated?

> 3) For equality selectivity estimates, I believe the assumption that
> the ndistinct value distribution is uniform within the histogram will
> become worse as the dimension increases. I earlier proposed keeping
> track of ndistinct per histogram bucket in the marginal case, partially
> motivated by this exact scenario. Does that proposal make more sense
> in this case? If so, we'd need to store two distributions - one of the
> counts and one of ndistinct.

It's definitely worth investigating (or seeing if someone else has
already published an investigation). There are probably several
similar stats we could track and make use of, provided using them
didn't slow the planner down too much.
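
To make sure I follow the proposal, I imagine something like this
(rough Python, untested, names invented): alongside the per-bucket row
counts, also record how many distinct values the sample shows in each
bucket.

    from bisect import bisect_left

    def ndistinct_per_bucket(sorted_sample, bounds):
        # bounds has B+1 entries; returns B distinct-value counts,
        # treating the last bucket as closed on the right
        counts = []
        for n, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
            start = bisect_left(sorted_sample, lo)
            if n == len(bounds) - 2:
                bucket = sorted_sample[start:]
            else:
                bucket = sorted_sample[start:bisect_left(sorted_sample, hi)]
            counts.append(len(set(bucket)))
        return counts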

> 4) How will this approach deal with histogram buckets that have
> scaling count sizes ( ie -0.4 )?

I'm not sure what you mean here.

- Josh / eggyknap


From: "Nathan Boley" <npboley(at)gmail(dot)com>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Subject: Re: Cross-column statistics revisited
Date: 2008-10-19 17:09:47
Message-ID: 6fa3b6e20810191009n8fb56b7xb34b4e783c7e0ff0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I still need to go through backend/utils/adt/selfuncs.c
> to figure out exactly how we use the one-dimensional values.
>

Here's a page that helped me figure all this out.

http://www.postgresql.org/docs/8.1/static/planner-stats-details.html

>>
>> 2) Do we want to fold the MCVs into the dependence histogram? That
>> will cause problems in our copula approach, but I'd hate to have to
>> keep an N^d histogram dependence relation in addition to the copula.
>
> Yeah, if we're already trying to figure out how to compress copulae,
> having also to compress MCV matrices seems painful and error-prone.
> But I'm not sure why it would cause problems to keep them in the
> copula -- is that just because we are most interested in the copula
> modeling the parts of the distribution that are most sparsely
> populated?
>

The problem I was thinking of is that we don't currently store the
true marginal distribution. As it stands, histograms only include
non-MCV values. So we would either need to take the MCVs separately
(which would assume independence between MCVs and non-MCV values) or
store multiple histograms.
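
In the 1-D case the split is harmless because the equality estimate
already treats the two populations separately -- roughly like this
(Python, untested, my paraphrase rather than the real var_eq_const()):

    def eq_selectivity(const, mcvs, mcv_freqs, nullfrac, ndistinct):
        if const in mcvs:
            return mcv_freqs[mcvs.index(const)]
        # spread the non-MCV, non-null mass evenly over the remaining
        # distinct values
        leftover = 1.0 - sum(mcv_freqs) - nullfrac
        return leftover / max(ndistinct - len(mcvs), 1)

In 2-D we'd have to decide whether the MCV population and the leftover
population can be treated independently, or whether each one needs its
own dependence structure.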

>> 4) How will this approach deal with histogram buckets that have
>> scaling count sizes (i.e., -0.4)?
>
> I'm not sure what you mean here.
>

That was more of a note to myself, and should have been numbered 3.5.
ndistinct estimates currently start to scale once the row/ndistinct
ratio is large enough. If we try to model ndistinct, we need to deal
with scaling ndistinct counts somehow. But that's way off in the
future; it was probably pointless to mention it.
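
By "scaling" I mean the convention of storing a negative n_distinct as
a fraction of the row count, so the estimate grows with the table.
Something like this toy decoder (Python, untested) is all I had in
mind:

    def decode_ndistinct(stored, ntuples):
        if stored < 0:
            # e.g. -0.4 with 1,000,000 rows -> 400,000 distinct values
            return -stored * ntuples
        return stored  # otherwise it's an absolute count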