Cross-column statistics revisited

From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Cross-column statistics revisited
Date: 2008-10-15 10:53:10
Message-ID: e7e0a2570810150353h49826e8bh9dde3676109a7574@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been interested in what it would take to start tracking
cross-column statistics. A review of the mailing lists as linked from
the TODO item on the subject [1] suggests the following concerns:

1) What information exactly would be tracked?
2) How would it be kept from exploding in size?
3) For which combinations of columns would statistics be kept?

The major concern in #1 seemed to be that the most suitable form for
keeping most common value lists, histograms, etc. is in an array, and
at the time of the posts I read, arrays of composite types weren't
possible. This seems much less of a concern now -- perhaps in greatest
part because a test I just did against a recent 8.4devel sure makes it
look like stats on composite type columns aren't even kept. The most
straightforward is that we'd keep a simple multi-dimensional
histogram, but that leads to a discussion of #2.

#2 is an interesting problem, and possibly the most difficult from a
theoretical perspective. Provided a fair expectation that the results
would be useful, I'd like to play with various forms of statistical
regressions to compress large cross-column histograms. But I'm not
sure I want to spend the time if there are other serious issues with
the whole idea of cross-column stats. Another possibility involves not
storing a histogram at all, but instead storing an array of quantiles
for each column. In other words, if S represents the statistics target
of a column, we'd have an array A of S+1 entries of the type in
question, where the beginning and end of the array are the minimum and
maximum values in the column, and the range between any two
consecutive array elements A[n] and A[n+1] contains 1/S of the total
entries in the column. For a uniformly distributed column, these array
elements would be evenly spaced across the total range of the column;
the difference between consecutive array elements would indicate
deviations from this distribution. Unfortunately this only works if
there's a valid distance metric for the data type in question.

In the archives, #3 was raised frequently as a concern -- obviously
tracking stats on all permutations of the columns in a database is a
non-starter. However there seemed to be little argument over sensible
defaults, namely that columns involved in a multi-column index would
have cross-column statistics kept automatically, and the user would be
allowed to instruct the server to keep stats on other arbitrary groups
of columns. There was some discussion about the fact that the user
would have to be smart about how (s)he used this feature, but there
are enough other potential foot guns in the system of similar
magnitude that worrying too much about that seems of little use.

I bring all this up because I'm interested in looking at how this
might be done, and perhaps doing it if 1) time permits, and 2) someone
else doesn't beat me to it. Comments, anyone? Am I missing something
obvious/serious/etc.? Thanks in advance.

- Josh / eggyknap

[1] "Allow accurate statistics to be collected on indexes with more
than one column or expression indexes, perhaps using per-index
statistics", http://wiki.postgresql.org/wiki/Todo

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2008-10-15 13:16:38 Re: 8.3 .4 + Vista + MingW + initdb = ACCESS_DENIED
Previous Message Dave Page 2008-10-15 09:59:15 Re: 8.3 .4 + Vista + MingW + initdb = ACCESS_DENIED