Re: WIP: multivariate statistics / proof of concept

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2014-12-11 20:07:57
Message-ID: 5489F99D.60302@fuzzy.cz
Lists: pgsql-hackers

On 11.12.2014 17:53, Heikki Linnakangas wrote:
> On 10/13/2014 01:00 AM, Tomas Vondra wrote:
>> Hi,
>>
>> attached is a WIP patch implementing multivariate statistics.
>
> Great! Really glad to see you working on this.
>
>> + * FIXME This sample sizing is mostly OK when computing stats for
>> + * individual columns, but when computing multivariate stats
>> + * (histograms, mcv, ...) it's rather insufficient. For a small
>> + * number of dimensions it works, but for complex stats it'd be
>> + * nice to use a sample proportional to the table (say, 0.5% - 1%)
>> + * instead of a fixed size.
>
> I don't think a fraction of the table is appropriate. As long as the
> sample is random, the accuracy of a sample doesn't depend much on
> the size of the population. For example, if you sample 1,000 rows
> from a table with 100,000 rows, or 1000 rows from a table with
> 100,000,000 rows, the accuracy is pretty much the same. That doesn't
> change when you go from a single variable to multiple variables.

I might be wrong, but I doubt that. First, I read a number of papers
while working on this patch, and all of them used samples proportional
to the data set. That's indirect evidence, though.

> You do need a bigger sample with multiple variables, however. My gut
> feeling is that if you sample N rows for a single variable, with two
> variables you need to sample N^2 rows to get the same accuracy. But
> it's not proportional to the table size. (I have no proof for that,
> but I'm sure there is literature on this.)

Maybe. I think it's related to the number of buckets (which determines
the precision of the histogram). If you want 1000 buckets, the number
of rows scanned needs to be, say, 10x that. With multivariate
histograms, we may shoot for more buckets (say, 100 in each dimension).
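
To put rough numbers on that (just an illustration of the rule of thumb
above, i.e. sampling roughly 10x the number of buckets):

    1 dimension,  1000 buckets              ->     ~10,000 sampled rows
    2 dimensions, 100^2 = 10,000 buckets    ->    ~100,000 sampled rows
    3 dimensions, 100^3 = 1,000,000 buckets -> ~10,000,000 sampled rows

So the sample needed grows quickly with the number of dimensions.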

>
>> + * Multivariate histograms
>> + *
>> + * Histograms are a collection of buckets, represented by n-dimensional
>> + * rectangles. Each rectangle is delimited by an array of lower and
>> + * upper boundaries, so that for the i-th attribute
>> + *
>> + * min[i] <= value[i] <= max[i]
>> + *
>> + * Each bucket tracks frequency (fraction of tuples it contains),
>> + * information about the inequalities, number of distinct values in
>> + * each dimension (which is used when building the histogram) etc.
>> + *
>> + * The boundaries may be either inclusive or exclusive, or the whole
>> + * dimension may be NULL.
>> + *
>> + * The buckets may overlap (assuming the build algorithm keeps the
>> + * frequencies additive) or may not cover the whole space (i.e. allow
>> + * gaps). This entirely depends on the algorithm used to build the
>> + * histogram.
>
> That sounds pretty exotic. These buckets are quite different from
> the single-dimension buckets we currently have.
>
> The paper you reference in partition_bucket() function, M.
> Muralikrishna, David J. DeWitt: Equi-Depth Histograms For Estimating
> Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference
> 1988: 28-36, actually doesn't mention overlapping buckets at all. I
> haven't read the code in detail, but if it implements the algorithm
> from that paper, there will be no overlap.

The algorithm implemented in partition_bucket() is very simple and
naive, and it mostly resembles the algorithm described in the paper. I'm
sure there are differences (it's not a 1:1 implementation), but you're
right that it produces non-overlapping buckets.
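
Just to illustrate the general idea of one equi-depth split step, here
is a minimal sketch (not the actual partition_bucket() code; it assumes
the sampled values for the chosen dimension were already extracted into
a plain double array):

    #include <stdlib.h>

    /* qsort comparator for doubles */
    static int
    cmp_double(const void *a, const void *b)
    {
        double da = *(const double *) a;
        double db = *(const double *) b;

        return (da > db) - (da < db);
    }

    /*
     * Split a bucket along the chosen dimension at the median of the
     * sampled values, so that both halves contain roughly the same
     * number of sample rows. Applying this repeatedly to the bucket
     * that most needs splitting yields non-overlapping, roughly
     * equal-frequency buckets.
     */
    static double
    equidepth_split_point(double *values, int nvalues)
    {
        qsort(values, nvalues, sizeof(double), cmp_double);

        return values[nvalues / 2];
    }

The real code of course has to deal with Datum values and per-type
comparison functions rather than doubles, but the partitioning is
essentially a sequence of such median splits.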

The point is that I envision more complex algorithms or different
histogram types, and some of them may produce overlapping buckets. Maybe
that's a premature comment, and it will turn out it's not really necessary.
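
For illustration, the per-bucket information described in the quoted
comment boils down to something like this (a simplified sketch, with
doubles standing in for the actual Datum boundaries; not the struct
from the patch):

    #include <stdbool.h>

    typedef struct MVBucketSketch
    {
        int     ndimensions;    /* number of attributes covered */

        double *min;            /* lower boundary for each dimension */
        double *max;            /* upper boundary for each dimension */
        bool   *min_inclusive;  /* min[i] <= value[i] vs. min[i] < value[i] */
        bool   *max_inclusive;  /* value[i] <= max[i] vs. value[i] < max[i] */
        bool   *nullsonly;      /* the whole dimension is NULL */

        int    *ndistinct;      /* distinct values per dimension (build only) */
        double  frequency;      /* fraction of tuples falling into the bucket */
    } MVBucketSketch;

Nothing in such a representation prevents two buckets from overlapping
or leaving gaps; whether that happens (and whether the frequencies stay
additive) is entirely up to the build algorithm.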

regards
Tomas
