WIP: collect frequency statistics for arrays

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: WIP: collect frequency statistics for arrays
Date: 2011-02-23 15:00:09
Message-ID: AANLkTin02SOXNzFxju74Hf-uqhAmymR0-hyAodi4G0Of@mail.gmail.com
Lists: pgsql-hackers

A WIP patch for collecting statistics on arrays is attached. It largely
copies the statistics collection for tsvector, with the following
differences:
1) The default comparison, hash and equality functions for the element data
type are used (taken from the corresponding default operator classes).
2) The @> and && operators ignore how many times an element occurs in an
array, e.g. '{1}'::int[] @> '{1,1}'::int[] is true. That is why the
statistics collection and selectivity estimation functions count each
distinct element of an array only once.
3) array_typanalyze collects the frequency of null elements as a separate
value (like the maximum and minimum frequencies in ts_typanalyze). Currently
it is not used in selectivity estimation, but it may be useful in the future.
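
To illustrate points 2 and 3, here is a minimal Python sketch of the counting
rule. It is not the patch code: it counts exactly instead of using the kind
of approximate counting an analyze function would need on a large sample.
The function name collect_element_stats is hypothetical, and defining the
null frequency as the fraction of arrays that contain at least one null
element is an assumption.

```python
from collections import Counter


def collect_element_stats(arrays):
    """Return (element_freqs, null_freq) for a sample of arrays.

    Each frequency is the fraction of sampled arrays that contain the
    element at least once, so duplicates inside one array count once.
    """
    element_counts = Counter()
    arrays_with_null = 0
    for arr in arrays:
        distinct = set(arr)            # collapse repeated elements
        if None in distinct:
            arrays_with_null += 1      # null elements are tracked separately
            distinct.discard(None)
        element_counts.update(distinct)

    total = len(arrays)
    if total == 0:
        return {}, 0.0
    freqs = {elem: count / total for elem, count in element_counts.items()}
    return freqs, arrays_with_null / total


# Hypothetical sample: None stands in for a null array element.
sample = [[1, 1, 2], [1, None], [2, 3, 3, 3], [1]]
freqs, null_freq = collect_element_stats(sample)
```

With this sample, element 1 gets frequency 0.75 even though it appears four
times in total, because it is present in three of the four arrays.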

I have also run into the following problems:
1) Selectivity estimation for the ANY and ALL keywords does not seem as easy
as for operators, because their selectivity is estimated inside the planner.
So the planner would have to be modified to do selectivity estimation for
these keywords. Probably I'm missing something.
2) I didn't implement selectivity estimation for the "column <@ const" and
"column = const" cases. The problem with the "column <@ const" case is that
we need to estimate the frequency of occurrence of any element not in const.
We could try to collect a statistic on the frequency of all elements that
are not among the most common elements, based on the assumption that they
occur independently. But I'm not sure that this statistic would be precise
enough. The "column = const" case has a further problem: the @> and &&
operators ignore element occurrence count and order, while the = operator
requires an exact match. That is why statistics for the @> and && operators
can be applied to = only very approximately.
3) I need to test selectivity estimation for arrays, but it is hard to
tell which distributions are typical for arrays. For example, we know
that the data in a tsvector is based on natural language, so we can assume
something about its distribution. But we know almost nothing about the data
in arrays, because an array can hold any data (a tsvector can also hold any
data, but using it for non-natural-language data is outside its purpose).
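
To make problem 2 more concrete, here is a rough Python sketch of how
per-element frequencies could be combined if elements are assumed to occur
independently. It is an illustration only, not what the patch does: the
function names, the default_freq parameter for elements missing from the
statistics, and the example frequencies are all hypothetical.

```python
from math import prod


def sel_contains(freqs, const, default_freq=0.0):
    """Estimate "column @> const": every distinct element of const is present."""
    return prod(freqs.get(e, default_freq) for e in set(const))


def sel_overlap(freqs, const, default_freq=0.0):
    """Estimate "column && const": at least one element of const is present."""
    return 1.0 - prod(1.0 - freqs.get(e, default_freq) for e in set(const))


def sel_contained_by(freqs, const):
    """Estimate "column <@ const": no element outside const is present.

    Only elements that appear in freqs are considered. Elements that are
    not tracked in the statistics are silently ignored, which is exactly
    the source of imprecision described in problem 2.
    """
    wanted = set(const)
    return prod(1.0 - f for e, f in freqs.items() if e not in wanted)


# Hypothetical frequencies: fraction of arrays containing each element.
example_freqs = {1: 0.75, 2: 0.5, 3: 0.25}

contains_est = sel_contains(example_freqs, [1, 1, 2])      # 0.75 * 0.5
overlap_est = sel_overlap(example_freqs, [2, 3])           # 1 - 0.5 * 0.75
contained_est = sel_contained_by(example_freqs, [1, 2])    # 1 - 0.25
```

Note that the duplicate 1 in the @> constant does not change the estimate,
which matches the operator semantics. None of these estimates carries over
directly to "column = const", since = also depends on element order and
occurrence count.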

------
With best regards,
Alexander Korotkov.

Attachment                  Content-Type        Size
arrayanalyze-0.1.patch.gz   application/x-gzip  9.8 KB
