Re: multivariate statistics (v19)

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, david(at)pgmasters(dot)net, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, jeff(dot)janes(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: multivariate statistics (v19)
Date: 2016-08-10 12:24:55
Message-ID: CAB7nPqTCmL5ojXi4u-FX-u+J5mvhUaqsyz7zyK+6eojz2Ynuog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> On 10/08/16 13:33, Tomas Vondra wrote:
>>
>> On 08/10/2016 06:41 AM, Michael Paquier wrote:
>>>
>>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra
>>>>
>>>> 2) combining multiple statistics
>>>>
>>>>
>>>> I think the ability to combine multivariate statistics (covering
>>>> different
>>>> subsets of conditions) is important and useful, but I'm starting to
>>>> think
>>>> that the current implementation may not be the correct one (which is
>>>> why I
>>>> haven't written the SGML docs about this part of the patch series yet).
>>>>
>>>> Assume there's a table "t" with 3 columns (a, b, c), and that we're
>>>> estimating query:
>>>>
>>>> SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>>>>
>>>> but that we only have two statistics (a,b) and (b,c). The current
>>>> patch does
>>>> about this:
>>>>
>>>> P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>>>>
>>>> i.e. it estimates the first two conditions using (a,b), and then
>>>> estimates
>>>> (c=3) using (b,c) with "b=2" as a condition. Now, this is very
>>>> efficient,
>>>> but it only works as long as the query contains conditions
>>>> "connecting" the
>>>> two statistics. So if we remove the "b=2" condition from the query, this
>>>> stops working.
>>>
>>>
>>> This is trying to make the algorithm smarter than the user, which is
>>> something I'd think we could live without. In this case statistics on
>>> (a,c) or (a,b,c) are missing. And what if the user does not want to
>>> make use of stats for (a,c) because he only defined (a,b) and (b,c)?
>>>
>>
>> I don't think so. Obviously, if you have statistics covering all the
>> conditions - great, we can't really do better than that.
>>
>> But there's a crucial relation between the number of dimensions of the
>> statistics and accuracy of the statistics. Let's say you have statistics
>> on 8 columns, and you split each dimension twice to build a histogram -
>> that's 256 buckets right there, and we only get ~50% selectivity in each
>> dimension (the actual histogram building algorithm is more complex, but
>> you get the idea).
>
> I think it makes sense to pursue this, but I also think we can easily live
> with not having it in the first version that gets committed and doing it as
> follow-up patch.

This patch is large and complicated enough. As this is not a mandatory
piece to get a basic support, I'd suggest just to drop that for later.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ants Aasma 2016-08-10 13:29:04 Re: multivariate statistics (v19)
Previous Message Michael Paquier 2016-08-10 12:23:29 Re: multivariate statistics (v19)