Re: WIP: multivariate statistics / proof of concept

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: multivariate statistics / proof of concept
Date: 2016-12-02 12:02:47
Message-ID: CAJrrPGe43PnrZ1nyTPnRY4hp6zr__kD5b7uhboU6yemvx0XAbg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 22, 2016 at 2:42 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On 11/21/2016 11:10 PM, Robert Haas wrote:
>
>> [ reviving an old multivariate statistics thread ]
>>
>> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>>> On 12 October 2014 23:00, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>>
>>>> It however seems to be working sufficiently well at this point, enough
>>>> to get some useful feedback. So here we go.
>>>>
>>>
>>> This looks interesting and useful.
>>>
>>> What I'd like to check before a detailed review is that this has
>>> sufficient applicability to be useful.
>>>
>>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>>> result of multi-column stats errors.
>>>
>>> Could you look at those queries and confirm that this patch can
>>> produce better plans for them?
>>>
>>
>> Tomas, did you ever do any testing in this area? One of my
>> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
>> 20 GB. Q18 actually doesn't complete at all right now because of an
>> issue with the new simplehash implementation. I reported it to Andres
>> and he tracked it down, but hasn't posted the patch yet - see
>> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp(at)alap3(dot)anarazel(dot)de
>>
>> Of the remaining queries, the slowest are Q9 and Q20, and both of them
>> have serious estimation errors. On Q9, things go wrong here:
>>
>>                ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47) (actual time=103592.821..149335.010 rows=6503988 loops=1)
>>                      Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>>                      Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>>                      Rows Removed by Join Filter: 19511964
>>                      ->  Index Scan using
>>
>> [snip]
>>
>> Rows Removed by Filter: 756627
>>
>> The estimate for the index scan on partsupp is essentially perfect,
>> and the lineitem-part join is off by about 3x. However, the merge
>> join is off by about 4000x, which is real bad.
>>
>>
> At this point the patch only deals with statistics on base relations - no
> joins. It's meant to be extended in that direction, and the syntax already
> allows for it, but for now that's all.
>
> That being said, this estimate should be improved in 9.6, when you create
> a foreign key between the tables. In fact, that patch was exactly about Q9.
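
For anyone reproducing this on a schema that lacks the constraint: the
foreign key in question is the standard composite one from lineitem to
partsupp. A minimal sketch, assuming the usual dbgen column names and
that partsupp's primary key already exists (the constraint name here is
arbitrary):

    ALTER TABLE lineitem
        ADD CONSTRAINT lineitem_partsupp_fkey
        FOREIGN KEY (l_partkey, l_suppkey)
        REFERENCES partsupp (ps_partkey, ps_suppkey);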
>
> This is how the join estimate looks on scale 1 without the FK between the
> two tables:
>
>                                QUERY PLAN
> -----------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.43..605856.84 rows=6001117 width=117)
>    ->  Index Scan using partsupp_pkey on partsupp  (cost=0.42..61141.76 rows=800000 width=144)
> (4 rows)
>
>
> and with the foreign key:
>
>                                QUERY PLAN
> -----------------------------------------------------------------------
>  Merge Join  (cost=19.19..700980.12 rows=6001117 width=261) (actual rows=6001215 loops=1)
>    Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND (lineitem.l_suppkey = partsupp.ps_suppkey))
>    ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.43..605856.84 rows=6001117 width=117) (actual rows=6001215 loops=1)
>    ->  Index Scan using partsupp_pkey on partsupp  (cost=0.42..61141.76 rows=800000 width=144) (actual rows=6001672 loops=1)
>  Planning time: 3.840 ms
>  Execution time: 21987.913 ms
> (6 rows)
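
The two plans above appear to come from a standalone two-table join rather
than the full Q9, so presumably something along these lines reproduces them
(a guess at the query used, not taken from the thread; plain EXPLAIN for the
estimate-only plan, EXPLAIN ANALYZE for the one with actual row counts):

    EXPLAIN (ANALYZE)
    SELECT *
    FROM lineitem
    JOIN partsupp
      ON ps_partkey = l_partkey
     AND ps_suppkey = l_suppkey;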
>
>
>> On Q20, things go wrong here:
>>
>> [snip]
>>
>> The estimate for the GroupAggregate feeding one side of the merge join
>> is quite accurate. The estimate for the part-partsupp join on the
>> other side is off by 8x. Then things get much worse: the estimate for
>> the merge join is off by 400x.
>>
>>
> Well, most of the estimation error comes from the join, but sadly the
> aggregate makes it impossible to use the foreign keys - at least in the
> current version. I don't know whether that can be improved somehow.
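
To illustrate why: Q20 effectively joins part/partsupp against an aggregate
over lineitem, so the lineitem -> partsupp foreign key refers to a base
relation that is no longer a direct input of that join. A simplified sketch
of the shape (not the actual TPC-H query text, which uses correlated
subqueries):

    SELECT ps.ps_suppkey
    FROM partsupp ps
    JOIN part p ON p.p_partkey = ps.ps_partkey
    JOIN (SELECT l_partkey, l_suppkey,
                 0.5 * sum(l_quantity) AS half_qty
          FROM lineitem
          GROUP BY l_partkey, l_suppkey) agg
      ON agg.l_partkey = ps.ps_partkey
     AND agg.l_suppkey = ps.ps_suppkey
    WHERE ps.ps_availqty > agg.half_qty;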
>
>> I'm not really sure whether the multivariate statistics stuff will fix
>> this kind of case or not, but if it did it would be awesome.
>>
>>
> Join statistics are something I'd like to add eventually, but I don't see
> how it could happen in the first version. Also, the patch received no
> reviews this CF, and making it even larger is unlikely to make it more
> attractive.
>

Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia
