Re: multivariate statistics v14

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: jeff(dot)janes(at)gmail(dot)com, alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: multivariate statistics v14
Date: 2016-03-23 13:21:53
Message-ID: eb8522d2-f1ee-7401-2118-e7d6b8990483@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/23/2016 06:20 AM, Tatsuo Ishii wrote:
>>> I am now looking into the create statistics doc to see if the example
>>> appearing in it is working. I will get back if I find any.
>
> I have the ref doc: CREATE STATISTICS
>
> There are nice examples how the multivariate statistics gives better
> row number estimation. So I gave them a try.
>
> "Create table t1 with two functionally dependent columns,
> i.e. knowledge of a value in the first column is sufficient for
> determining the value in the other column" The example creates table
> "t1", then populates it using generate_series. After CREATE
> STATISTICS, ANALYZE and EXPLAIN. I expected the EXPLAIN demonstrates
> how result rows estimation is enhanced by using the multivariate
> statistics.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------
> Seq Scan on t1 (cost=0.00..19425.00 rows=98 width=8) (actual time=76.876..76.876 rows=0 loops=1)
> Filter: ((a = 1) AND (b = 1))
> Rows Removed by Filter: 1000000
> Planning time: 0.146 ms
> Execution time: 76.896 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 1);
> QUERY PLAN
> --------------------------------------------------------------------------------------------------
> Seq Scan on t1 (cost=0.00..19425.00 rows=1 width=8) (actual time=78.867..78.867 rows=0 loops=1)
> Filter: ((a = 1) AND (b = 1))
> Rows Removed by Filter: 1000000
> Planning time: 0.102 ms
> Execution time: 78.885 ms
> (5 rows)
>
> It seems the row numbers estimation (98) using the multivariate
> statistics is actually *worse* than the one (1) not using the
> statistics because the actual row number is 0.

Yes, there's a mistake in the first query, because the conditions
actually are not compatible. I.e. (i/100)=1 and (i/500)=1 have no
overlapping rows, clearly. It should be

EXPLAIN ANALYZE SELECT * FROM t1 WHERE (a = 1) AND (b = 0);

instead. Will fix.

>
> Next example (using table "t2") is much better than the case using t1.
>
> Here is the EXPLAIN output using the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------
> Seq Scan on t2 (cost=0.00..19425.00 rows=9633 width=8) (actual time=0.012..75.350 rows=10000 loops=1)
> Filter: ((a = 1) AND (b = 1))
> Rows Removed by Filter: 990000
> Planning time: 0.107 ms
> Execution time: 75.680 ms
> (5 rows)
>
> Here is the EXPLAIN output without the multivariate statistics:
>
> EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------
> Seq Scan on t2 (cost=0.00..19425.00 rows=91 width=8) (actual time=0.008..76.614 rows=10000 loops=1)
> Filter: ((a = 1) AND (b = 1))
> Rows Removed by Filter: 990000
> Planning time: 0.067 ms
> Execution time: 76.935 ms
> (5 rows)
>
> This time it seems the row numbers estimation (9633) using the
> multivariate statistics is much better than the one (91) not using the
> statistics because the actual row number is 10000.
>
> The last example (using table "t3") seems no effect by multivariate statistics.

Yes. There's a typo in the example - it analyzes the wrong table (t2
instead of t3). Once I fix that, the estimates are much better.

> In summary, the only case which shows the effect of the multivariate
> statistics is the "t2" case. So I don't see why other examples are
> shown in the manual. Am I missing something?

No, thanks for spotting those mistakes. I'll fix them and submit a new
version of the patch - either later today or perhaps tomorrow.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-03-23 14:01:04 Re: Bug in searching path in jsonb_set when walking through JSONB array
Previous Message Merlin Moncure 2016-03-23 13:21:48 Re: NOT EXIST for PREPARE