Estimation question...

Lists: pgsql-performance
From: Matt Daw <matt(at)shotgunsoftware(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Estimation question...
Date: 2013-02-26 19:35:45
Message-ID: CAA2LLOEy2n6swjgBRGp2DZvVtMg4AGYSyA2FKFd6FKxscv5rSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Howdy, the query generator in my app sometimes creates redundant
filters of the form:

project_id IN ( <list of projects user has permission to see> ) AND
project_id = <single project user is looking at >

... and this is leading to a bad estimate (and thus a bad plan) on a
few complex queries. I've included simplified examples below. This
server is running 9.0.10 and the statistics target has been updated to
1000 on the project_id column. I've also loaded the one table into a
9.2.2 instance and replicated the behaviour.

I can change how the query is being generated, but I'm curious why I'm
getting a bad estimate. Is this an expected result?

Thanks!

Matt

=============

1) Filter on project_id only, row estimate for Bitmap Index Scan quite good.

explain (analyze,buffers) select count(id) from versions WHERE project_id=115;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1218111.01..1218111.02 rows=1 width=4) (actual
time=1531.341..1531.342 rows=1 loops=1)
Buffers: shared hit=452619
-> Bitmap Heap Scan on versions (cost=34245.06..1215254.86
rows=1142461 width=4) (actual time=148.394..1453.383 rows=1114197
loops=1)
Recheck Cond: (project_id = 115)
Buffers: shared hit=452619
-> Bitmap Index Scan on versions_project_id
(cost=0.00..33959.45 rows=1142461 width=0) (actual
time=139.709..139.709 rows=1116037 loops=1)
Index Cond: (project_id = 115)
Buffers: shared hit=22077
Total runtime: 1531.399 ms

2) Filter on project_id IN () AND project_id. Row estimate is ~10x lower.

explain (analyze,buffers) select count(id) from versions WHERE
project_id IN (80,115) AND project_id=115;;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=327066.18..327066.19 rows=1 width=4) (actual
time=1637.889..1637.889 rows=1 loops=1)
Buffers: shared hit=458389
-> Bitmap Heap Scan on versions (cost=3546.56..326793.17
rows=109201 width=4) (actual time=155.107..1557.453 rows=1114180
loops=1)
Recheck Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
(project_id = 115))
Buffers: shared hit=458389
-> Bitmap Index Scan on versions_project_id
(cost=0.00..3519.26 rows=109201 width=0) (actual time=145.502..145.502
rows=1125436 loops=1)
Index Cond: ((project_id = ANY ('{80,115}'::integer[]))
AND (project_id = 115))
Buffers: shared hit=22076
Total runtime: 1637.941 ms


From: Matt Daw <matt(at)shotgunsoftware(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Estimation question...
Date: 2013-02-27 17:08:45
Message-ID: CAA2LLOEKOQF4RNRriPO+gpfEkksQQkp8TF7xqUm4V_CNqfhpDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Quick follow up... I've found that the row estimate in:

explain select count(id) from versions where project_id IN (80,115)
AND project_id=115;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Aggregate (cost=178572.75..178572.76 rows=1 width=4)
-> Index Scan using dneg_versions_project_id on versions
(cost=0.00..178306.94 rows=106323 width=4)
Index Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
(project_id = 115))

... is the sum of two other estimates, seen when rewriting the query
using OR instead of IN:

explain select count(id) from versions where (project_id = 80 OR
project_id = 115) AND project_id=115;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Aggregate (cost=305896.95..305896.96 rows=1 width=4)
-> Bitmap Heap Scan on versions (cost=2315.08..305632.00
rows=105980 width=4)
Recheck Cond: (((project_id = 80) AND (project_id = 115)) OR
((project_id = 115) AND (project_id = 115)))
-> BitmapOr (cost=2315.08..2315.08 rows=106323 width=0)
-> Bitmap Index Scan on dneg_versions_project_id
(cost=0.00..94.52 rows=3709 width=0)
Index Cond: ((project_id = 80) AND (project_id = 115))
-> Bitmap Index Scan on dneg_versions_project_id
(cost=0.00..2167.57 rows=102614 width=0)
Index Cond: ((project_id = 115) AND (project_id = 115))

106323 = 3709 + 102614

Looks like the underlying problem is that the estimate for
((project_id = 115) AND (project_id = 115)) doesn't end up being the
same as (project_id=115) on its own.

Matt

On Tue, Feb 26, 2013 at 11:35 AM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:
> Howdy, the query generator in my app sometimes creates redundant
> filters of the form:
>
> project_id IN ( <list of projects user has permission to see> ) AND
> project_id = <single project user is looking at >
>
> ... and this is leading to a bad estimate (and thus a bad plan) on a
> few complex queries. I've included simplified examples below. This
> server is running 9.0.10 and the statistics target has been updated to
> 1000 on the project_id column. I've also loaded the one table into a
> 9.2.2 instance and replicated the behaviour.
>
> I can change how the query is being generated, but I'm curious why I'm
> getting a bad estimate. Is this an expected result?
>
> Thanks!
>
> Matt
>
> =============
>
> 1) Filter on project_id only, row estimate for Bitmap Index Scan quite good.
>
> explain (analyze,buffers) select count(id) from versions WHERE project_id=115;
>
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=1218111.01..1218111.02 rows=1 width=4) (actual
> time=1531.341..1531.342 rows=1 loops=1)
> Buffers: shared hit=452619
> -> Bitmap Heap Scan on versions (cost=34245.06..1215254.86
> rows=1142461 width=4) (actual time=148.394..1453.383 rows=1114197
> loops=1)
> Recheck Cond: (project_id = 115)
> Buffers: shared hit=452619
> -> Bitmap Index Scan on versions_project_id
> (cost=0.00..33959.45 rows=1142461 width=0) (actual
> time=139.709..139.709 rows=1116037 loops=1)
> Index Cond: (project_id = 115)
> Buffers: shared hit=22077
> Total runtime: 1531.399 ms
>
> 2) Filter on project_id IN () AND project_id. Row estimate is ~10x lower.
>
> explain (analyze,buffers) select count(id) from versions WHERE
> project_id IN (80,115) AND project_id=115;;
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=327066.18..327066.19 rows=1 width=4) (actual
> time=1637.889..1637.889 rows=1 loops=1)
> Buffers: shared hit=458389
> -> Bitmap Heap Scan on versions (cost=3546.56..326793.17
> rows=109201 width=4) (actual time=155.107..1557.453 rows=1114180
> loops=1)
> Recheck Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
> (project_id = 115))
> Buffers: shared hit=458389
> -> Bitmap Index Scan on versions_project_id
> (cost=0.00..3519.26 rows=109201 width=0) (actual time=145.502..145.502
> rows=1125436 loops=1)
> Index Cond: ((project_id = ANY ('{80,115}'::integer[]))
> AND (project_id = 115))
> Buffers: shared hit=22076
> Total runtime: 1637.941 ms


From: Matt Daw <matt(at)shotgunsoftware(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Estimation question...
Date: 2013-02-28 16:31:45
Message-ID: CAA2LLOHeUzi_swG4dcgd0iimSU0yT0rgw0_M18sO0f3k2pJHMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

I get it now... project_id=115 has a frequency of 0.09241 in pg_stats.
So if ((project_id = 115) AND (project_id = 115)) is considered as two
independent conditions, the row estimate ends up being 0.09241 *
0.09241 * 1.20163e+07 (reltuples from pg_class) = 102614.

http://www.postgresql.org/docs/9.0/static/row-estimation-examples.html
was a big help.

Matt

On Wed, Feb 27, 2013 at 9:08 AM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:
> Quick follow up... I've found that the row estimate in:
>
> explain select count(id) from versions where project_id IN (80,115)
> AND project_id=115;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
> Aggregate (cost=178572.75..178572.76 rows=1 width=4)
> -> Index Scan using dneg_versions_project_id on versions
> (cost=0.00..178306.94 rows=106323 width=4)
> Index Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
> (project_id = 115))
>
>
> ... is the sum of two other estimates, seen when rewriting the query
> using OR instead of IN:
>
>
> explain select count(id) from versions where (project_id = 80 OR
> project_id = 115) AND project_id=115;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=305896.95..305896.96 rows=1 width=4)
> -> Bitmap Heap Scan on versions (cost=2315.08..305632.00
> rows=105980 width=4)
> Recheck Cond: (((project_id = 80) AND (project_id = 115)) OR
> ((project_id = 115) AND (project_id = 115)))
> -> BitmapOr (cost=2315.08..2315.08 rows=106323 width=0)
> -> Bitmap Index Scan on dneg_versions_project_id
> (cost=0.00..94.52 rows=3709 width=0)
> Index Cond: ((project_id = 80) AND (project_id = 115))
> -> Bitmap Index Scan on dneg_versions_project_id
> (cost=0.00..2167.57 rows=102614 width=0)
> Index Cond: ((project_id = 115) AND (project_id = 115))
>
> 106323 = 3709 + 102614
>
> Looks like the underlying problem is that the estimate for
> ((project_id = 115) AND (project_id = 115)) doesn't end up being the
> same as (project_id=115) on its own.
>
> Matt
>
> On Tue, Feb 26, 2013 at 11:35 AM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:
>> Howdy, the query generator in my app sometimes creates redundant
>> filters of the form:
>>
>> project_id IN ( <list of projects user has permission to see> ) AND
>> project_id = <single project user is looking at >
>>
>> ... and this is leading to a bad estimate (and thus a bad plan) on a
>> few complex queries. I've included simplified examples below. This
>> server is running 9.0.10 and the statistics target has been updated to
>> 1000 on the project_id column. I've also loaded the one table into a
>> 9.2.2 instance and replicated the behaviour.
>>
>> I can change how the query is being generated, but I'm curious why I'm
>> getting a bad estimate. Is this an expected result?
>>
>> Thanks!
>>
>> Matt
>>
>> =============
>>
>> 1) Filter on project_id only, row estimate for Bitmap Index Scan quite good.
>>
>> explain (analyze,buffers) select count(id) from versions WHERE project_id=115;
>>
>> QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>> Aggregate (cost=1218111.01..1218111.02 rows=1 width=4) (actual
>> time=1531.341..1531.342 rows=1 loops=1)
>> Buffers: shared hit=452619
>> -> Bitmap Heap Scan on versions (cost=34245.06..1215254.86
>> rows=1142461 width=4) (actual time=148.394..1453.383 rows=1114197
>> loops=1)
>> Recheck Cond: (project_id = 115)
>> Buffers: shared hit=452619
>> -> Bitmap Index Scan on versions_project_id
>> (cost=0.00..33959.45 rows=1142461 width=0) (actual
>> time=139.709..139.709 rows=1116037 loops=1)
>> Index Cond: (project_id = 115)
>> Buffers: shared hit=22077
>> Total runtime: 1531.399 ms
>>
>> 2) Filter on project_id IN () AND project_id. Row estimate is ~10x lower.
>>
>> explain (analyze,buffers) select count(id) from versions WHERE
>> project_id IN (80,115) AND project_id=115;;
>>
>> QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>> Aggregate (cost=327066.18..327066.19 rows=1 width=4) (actual
>> time=1637.889..1637.889 rows=1 loops=1)
>> Buffers: shared hit=458389
>> -> Bitmap Heap Scan on versions (cost=3546.56..326793.17
>> rows=109201 width=4) (actual time=155.107..1557.453 rows=1114180
>> loops=1)
>> Recheck Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
>> (project_id = 115))
>> Buffers: shared hit=458389
>> -> Bitmap Index Scan on versions_project_id
>> (cost=0.00..3519.26 rows=109201 width=0) (actual time=145.502..145.502
>> rows=1125436 loops=1)
>> Index Cond: ((project_id = ANY ('{80,115}'::integer[]))
>> AND (project_id = 115))
>> Buffers: shared hit=22076
>> Total runtime: 1637.941 ms

On Wed, Feb 27, 2013 at 9:08 AM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:
> Quick follow up... I've found that the row estimate in:
>
> explain select count(id) from versions where project_id IN (80,115)
> AND project_id=115;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
> Aggregate (cost=178572.75..178572.76 rows=1 width=4)
> -> Index Scan using dneg_versions_project_id on versions
> (cost=0.00..178306.94 rows=106323 width=4)
> Index Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
> (project_id = 115))
>
>
> ... is the sum of two other estimates, seen when rewriting the query
> using OR instead of IN:
>
>
> explain select count(id) from versions where (project_id = 80 OR
> project_id = 115) AND project_id=115;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=305896.95..305896.96 rows=1 width=4)
> -> Bitmap Heap Scan on versions (cost=2315.08..305632.00
> rows=105980 width=4)
> Recheck Cond: (((project_id = 80) AND (project_id = 115)) OR
> ((project_id = 115) AND (project_id = 115)))
> -> BitmapOr (cost=2315.08..2315.08 rows=106323 width=0)
> -> Bitmap Index Scan on dneg_versions_project_id
> (cost=0.00..94.52 rows=3709 width=0)
> Index Cond: ((project_id = 80) AND (project_id = 115))
> -> Bitmap Index Scan on dneg_versions_project_id
> (cost=0.00..2167.57 rows=102614 width=0)
> Index Cond: ((project_id = 115) AND (project_id = 115))
>
> 106323 = 3709 + 102614
>
> Looks like the underlying problem is that the estimate for
> ((project_id = 115) AND (project_id = 115)) doesn't end up being the
> same as (project_id=115) on its own.
>
> Matt
>
> On Tue, Feb 26, 2013 at 11:35 AM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:
>> Howdy, the query generator in my app sometimes creates redundant
>> filters of the form:
>>
>> project_id IN ( <list of projects user has permission to see> ) AND
>> project_id = <single project user is looking at >
>>
>> ... and this is leading to a bad estimate (and thus a bad plan) on a
>> few complex queries. I've included simplified examples below. This
>> server is running 9.0.10 and the statistics target has been updated to
>> 1000 on the project_id column. I've also loaded the one table into a
>> 9.2.2 instance and replicated the behaviour.
>>
>> I can change how the query is being generated, but I'm curious why I'm
>> getting a bad estimate. Is this an expected result?
>>
>> Thanks!
>>
>> Matt
>>
>> =============
>>
>> 1) Filter on project_id only, row estimate for Bitmap Index Scan quite good.
>>
>> explain (analyze,buffers) select count(id) from versions WHERE project_id=115;
>>
>> QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>> Aggregate (cost=1218111.01..1218111.02 rows=1 width=4) (actual
>> time=1531.341..1531.342 rows=1 loops=1)
>> Buffers: shared hit=452619
>> -> Bitmap Heap Scan on versions (cost=34245.06..1215254.86
>> rows=1142461 width=4) (actual time=148.394..1453.383 rows=1114197
>> loops=1)
>> Recheck Cond: (project_id = 115)
>> Buffers: shared hit=452619
>> -> Bitmap Index Scan on versions_project_id
>> (cost=0.00..33959.45 rows=1142461 width=0) (actual
>> time=139.709..139.709 rows=1116037 loops=1)
>> Index Cond: (project_id = 115)
>> Buffers: shared hit=22077
>> Total runtime: 1531.399 ms
>>
>> 2) Filter on project_id IN () AND project_id. Row estimate is ~10x lower.
>>
>> explain (analyze,buffers) select count(id) from versions WHERE
>> project_id IN (80,115) AND project_id=115;;
>>
>> QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>> Aggregate (cost=327066.18..327066.19 rows=1 width=4) (actual
>> time=1637.889..1637.889 rows=1 loops=1)
>> Buffers: shared hit=458389
>> -> Bitmap Heap Scan on versions (cost=3546.56..326793.17
>> rows=109201 width=4) (actual time=155.107..1557.453 rows=1114180
>> loops=1)
>> Recheck Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
>> (project_id = 115))
>> Buffers: shared hit=458389
>> -> Bitmap Index Scan on versions_project_id
>> (cost=0.00..3519.26 rows=109201 width=0) (actual time=145.502..145.502
>> rows=1125436 loops=1)
>> Index Cond: ((project_id = ANY ('{80,115}'::integer[]))
>> AND (project_id = 115))
>> Buffers: shared hit=22076
>> Total runtime: 1637.941 ms

On Wed, Feb 27, 2013 at 9:08 AM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:
> Quick follow up... I've found that the row estimate in:
>
> explain select count(id) from versions where project_id IN (80,115)
> AND project_id=115;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
> Aggregate (cost=178572.75..178572.76 rows=1 width=4)
> -> Index Scan using dneg_versions_project_id on versions
> (cost=0.00..178306.94 rows=106323 width=4)
> Index Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
> (project_id = 115))
>
>
> ... is the sum of two other estimates, seen when rewriting the query
> using OR instead of IN:
>
>
> explain select count(id) from versions where (project_id = 80 OR
> project_id = 115) AND project_id=115;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=305896.95..305896.96 rows=1 width=4)
> -> Bitmap Heap Scan on versions (cost=2315.08..305632.00
> rows=105980 width=4)
> Recheck Cond: (((project_id = 80) AND (project_id = 115)) OR
> ((project_id = 115) AND (project_id = 115)))
> -> BitmapOr (cost=2315.08..2315.08 rows=106323 width=0)
> -> Bitmap Index Scan on dneg_versions_project_id
> (cost=0.00..94.52 rows=3709 width=0)
> Index Cond: ((project_id = 80) AND (project_id = 115))
> -> Bitmap Index Scan on dneg_versions_project_id
> (cost=0.00..2167.57 rows=102614 width=0)
> Index Cond: ((project_id = 115) AND (project_id = 115))
>
> 106323 = 3709 + 102614
>
> Looks like the underlying problem is that the estimate for
> ((project_id = 115) AND (project_id = 115)) doesn't end up being the
> same as (project_id=115) on its own.
>
> Matt
>
> On Tue, Feb 26, 2013 at 11:35 AM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:
>> Howdy, the query generator in my app sometimes creates redundant
>> filters of the form:
>>
>> project_id IN ( <list of projects user has permission to see> ) AND
>> project_id = <single project user is looking at >
>>
>> ... and this is leading to a bad estimate (and thus a bad plan) on a
>> few complex queries. I've included simplified examples below. This
>> server is running 9.0.10 and the statistics target has been updated to
>> 1000 on the project_id column. I've also loaded the one table into a
>> 9.2.2 instance and replicated the behaviour.
>>
>> I can change how the query is being generated, but I'm curious why I'm
>> getting a bad estimate. Is this an expected result?
>>
>> Thanks!
>>
>> Matt
>>
>> =============
>>
>> 1) Filter on project_id only, row estimate for Bitmap Index Scan quite good.
>>
>> explain (analyze,buffers) select count(id) from versions WHERE project_id=115;
>>
>> QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>> Aggregate (cost=1218111.01..1218111.02 rows=1 width=4) (actual
>> time=1531.341..1531.342 rows=1 loops=1)
>> Buffers: shared hit=452619
>> -> Bitmap Heap Scan on versions (cost=34245.06..1215254.86
>> rows=1142461 width=4) (actual time=148.394..1453.383 rows=1114197
>> loops=1)
>> Recheck Cond: (project_id = 115)
>> Buffers: shared hit=452619
>> -> Bitmap Index Scan on versions_project_id
>> (cost=0.00..33959.45 rows=1142461 width=0) (actual
>> time=139.709..139.709 rows=1116037 loops=1)
>> Index Cond: (project_id = 115)
>> Buffers: shared hit=22077
>> Total runtime: 1531.399 ms
>>
>> 2) Filter on project_id IN () AND project_id. Row estimate is ~10x lower.
>>
>> explain (analyze,buffers) select count(id) from versions WHERE
>> project_id IN (80,115) AND project_id=115;;
>>
>> QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>> Aggregate (cost=327066.18..327066.19 rows=1 width=4) (actual
>> time=1637.889..1637.889 rows=1 loops=1)
>> Buffers: shared hit=458389
>> -> Bitmap Heap Scan on versions (cost=3546.56..326793.17
>> rows=109201 width=4) (actual time=155.107..1557.453 rows=1114180
>> loops=1)
>> Recheck Cond: ((project_id = ANY ('{80,115}'::integer[])) AND
>> (project_id = 115))
>> Buffers: shared hit=458389
>> -> Bitmap Index Scan on versions_project_id
>> (cost=0.00..3519.26 rows=109201 width=0) (actual time=145.502..145.502
>> rows=1125436 loops=1)
>> Index Cond: ((project_id = ANY ('{80,115}'::integer[]))
>> AND (project_id = 115))
>> Buffers: shared hit=22076
>> Total runtime: 1637.941 ms