Query plan, nested EXISTS

Lists: pgsql-performance
From: Matt Daw <matt(at)shotgunsoftware(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Query plan, nested EXISTS
Date: 2012-09-28 21:04:04
Message-ID: CAA2LLOFzQDqnpvxxf1EAUzKXrJx5VUN3P2jgon3zTNzdtpGZQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Howdy, I've been debugging a client's slow query today and I'm curious
about the query plan. It's picking a plan that hashes lots of rows from the
versions table (on v9.0.10)...

EXPLAIN ANALYZE
SELECT COUNT(*) FROM notes a WHERE
a.project_id = 114 AND
EXISTS (
SELECT 1 FROM note_links b
WHERE
b.note_id = a.id AND
b.entity_type = 'Version' AND
EXISTS (
SELECT 1 FROM versions c
WHERE
c.id = b.entity_id AND
c.code ILIKE '%comp%' AND
c.retirement_date IS NULL
) AND
b.retirement_date IS NULL
)

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=833177.30..833177.31 rows=1 width=0) (actual
time=10806.416..10806.416 rows=1 loops=1)
-> Hash Semi Join (cost=747004.15..833154.86 rows=8977 width=0)
(actual time=10709.343..10806.344 rows=894 loops=1)
Hash Cond: (a.id = b.note_id)
-> Index Scan using notes_retirement_date_project on notes a
(cost=0.00..66725.10 rows=12469 width=4) (actual time=12.213..71.199
rows=12469 loops=1)
Index Cond: (project_id = 114)
-> Hash (cost=723749.35..723749.35 rows=1417424 width=4) (actual
time=10696.192..10696.192 rows=227261 loops=1)
Buckets: 65536 Batches: 4 Memory Usage: 2016kB
-> Hash Semi Join (cost=620007.75..723749.35 rows=1417424
width=4) (actual time=8953.460..10645.714 rows=227261 loops=1)
Hash Cond: (b.entity_id = c.id)
-> Seq Scan on note_links b (cost=0.00..71849.56
rows=1417424 width=8) (actual time=0.075..628.183 rows=1509795 loops=1)
Filter: ((retirement_date IS NULL) AND
((entity_type)::text = 'Version'::text))
-> Hash (cost=616863.62..616863.62 rows=251530
width=4) (actual time=8953.327..8953.327 rows=300115 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 10551kB
-> Seq Scan on versions c
(cost=0.00..616863.62 rows=251530 width=4) (actual time=176.590..8873.588
rows=300115 loops=1)
Filter: ((retirement_date IS NULL) AND
((code)::text ~~* '%comp%'::text))
Total runtime: 10810.479 ms
(16 rows)

However, I can trick it into a better plan by adding LIMIT 1 into the inner
EXISTS:

EXPLAIN ANALYZE
SELECT COUNT(*) FROM notes a WHERE
a.project_id = 114 AND
EXISTS (
SELECT 1 FROM note_links b
WHERE
b.note_id = a.id AND
b.entity_type = 'Version' AND
EXISTS (
SELECT 1 FROM versions c
WHERE
c.id = b.entity_id AND
c.code ILIKE '%comp%' AND
c.retirement_date IS NULL
LIMIT 1
) AND
b.retirement_date IS NULL
)

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=372820.37..372820.38 rows=1 width=0) (actual
time=139.430..139.430 rows=1 loops=1)
-> Nested Loop Semi Join (cost=0.00..372809.15 rows=4488 width=0)
(actual time=9.735..139.333 rows=894 loops=1)
-> Index Scan using notes_retirement_date_project on notes a
(cost=0.00..66725.10 rows=12469 width=4) (actual time=9.699..67.263
rows=12469 loops=1)
Index Cond: (project_id = 114)
-> Index Scan using note_links_note on note_links b
(cost=0.00..24.54 rows=1 width=4) (actual time=0.006..0.006 rows=0
loops=12469)
Index Cond: (b.note_id = a.id)
Filter: ((b.retirement_date IS NULL) AND
((b.entity_type)::text = 'Version'::text) AND (SubPlan 1))
SubPlan 1
-> Limit (cost=0.00..9.04 rows=1 width=0) (actual
time=0.003..0.003 rows=0 loops=11794)
-> Index Scan using versions_pkey on versions c
(cost=0.00..9.04 rows=1 width=0) (actual time=0.003..0.003 rows=0
loops=11794)
Index Cond: (id = $0)
Filter: ((retirement_date IS NULL) AND
((code)::text ~~* '%comp%'::text))
Total runtime: 139.465 ms
(13 rows)

Unfortunately, a couple other queries I tested got slower by adding the
LIMIT so I don't think that's going to be a good workaround. It doesn't
appear to be related to ILIKE, because I tried a straight equals against
another un-indexed column of versions and still get a slow plan (and adding
the LIMIT to this one made it fast too):

EXPLAIN ANALYZE
SELECT COUNT(*) FROM notes a WHERE
a.project_id = 114 AND
EXISTS (
SELECT 1 FROM note_links b
WHERE
b.note_id = a.id AND
b.entity_type = 'Version' AND
EXISTS (
SELECT 1 FROM versions c
WHERE
c.id = b.entity_id AND
c.sg_status_list = 'ip' AND
c.retirement_date IS NULL
) AND
b.retirement_date IS NULL
)

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=821544.18..821544.19 rows=1 width=0) (actual
time=5046.492..5046.492 rows=1 loops=1)
-> Hash Semi Join (cost=735371.03..821521.73 rows=8977 width=0)
(actual time=4941.968..5045.968 rows=7116 loops=1)
Hash Cond: (a.id = b.note_id)
-> Index Scan using notes_retirement_date_project on notes a
(cost=0.00..66725.10 rows=12469 width=4) (actual time=9.639..68.751
rows=12469 loops=1)
Index Cond: (project_id = 114)
-> Hash (cost=712116.23..712116.23 rows=1417424 width=4) (actual
time=4931.956..4931.956 rows=297401 loops=1)
Buckets: 65536 Batches: 4 Memory Usage: 2633kB
-> Hash Join (cost=620484.32..712116.23 rows=1417424
width=4) (actual time=3362.472..4864.816 rows=297401 loops=1)
Hash Cond: (b.entity_id = c.id)
-> Seq Scan on note_links b (cost=0.00..71849.56
rows=1417424 width=8) (actual time=0.079..622.277 rows=1509795 loops=1)
Filter: ((retirement_date IS NULL) AND
((entity_type)::text = 'Version'::text))
-> Hash (cost=618673.97..618673.97 rows=144828
width=4) (actual time=3362.337..3362.337 rows=155834 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 5479kB
-> HashAggregate (cost=617225.69..618673.97
rows=144828 width=4) (actual time=3289.861..3335.344 rows=155834 loops=1)
-> Seq Scan on versions c
(cost=0.00..616863.62 rows=144828 width=4) (actual time=217.080..3133.870
rows=155834 loops=1)
Filter: ((retirement_date IS NULL)
AND ((sg_status_list)::text = 'ip'::text))
Total runtime: 5051.414 ms
(17 rows)

Does anything come to mind that would help me debug why this plan is being
chosen? Thanks!

Matt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Matt Daw <matt(at)shotgunsoftware(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query plan, nested EXISTS
Date: 2012-09-28 21:44:32
Message-ID: 16018.1348868672@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Matt Daw <matt(at)shotgunsoftware(dot)com> writes:
> Howdy, I've been debugging a client's slow query today and I'm curious
> about the query plan. It's picking a plan that hashes lots of rows from the
> versions table (on v9.0.10)...

> EXPLAIN ANALYZE
> SELECT COUNT(*) FROM notes a WHERE
> a.project_id = 114 AND
> EXISTS (
> SELECT 1 FROM note_links b
> WHERE
> b.note_id = a.id AND
> b.entity_type = 'Version' AND
> EXISTS (
> SELECT 1 FROM versions c
> WHERE
> c.id = b.entity_id AND
> c.code ILIKE '%comp%' AND
> c.retirement_date IS NULL
> ) AND
> b.retirement_date IS NULL
> )

I think the real problem here is that 9.0 is incapable of avoiding a
full table scan on "note_links", which means it doesn't really have any
better option than to do the inner EXISTS as a full-table semijoin.
This is because it can't push a.id down through two levels of join, and
because the semijoins don't commute, there's no way to get a.id into the
scan of note_links to pull out only the useful rows. The hack with
LIMIT avoids this problem by preventing the inner EXISTS from being
treated as a full-fledged semijoin; but of course that hack leaves you
vulnerable to very bad plans if the statistics are such that a nestloop
join isn't the best bet for the inner EXISTS.

The work I did for parameterized paths in 9.2 was intended to address
exactly this type of scenario. I would be interested to know if 9.2
does this any better for you.

regards, tom lane


From: Matt Daw <matt(at)shotgunsoftware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query plan, nested EXISTS
Date: 2012-09-28 21:47:47
Message-ID: CAA2LLOE6Wk1f_KRHiHQQj7kONXjBbG5U4f=Dwa=wU3mjj0SQqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi Tom, thank you very much. I'll load these tables onto a 9.2 instance and
report back.

Matt

On Fri, Sep 28, 2012 at 2:44 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Matt Daw <matt(at)shotgunsoftware(dot)com> writes:
> > Howdy, I've been debugging a client's slow query today and I'm curious
> > about the query plan. It's picking a plan that hashes lots of rows from
> the
> > versions table (on v9.0.10)...
>
> > EXPLAIN ANALYZE
> > SELECT COUNT(*) FROM notes a WHERE
> > a.project_id = 114 AND
> > EXISTS (
> > SELECT 1 FROM note_links b
> > WHERE
> > b.note_id = a.id AND
> > b.entity_type = 'Version' AND
> > EXISTS (
> > SELECT 1 FROM versions c
> > WHERE
> > c.id = b.entity_id AND
> > c.code ILIKE '%comp%' AND
> > c.retirement_date IS NULL
> > ) AND
> > b.retirement_date IS NULL
> > )
>
> I think the real problem here is that 9.0 is incapable of avoiding a
> full table scan on "note_links", which means it doesn't really have any
> better option than to do the inner EXISTS as a full-table semijoin.
> This is because it can't push a.id down through two levels of join, and
> because the semijoins don't commute, there's no way to get a.id into the
> scan of note_links to pull out only the useful rows. The hack with
> LIMIT avoids this problem by preventing the inner EXISTS from being
> treated as a full-fledged semijoin; but of course that hack leaves you
> vulnerable to very bad plans if the statistics are such that a nestloop
> join isn't the best bet for the inner EXISTS.
>
> The work I did for parameterized paths in 9.2 was intended to address
> exactly this type of scenario. I would be interested to know if 9.2
> does this any better for you.
>
> regards, tom lane
>


From: Matt Daw <matt(at)shotgunsoftware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query plan, nested EXISTS
Date: 2012-09-28 22:56:10
Message-ID: CAA2LLOFkL2VepKAUsRUirfgJeT5P=SMUNntTOd194cR_1xooMA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi Tom, v9.2.1 looks good!

Aggregate (cost=420808.99..420809.00 rows=1 width=0) (actual
time=147.345..147.345 rows=1 loops=1)
-> Nested Loop Semi Join (cost=0.00..420786.71 rows=8914 width=0)
(actual time=13.847..147.219 rows=894 loops=1)
-> Index Scan using notes_retirement_date_project on notes a
(cost=0.00..67959.22 rows=12535 width=4) (actual time=13.811..71.741
rows=12469 loops=1)
Index Cond: (project_id = 114)
-> Nested Loop Semi Join (cost=0.00..28.14 rows=1 width=4)
(actual time=0.006..0.006 rows=0 loops=12469)
-> Index Scan using note_links_note on note_links b
(cost=0.00..12.37 rows=1 width=8) (actual time=0.002..0.002 rows=1
loops=12469)
Index Cond: (note_id = a.id)
Filter: ((retirement_date IS NULL) AND
((entity_type)::text = 'Version'::text))
Rows Removed by Filter: 1
-> Index Scan using versions_pkey on versions c
(cost=0.00..15.76 rows=1 width=4) (actual time=0.003..0.003 rows=0
loops=11794)
Index Cond: (id = b.entity_id)
Filter: ((retirement_date IS NULL) AND ((code)::text
~~* '%comp%'::text))
Rows Removed by Filter: 1
Total runtime: 147.411 ms
(14 rows)

On Fri, Sep 28, 2012 at 2:47 PM, Matt Daw <matt(at)shotgunsoftware(dot)com> wrote:

> Hi Tom, thank you very much. I'll load these tables onto a 9.2 instance
> and report back.
>
> Matt
>
>
> On Fri, Sep 28, 2012 at 2:44 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Matt Daw <matt(at)shotgunsoftware(dot)com> writes:
>> > Howdy, I've been debugging a client's slow query today and I'm curious
>> > about the query plan. It's picking a plan that hashes lots of rows from
>> the
>> > versions table (on v9.0.10)...
>>
>> > EXPLAIN ANALYZE
>> > SELECT COUNT(*) FROM notes a WHERE
>> > a.project_id = 114 AND
>> > EXISTS (
>> > SELECT 1 FROM note_links b
>> > WHERE
>> > b.note_id = a.id AND
>> > b.entity_type = 'Version' AND
>> > EXISTS (
>> > SELECT 1 FROM versions c
>> > WHERE
>> > c.id = b.entity_id AND
>> > c.code ILIKE '%comp%' AND
>> > c.retirement_date IS NULL
>> > ) AND
>> > b.retirement_date IS NULL
>> > )
>>
>> I think the real problem here is that 9.0 is incapable of avoiding a
>> full table scan on "note_links", which means it doesn't really have any
>> better option than to do the inner EXISTS as a full-table semijoin.
>> This is because it can't push a.id down through two levels of join, and
>> because the semijoins don't commute, there's no way to get a.id into the
>> scan of note_links to pull out only the useful rows. The hack with
>> LIMIT avoids this problem by preventing the inner EXISTS from being
>> treated as a full-fledged semijoin; but of course that hack leaves you
>> vulnerable to very bad plans if the statistics are such that a nestloop
>> join isn't the best bet for the inner EXISTS.
>>
>> The work I did for parameterized paths in 9.2 was intended to address
>> exactly this type of scenario. I would be interested to know if 9.2
>> does this any better for you.
>>
>> regards, tom lane
>>
>
>